LinkedIn Datahub Application Architecture Quick Understanding

Liangjun Jiang
Apr 18, 2020

Lately I have been looking into LinkedIn's Datahub, an open source project on GitHub. It is a generalized metadata search and discovery tool, and it has gained some momentum over the past year.


Its documentation has a nice diagram showing the architecture:

https://github.com/linkedin/datahub/blob/master/docs/imgs/datahub-architecture.png

Well, to me, it is still not that clear. For example, what is the Metadata Store? What is the difference between MAE and MCE? After reading more, I came up with the following sequence diagram to help myself understand. I actually like sequence diagrams; they convey more about how the pieces interact.

LinkedIn Datahub Sequence Diagram
  1. An ETL script discovers the metadata of a data source and publishes it, in Avro format, to the MetadataChangeEvent Kafka topic.
  2. A MetadataChangeEvent (MCE) processor pulls the Avro data, validates an identifier called a URN, and sends the metadata to the Generalized Metadata Service (GMS). GMS persists the metadata to MySQL.
  3. GMS also checks whether a previous version of the received metadata exists, and publishes the difference to Kafka's MetadataAuditEvent topic.
  4. The MetadataAuditEvent (MAE) processor pulls the MetadataAuditEvent messages and persists them to Neo4j and Elasticsearch (ES).
  5. The Datahub frontend talks to the metadata services (a set of APIs) exposed by GMS. The list of APIs can be found on GMS's page.
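
To make the flow above more concrete, here is a toy, in-memory sketch in Python. It is not Datahub code. The two Kafka topics are plain queues, and MySQL, Neo4j and Elasticsearch are dictionaries. Every function and class name is made up for illustration. The URN check is a rough prefix test, and I assume that unchanged metadata produces no audit event. The audit event carries both the old and the new value, which is one simple way to represent the difference.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Stand-ins for the real infrastructure (all hypothetical, in-memory only).
mce_topic = deque()   # MetadataChangeEvent Kafka topic
mae_topic = deque()   # MetadataAuditEvent Kafka topic
metadata_store = {}   # MySQL: urn -> latest metadata
graph_index = {}      # Neo4j
search_index = {}     # Elasticsearch


@dataclass
class MetadataChangeEvent:
    urn: str
    metadata: dict


@dataclass
class MetadataAuditEvent:
    urn: str
    old_metadata: Optional[dict]  # None when there was no previous version
    new_metadata: dict


def etl_publish(urn, metadata):
    """Step 1: an ETL script publishes discovered metadata as an MCE."""
    mce_topic.append(MetadataChangeEvent(urn, metadata))


def is_valid_urn(urn):
    """Rough check only; this sketch just tests the prefix."""
    return urn.startswith("urn:li:")


def gms_ingest(event):
    """Steps 2-3: GMS persists the metadata and emits an MAE if it changed."""
    old = metadata_store.get(event.urn)
    if old == event.metadata:
        return  # assumption: nothing changed, so nothing to audit
    metadata_store[event.urn] = event.metadata
    mae_topic.append(MetadataAuditEvent(event.urn, old, event.metadata))


def run_mce_processor():
    """Step 2: consume MCEs, validate the URN, forward valid ones to GMS."""
    while mce_topic:
        event = mce_topic.popleft()
        if is_valid_urn(event.urn):
            gms_ingest(event)


def run_mae_processor():
    """Step 4: consume MAEs and update the graph and search indexes."""
    while mae_topic:
        event = mae_topic.popleft()
        graph_index[event.urn] = event.new_metadata
        search_index[event.urn] = event.new_metadata


def gms_get(urn):
    """Step 5: the frontend reads metadata through a GMS API."""
    return metadata_store.get(urn)


urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.orders,PROD)"
etl_publish(urn, {"owner": "alice"})
etl_publish("not-a-urn", {"owner": "bob"})  # dropped by the URN check
run_mce_processor()
run_mae_processor()
print(gms_get(urn))
```

The point of the sketch is the split between the two event types: an MCE is a request to change metadata that goes into GMS, while an MAE is a record of a change that GMS has already committed, which the downstream indexes consume.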

I hope this helps a little if you come across this project and are trying to understand how metadata flows across its different systems.
