Deploy Open Source Datahub
I have written a few blog posts about the open-source project Datahub over the past years (mainly in 2020). Since then, Datahub has evolved quite a lot, and some of my past writing is definitely outdated.
If you are not familiar with Datahub, it is an open-source metadata platform for the modern data stack.
Today I will focus on deploying this project, specifically without using Kubernetes.
Datahub’s Architecture
Looking at the code base, Datahub itself consists of five services:
- Datahub UI — a React.js app
- Datahub Frontend — a Java app built on the Play Framework
- Datahub GMS — a Java backend app
- Datahub Metadata Change Event (MCE) Consumer — a Kafka consumer app
- Datahub Metadata Audit Event (MAE) Consumer — a Kafka consumer app
Once you reach the deployment stage, the services can be simplified as follows:
- the UI and Frontend ship as one Docker image: datahub-frontend
- the GMS backend, with the MAE and MCE consumers embedded
Of course, you can also deploy the MAE and MCE consumers separately, which gives you four services:
- Datahub frontend
- Datahub GMS
- Datahub MCE
- Datahub MAE
Should you deploy MCE and MAE embedded in GMS, or separately? It really depends on how much metadata you expect to ingest into your Datahub, say, per hour. In general, Datahub only stores metadata, so you should not expect too much volume; embedding the consumers in GMS is usually enough.
Datahub also needs the following external services:
- Elasticsearch
- A graph database (you can choose Elasticsearch, Neo4j, or Dgraph)
- A relational database (MySQL, PostgreSQL, etc.)
- Kafka
Once again, if you choose Elasticsearch as your graph database as well, you only need these three services:
- Elasticsearch
- MySQL
- Kafka
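If those services are provisioned for you, a quick connectivity check before deploying anything saves a lot of debugging later. A minimal sketch; every hostname and port below is a placeholder for whatever your environment actually uses:

```shell
# Hypothetical endpoints -- substitute the addresses your infra team provides.
curl -s http://elasticsearch.internal:9200/_cluster/health   # Elasticsearch up?
mysql -h mysql.internal -P 3306 -u datahub -p -e 'SELECT 1'  # MySQL reachable?
kafka-broker-api-versions.sh --bootstrap-server kafka.internal:9092 >/dev/null \
  && echo 'kafka ok'                                         # Kafka reachable?
```

If SSL is enabled on any of these (common with managed services), the same checks apply with https:// and the appropriate client flags.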
The Deployment Preparation
Needless to say, Datahub publishes pre-built Docker images, many docker compose files, and various scripts so you can get started quickly.
For example, you can run this compose file to start the services using Elasticsearch as the graph database:
docker compose -f docker-compose-without-neo4j.yml up
This YAML file uses the GMS Docker env file, in which the MCE and MAE consumers are enabled inside GMS:
MAE_CONSUMER_ENABLED=true
MCE_CONSUMER_ENABLED=true
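Once the compose stack is up, you can sanity-check it from the host. The ports below assume the default compose mappings (frontend on 9002, GMS on 8080); adjust if yours differ:

```shell
docker compose -f docker-compose-without-neo4j.yml ps   # every service should show "Up"
curl -s http://localhost:8080/config                    # GMS answers with its config JSON
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9002   # frontend should return 200
```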
Now you have the minimum number of services.
The Deployment
The Datahub project does have Helm charts to deploy all services on Kubernetes. You can use the AWS/GCP/Azure managed Kubernetes services to get it going quickly.
In a real-world environment, your company probably has dedicated teams managing services such as Elasticsearch, Kafka, and relational databases. For example, your company's cloud infra team may have provisioned those services on AWS, Azure, or GCP. They will give you information such as the Elasticsearch host address, whether SSL is enabled, the Kafka cluster details (it might be Confluent Kafka or open-source Kafka), and the MySQL databases.
Most likely, your responsibility will be to make sure your Datahub-Frontend and Datahub-GMS services can connect to those provisioned MySQL, Kafka, and Elasticsearch instances.
Of course, you also have to do some setup in MySQL, Kafka, and Elasticsearch.
Yes, Datahub provides dedicated setup Docker images for this purpose. But again, I don't think you have to run a Docker image to do that, even if your cloud infra team allows you to.
MySQL Preparation
You need to create a database, create some table schemas under this database, and seed a user. This link is the init SQL.
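A minimal sketch of that preparation using the mysql client. The host, credentials, and database name are placeholders, and the authoritative DDL is the init SQL linked above:

```shell
# Placeholders: host, user, password. The real table schemas come from the init SQL.
mysql -h mysql.internal -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS datahub CHARACTER SET utf8mb4;
CREATE USER IF NOT EXISTS 'datahub'@'%' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON datahub.* TO 'datahub'@'%';
FLUSH PRIVILEGES;
SQL
# init.sql here stands for the script linked above; it creates the aspect tables.
mysql -h mysql.internal -u datahub -p datahub < init.sql
```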
Elasticsearch (ES) Preparation
ES preparation is much more complicated. If you look at create-indices.sh, you will notice that a few indexes, such as datahub_usage_event, can be created with this script. But for the most important entities, the indexes are created (and updated) on the fly. That is, the gms service will ask your ES whether those indexes exist; if not, it creates them, and if they are out of date, it updates them.
To give you an idea, here is the list of indexes; it is getting long.
green open datajobindex_v2_1641502047059 f0kNpMD0RQaqzFJH89i97A 1 1 0 0 566b 283b
green open datajob_datahubingestioncheckpointaspect_v1 hly76_S2Rz-Du626BnGQkw 1 1 0 0 566b 283b
green open dashboardindex_v2_1641502057683 LRzBB2-nRPGQ8iQuE0oeFA 1 1 0 0 566b 283b
green open corpdata_osdatahubglossarytermdocument Vn-maiBVQAOUIV_p8xfNXg 1 1 0 0 566b 283b
green open tagindex_v2_1641502050101 ccIMh53xQPuu3C0wdsFQwg 1 1 0 0 566b 283b
green open mlfeatureindex_v2_1641502044111 aXzeJnmLTMiOAWDUzqJhlg 1 1 0 0 566b 283b
green open mlmodelindex_v2_1641502036576 F8dbw2EVSL-WVbw_5d0sIQ 1 1 0 0 566b 283b
green open corpgroupindex_v2_1641502035072 RMk9-RJiQlWxiPJy4VFlTw 1 1 2 0 24.5kb 18.8kb
green open glossarytermindex_v2_1641502051585 F2KVEKmPRz2m70WkSsDF1g 1 1 0 0 566b 283b
green open datahubretentionindex_v2_1644255720451 StdL8srTTl6wVJifa1nbyA 1 1 0 0 566b 283b
green open dataset_datasetusagestatisticsaspect_v1_1641502061808 4Kg9W1-cSPePzvVcUUnLBw 1 1 0 0 566b 283b
green open mlfeaturetableindex_v2_1641502039615 2iGd0dGIRrGaArSJkmiz2A 1 1 0 0 566b 283b
green open datajob_datahubingestionrunsummaryaspect_v1 CDNmMTRDTeqcBh-7XiX17A 1 1 0 0 566b 283b
green open mlprimarykeyindex_v2_1641502053085 aOjDjKrCSQudh6FSVtmMkw 1 1 0 0 566b 283b
green open dataset_datasetprofileaspect_v1 8PefuaVqRI2OWMO1Ze7i_g 1 1 0 0 566b 283b
green open mlmodeldeploymentindex_v2_1644255718602 Zybk2-3xSYyPAXOllgw8zw 1 1 0 0 566b 283b
green open mlmodelgroupindex_v2_1641502041169 w86nnQvsR2WdIekdSIuOiw 1 1 0 0 566b 283b
green open corpuserindex_v2_1644255724994 BJoL8M_kSfObNpR7cmjsNQ 1 1 2 0 14.2kb 7.1kb
green open dataflowindex_v2_1641502045582 3ibt0SjyT0aiA5Sy1gYlUQ 1 1 0 0 566b 283b
green open graph_service_v1 gi9y5hM8T4aOreu3-NCd6w 1 1 0 0 566b 283b
green open system_metadata_service_v1_1641502059168 C_pbCmO9T9eSg6ecYYt94A 1 1 9679 0 2mb 1mb
green open chartindex_v2_1641502056160 V5Ep92SlTrW1UWIW2Toh9A 1 1 0 0 566b 283b
green open dataprocessindex_v2_1641502038108 F6zqo_PYTQu8edBWHdzXxQ 1 1 0 0 566b 283b
green open glossarynodeindex_v2_1641502042625 oS_yIopwSTyyXtbZxoGrTA 1 1 0 0 566b 283b
green open schemafieldindex_v2_1641502048525 fPwcSr8GSiet6uDFR9X9ow 1 1 0 0 566b 283b
green open dataplatformindex_v2_1644255721972 VWoJNIn9Qimg6eWShuom8g 1 1 0 0 566b 283b
green open datahubpolicyindex_v2_1644255723467 pCQek1yHQcKwdd3SNVo5SQ 1 1 0 0 566b 283b
green open datasetindex_v2_1641502054631 uEoYrqfxTPek-1pn9gwcOg 1 1 1622 1001 3.2mb 1.7mb
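A listing like the one above comes straight from Elasticsearch's cat API, so you can check at any time what the gms service has created (the host is a placeholder):

```shell
# List all indexes, with a header row, sorted by index name.
curl -s 'http://elasticsearch.internal:9200/_cat/indices?v&s=index'
```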
Kafka Preparation
Datahub created kafka-setup.sh, from which you can see which topics need to be created beforehand. Since most events use the Avro data format, you might also need to register the Avro schemas with your schema registry.
Once you have built the Datahub project successfully, you will find some .avsc files.
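A sketch of that Kafka preparation. The topic names below match older Datahub releases and do change between versions, so treat kafka-setup.sh as the source of truth; the bootstrap server, schema-registry URL, and the .avsc filename are placeholders:

```shell
# Create the core event topics (names vary by Datahub version -- check kafka-setup.sh).
for topic in MetadataChangeEvent_v4 MetadataAuditEvent_v4 FailedMetadataChangeEvent_v4; do
  kafka-topics.sh --bootstrap-server kafka.internal:9092 \
    --create --if-not-exists --topic "$topic" --partitions 1 --replication-factor 3
done

# Register an Avro schema with a Confluent-style schema registry
# (the .avsc file here stands in for one of the files from your Datahub build).
jq -n --rawfile s MetadataChangeEvent.avsc '{schema: $s}' |
  curl -s -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
    --data @- http://schema-registry.internal:8081/subjects/MetadataChangeEvent_v4-value/versions
```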
With all of that setup done, your datahub-frontend and datahub-gms should be ready to go.
Issues I have seen
Since this blog post is long enough, I will use another post to talk about some issues I have seen and how I fixed them. You can find Part II here.