Deploy Open Source Datahub
I have written a few blog posts about the open-source project Datahub over the past years (mainly in 2020). Since then, Datahub has evolved quite a lot, and some of my past writing is definitely outdated.
If you are not familiar with Datahub, it is an open-source metadata platform for the modern data stack.
Today I will focus on deploying this project, specifically without using Kubernetes.
Datahub’s Architecture
Looking at the code base, Datahub itself consists of five services:
- Datahub UI — a React.js app
- Datahub Frontend — a Java app built on the Play Framework
- Datahub GMS — a Java backend app
- Datahub Metadata Change Event (MCE) Consumer — a Kafka consumer app
- Datahub Metadata Audit Event (MAE) Consumer — a Kafka consumer app
Once you reach the deployment stage, the services can be simplified as follows:
- the UI and Frontend ship as one Docker image: datahub-frontend
- the GMS backend, with the MAE and MCE consumers embedded
Of course, you can also deploy the MAE and MCE consumers separately, which gives you four services:
- Datahub frontend
- Datahub GMS
- Datahub MCE
- Datahub MAE
Should you deploy MCE and MAE embedded in GMS, or separately? It really depends on how much metadata you expect to ingest into your Datahub, say, per hour. In general, Datahub only stores metadata, so you should not expect too much volume; embedding the consumers in GMS is usually enough.
Datahub also needs the following external services:
- Elasticsearch
- A graph database (you can choose Elasticsearch, Neo4j, or Dgraph)
- A relational database (MySQL, PostgreSQL, etc.)
- Kafka
Once again, if you choose Elasticsearch as your graph database as well, you only need these three services:
- Elasticsearch
- MySQL
- Kafka
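If those services are provisioned for you, a quick connectivity check before deploying anything saves a lot of debugging later. A minimal sketch; every hostname and port below is a placeholder for whatever your environment actually uses:

```shell
# Hypothetical endpoints -- substitute the addresses your infra team provides.
curl -s http://elasticsearch.internal:9200/_cluster/health   # Elasticsearch up?
mysql -h mysql.internal -P 3306 -u datahub -p -e 'SELECT 1'  # MySQL reachable?
kafka-broker-api-versions.sh --bootstrap-server kafka.internal:9092 >/dev/null \
  && echo 'kafka ok'                                         # Kafka reachable?
```

If SSL is enabled on any of these (common with managed services), the same checks apply with https:// and the appropriate client flags.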
The Deployment Preparation
Needless to say, Datahub publishes pre-built Docker images, many docker compose files, and various scripts so you can get started quickly.
For example, you can run this compose file to start the services using Elasticsearch as the graph database:
docker compose -f docker-compose-without-neo4j.yml up
This YAML file uses the GMS Docker env file, in which the MCE and MAE consumers are enabled inside GMS:
MAE_CONSUMER_ENABLED=true
MCE_CONSUMER_ENABLED=true
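Once the compose stack is up, you can sanity-check it from the host. The ports below assume the default compose mappings (frontend on 9002, GMS on 8080); adjust if yours differ:

```shell
docker compose -f docker-compose-without-neo4j.yml ps   # every service should show "Up"
curl -s http://localhost:8080/config                    # GMS answers with its config JSON
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9002   # frontend should return 200
```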
Now you have the minimum number of services.
The Deployment
The Datahub project does have Helm charts to deploy all services on Kubernetes. You can use the AWS/GCP/Azure managed Kubernetes services to get it going quickly.
In a real-world environment, your company probably has dedicated teams managing services such as Elasticsearch, Kafka, and relational databases. For example, your company's cloud infra team may have provisioned those services on AWS, Azure, or GCP. They will give you information such as the Elasticsearch host address, whether SSL is enabled, the Kafka cluster details (it might be Confluent Kafka or open-source Kafka), and the MySQL databases.
Most likely, your responsibility will be to make sure your Datahub-Frontend and Datahub-GMS services can connect to those provisioned MySQL, Kafka, and Elasticsearch instances.
Of course, you also have to do some setup in MySQL, Kafka, and Elasticsearch.
Yes, Datahub provides dedicated setup Docker images for this purpose. But again, I don't think you have to run a Docker image to do that, even if your cloud infra team allows you to.
MySQL Preparation
You need to create a database, create some table schemas under this database, and seed a user. This link is the init SQL.
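A minimal sketch of that preparation using the mysql client. The host, credentials, and database name are placeholders, and the authoritative DDL is the init SQL linked above:

```shell
# Placeholders: host, user, password. The real table schemas come from the init SQL.
mysql -h mysql.internal -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS datahub CHARACTER SET utf8mb4;
CREATE USER IF NOT EXISTS 'datahub'@'%' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON datahub.* TO 'datahub'@'%';
FLUSH PRIVILEGES;
SQL
# init.sql here stands for the script linked above; it creates the aspect tables.
mysql -h mysql.internal -u datahub -p datahub < init.sql
```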
Elasticsearch (ES) Preparation
ES preparation is much more complicated. If you look at create-indices.sh, you will notice that a few indexes, such as datahub_usage_event, can be created with this script. But for the most important entities, the indexes are created (and updated) on the fly. That is, the gms service will ask your ES whether those indexes exist; if not, it creates them, and if they are out of date, it updates them.
To give you an idea, here is the list of indexes; it is getting long.
green open datajobindex_v2_1641502047059 f0kNpMD0RQaqzFJH89i97A 1 1 0 0 566b 283b
green open datajob_datahubingestioncheckpointaspect_v1 hly76_S2Rz-Du626BnGQkw 1 1 0 0 566b 283b
green open dashboardindex_v2_1641502057683 LRzBB2-nRPGQ8iQuE0oeFA 1 1 0 0 566b 283b
green open corpdata_osdatahubglossarytermdocument Vn-maiBVQAOUIV_p8xfNXg 1 1 0 0 566b 283b
green open tagindex_v2_1641502050101 ccIMh53xQPuu3C0wdsFQwg 1 1 0 0 566b 283b
green open mlfeatureindex_v2_1641502044111 aXzeJnmLTMiOAWDUzqJhlg 1 1 0 0 566b 283b
green open mlmodelindex_v2_1641502036576 F8dbw2EVSL-WVbw_5d0sIQ 1 1 0 0 566b 283b
green open corpgroupindex_v2_1641502035072 RMk9-RJiQlWxiPJy4VFlTw 1 1 2 0 24.5kb 18.8kb
green open glossarytermindex_v2_1641502051585 F2KVEKmPRz2m70WkSsDF1g 1 1 0 0 566b 283b
green open datahubretentionindex_v2_1644255720451 StdL8srTTl6wVJifa1nbyA 1 1 0 0 566b 283b
green open dataset_datasetusagestatisticsaspect_v1_1641502061808 4Kg9W1-cSPePzvVcUUnLBw 1 1 0 0 566b 283b
green open mlfeaturetableindex_v2_1641502039615 2iGd0dGIRrGaArSJkmiz2A 1 1 0 0 566b 283b
green open datajob_datahubingestionrunsummaryaspect_v1 CDNmMTRDTeqcBh-7XiX17A 1 1 0 0 566b 283b
green open mlprimarykeyindex_v2_1641502053085 aOjDjKrCSQudh6FSVtmMkw 1 1 0 0 566b 283b
green open dataset_datasetprofileaspect_v1 8PefuaVqRI2OWMO1Ze7i_g 1 1 0 0 566b 283b
green open mlmodeldeploymentindex_v2_1644255718602 Zybk2-3xSYyPAXOllgw8zw 1 1 0 0 566b 283b
green open mlmodelgroupindex_v2_1641502041169 w86nnQvsR2WdIekdSIuOiw 1 1 0 0 566b 283b
green open corpuserindex_v2_1644255724994 BJoL8M_kSfObNpR7cmjsNQ 1 1 2 0 14.2kb 7.1kb
green open dataflowindex_v2_1641502045582 3ibt0SjyT0aiA5Sy1gYlUQ 1 1 0 0 566b 283b
green open graph_service_v1 gi9y5hM8T4aOreu3-NCd6w 1 1 0 0 566b 283b
green open system_metadata_service_v1_1641502059168 C_pbCmO9T9eSg6ecYYt94A 1 1 9679 0 2mb 1mb
green open chartindex_v2_1641502056160 V5Ep92SlTrW1UWIW2Toh9A 1 1 0 0 566b 283b
green open dataprocessindex_v2_1641502038108 F6zqo_PYTQu8edBWHdzXxQ 1 1 0 0 566b 283b
green open glossarynodeindex_v2_1641502042625 oS_yIopwSTyyXtbZxoGrTA 1 1 0 0 566b 283b
green open schemafieldindex_v2_1641502048525 fPwcSr8GSiet6uDFR9X9ow 1 1 0 0 566b 283b
green open dataplatformindex_v2_1644255721972 VWoJNIn9Qimg6eWShuom8g 1 1 0 0 566b 283b
green open datahubpolicyindex_v2_1644255723467 pCQek1yHQcKwdd3SNVo5SQ 1 1 0 0 566b 283b
green open datasetindex_v2_1641502054631 uEoYrqfxTPek-1pn9gwcOg 1 1 1622 1001 3.2mb 1.7mb
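A listing like the one above comes straight from Elasticsearch's cat API, so you can check at any time what the gms service has created (the host is a placeholder):

```shell
# List all indexes, with a header row, sorted by index name.
curl -s 'http://elasticsearch.internal:9200/_cat/indices?v&s=index'
```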
Kafka Preparation
Datahub created kafka-setup.sh, from which you can see which topics need to be created beforehand. Since most events use the Avro data format, you might also need to register the Avro schemas with your schema registry.
Once you have built the Datahub project successfully, you will find some .avsc files.
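A sketch of that Kafka preparation. The topic names below match older Datahub releases and do change between versions, so treat kafka-setup.sh as the source of truth; the bootstrap server, schema-registry URL, and the .avsc filename are placeholders:

```shell
# Create the core event topics (names vary by Datahub version -- check kafka-setup.sh).
for topic in MetadataChangeEvent_v4 MetadataAuditEvent_v4 FailedMetadataChangeEvent_v4; do
  kafka-topics.sh --bootstrap-server kafka.internal:9092 \
    --create --if-not-exists --topic "$topic" --partitions 1 --replication-factor 3
done

# Register an Avro schema with a Confluent-style schema registry
# (the .avsc file here stands in for one of the files from your Datahub build).
jq -n --rawfile s MetadataChangeEvent.avsc '{schema: $s}' |
  curl -s -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
    --data @- http://schema-registry.internal:8081/subjects/MetadataChangeEvent_v4-value/versions
```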
With all of that setup done, your datahub-frontend and datahub-gms should be ready to go.
Issues I have seen
Since this blog post is long enough, I will use another post to talk about some issues I have seen and how I fixed them. You can find Part II here.