Deploy Open Source Datahub — Part II

Liangjun Jiang
3 min read · Feb 9, 2022

In the first part of this blog, I talked about the services DataHub has and uses: you only need to deploy datahub-frontend-react and datahub-gms, and you rely on your cloud infra team to provide Elasticsearch, Kafka, and MySQL. I also discussed the application-specific setup for MySQL, Elasticsearch, and Kafka. Now I want to dive into some issues I have faced and how I solved them.

Deploy datahub-frontend-react and datahub-gms Services

With the datahub-frontend-react and datahub-gms docker images in hand, you might fire up two virtual machines (or one) on Azure or AWS, use (or modify) this docker env file for the frontend service, and use (and modify) this docker env file for the backend service with your ES host, Kafka, and MySQL connections. It should just work.
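For reference, here is a trimmed sketch of the kind of connection settings those env files carry. I am writing the variable names from memory, so double-check them against the env files in the DataHub version you deploy; the hosts and credentials are obviously placeholders.

# datahub-frontend
DATAHUB_GMS_HOST=your-gms-host
DATAHUB_GMS_PORT=8080
DATAHUB_SECRET=some-random-secret

# datahub-gms
EBEAN_DATASOURCE_HOST=your-mysql-host:3306
EBEAN_DATASOURCE_URL=jdbc:mysql://your-mysql-host:3306/datahub
EBEAN_DATASOURCE_USERNAME=datahub
EBEAN_DATASOURCE_PASSWORD=datahub
KAFKA_BOOTSTRAP_SERVER=your-kafka-broker:9092
KAFKA_SCHEMAREGISTRY_URL=http://your-schema-registry:8081
ELASTICSEARCH_HOST=your-es-host
ELASTICSEARCH_PORT=9200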

However, if you deploy with Azure App Services or AWS Elastic Beanstalk, you might hit the same problem I did. You can find more about the problem by reading through the DataHub Slack thread.

What happens is that Azure App Services and AWS Elastic Beanstalk add a load balancer in front of your datahub-gms service, and datahub-frontend can't talk to datahub-gms (the backend service) through that load balancer. The load balancer doesn't like some of datahub-frontend's request headers and rejects the request entirely. In the end, your frontend will never be able to talk to your gms backend, even though other means (the GraphQL API) can reach the gms backend and get responses smoothly.

Some people's fix is to point the frontend at datahub-gms's direct IP address, not the load balancer's IP address or host. If that is not an option for you, my solution is to comment out (remove) these lines in the Application.java file:

.setHeaders(request()
    .getHeaders()
    .toMap()
    .entrySet()
    .stream()
    .filter(entry -> !AuthenticationConstants.LEGACY_X_DATAHUB_ACTOR_HEADER.equals(entry.getKey())) // Remove X-DataHub-Actor to prevent malicious delegation.
    .filter(entry -> !Http.HeaderNames.CONTENT_LENGTH.equals(entry.getKey()))
    .filter(entry -> !Http.HeaderNames.CONTENT_TYPE.equals(entry.getKey()))
    .filter(entry -> !Http.HeaderNames.AUTHORIZATION.equals(entry.getKey()))
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue))
)

Elasticsearch (ES) Authentication

DataHub implements a couple of ES authentication mechanisms. The default for a local environment is HTTP without a username and password. The next one is HTTPS with basic auth (username and password).
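If I recall correctly, that HTTPS-with-basic-auth mode is switched on through the gms env; the variable names below are the ones I remember from the gms container docs, so verify them against the DataHub version you deploy:

ELASTICSEARCH_HOST=your-es-host
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_USE_SSL=true
ELASTICSEARCH_USERNAME=elastic
ELASTICSEARCH_PASSWORD=changeme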

What if your ES cluster is over HTTP but also requires basic auth? You'd say it doesn't make sense, but unfortunately, things happen.

So you need to edit around line 70 of RestHighLevelClientFactory.java to make sure the no-SSL-but-basic-auth combination is supported; a rough sketch of the idea follows.
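This is a minimal sketch, not DataHub's actual factory code: host, port, useSSL, username and password stand in for whatever fields RestHighLevelClientFactory already reads from configuration. The point is simply to attach the credentials provider independently of the SSL decision.

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

static RestHighLevelClient buildClient(String host, int port, boolean useSSL, String username, String password) {
  RestClientBuilder builder = RestClient.builder(new HttpHost(host, port, useSSL ? "https" : "http"));
  if (username != null && password != null) {
    // Attach basic auth even when SSL is off, so a plain-HTTP cluster that requires auth still works.
    CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
    credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(username, password));
    builder.setHttpClientConfigCallback(httpClientBuilder ->
        httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider));
  }
  return new RestHighLevelClient(builder);
}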

In the gms docker environment file, it looks like DataHub is supposed to support Java keystore- and truststore-based authentication as well, but I didn't see the code implementing that.

Custom ES Search Index Prefix

If you are required to add a prefix to your DataHub entities' ES index names, for example, the default index for tags is

tagindex_v2

and you are required to add abc_mouse as the prefix, so it should be

abc_mouse_tagindex_v2

you can add this to your gms docker env:

INDEX_PREFIX=abc_mouse

Kafka SSL

If you are using Confluent Cloud Kafka, the managed Kafka, I think you should be good to go. I didn't really try it. I read this line in the gms Dockerfile:

cp /usr/lib/jvm/java-1.8-openjdk/jre/lib/security/cacerts /tmp/kafka.client.truststore.jks \

I assume it is meant to be the piece that makes the connection to Confluent Cloud Kafka work.

But what if your Kafka cluster's SSL certificate is self-signed? That means you need to provide a truststore, a keystore, and their passwords, and make some code changes.

First of all, you need to modify application.yml to define some environment-backed properties under the existing kafka section:

protocol: ${KAFKA_PROTOCOL:SSL}
ssl:
  trustStoreLocation: ${KAFKA_TRUST_STORE_LOCATION}
  trustStorePassword: ${KAFKA_TRUST_STORE_PASSWORD}
  keyStoreLocation: ${KAFKA_KEY_STORE_LOCATION}
  keyStorePassword: ${KAFKA_KEY_STORE_PASSWORD}
  keyPassword: ${KAFKA_KEY_PASSWORD}

Secondly, you will need to modify KafkaEventProducerFactory.java and KafkaEventConsumerFactory.java, adding the following as member variables:

@Value("${kafka.protocol}")
private String protocol;

@Value("${kafka.ssl.trustStoreLocation}")
private String trustStoreLocation;

@Value("${kafka.ssl.trustStorePassword}")
private String trustStorePassword;

@Value("${kafka.ssl.keyStoreLocation}")
private String keyStoreLocation;

@Value("${kafka.ssl.keyStorePassword}")
private String keyStorePassword;

@Value("${kafka.ssl.keyPassword}")
private String keyPassword;

and add the following below line #65:

if (protocol != null && protocol.equals("SSL")) {
props.put("security.protocol", protocol);
props.put("ssl.truststore.location", trustStoreLocation);
props.put("ssl.truststore.password", trustStorePassword);
props.put("ssl.keystore.location", keyStoreLocation);
props.put("ssl.keystore.password", keyStorePassword);
props.put("ssl.key.password", keyPassword);
}

Of course, don't forget to add those variables to GMS's Docker env:

KAFKA_PROTOCOL=SSL
KAFKA_TRUST_STORE_LOCATION=
KAFKA_TRUST_STORE_PASSWORD=
KAFKA_KEY_STORE_LOCATION=
KAFKA_KEY_STORE_PASSWORD=
KAFKA_KEY_PASSWORD=
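One note on the *_LOCATION values: they are file paths inside the gms container, so the JKS files have to actually be there. The paths below are just examples from my own setup; copying the files in at image build time is the simplest option (mounting them as a volume works too).

COPY certs/kafka.client.truststore.jks /etc/datahub/certs/kafka.client.truststore.jks
COPY certs/kafka.client.keystore.jks /etc/datahub/certs/kafka.client.keystore.jks

Then point KAFKA_TRUST_STORE_LOCATION and KAFKA_KEY_STORE_LOCATION at those paths.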

I think those are the problems I ran into, and the customizations I needed, to get the services working for me.
