Best Metadata Platform in 2022?

Liangjun Jiang
May 10, 2022

OK, the title is obviously clickbait. Each metadata platform (DataHub, Alation, Collibra, Apache Atlas, Atlan, OpenMetadata, and more) is the best in its own right. Also, from my limited experience with DataHub, Alation, Collibra, Apache Atlas, and OpenMetadata, I have the feeling that their feature sets are converging as time goes by.

Alation and Collibra, two leading commercial products, have been around longer than the others and offer more features than the open source products or the newer commercial ones.

Are we there yet?

Can we safely say that we will solve the metadata problem very soon, and that whoever the metadata platform is meant to serve will be happy?

I don’t think so. I have always thought the biggest problem of a metadata discovery solution is the lack of full and complete content, not the lack of discovery features.

Full: does the metadata platform have all the metadata it is supposed to support? For example, suppose a metadata discovery tool supports metadata types such as SQL schemas, Kafka schemas, machine learning models, and business glossaries. Can you trust that the tool has the content for every one of those types?

Complete: for each metadata item, does it have all the aspects (information) it is supposed to have? For example, a SQL schema is supposed to come with table owners and proper sensitivity tags for certain columns, besides the schema itself. Does it?
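
To make the two terms concrete, here is a minimal sketch of how "full" and "complete" could be checked. Everything in it is an assumption for illustration: `MetadataItem`, `REQUIRED_ASPECTS`, the `inventory` of what exists in the source systems, and the function names are hypothetical and do not come from any of the platforms mentioned above.

```python
from dataclasses import dataclass, field

# Aspects each metadata type is expected to carry (assumed for illustration).
REQUIRED_ASPECTS = {
    "sql_schema": {"schema", "owner", "sensitivity_tags"},
    "kafka_schema": {"schema", "owner"},
}


@dataclass
class MetadataItem:
    """Hypothetical catalog entry: a name, a metadata type, and its aspects."""
    name: str
    type: str
    aspects: dict = field(default_factory=dict)


def missing_items(catalog, inventory):
    """Fullness: names that exist in the source systems but not in the catalog."""
    gaps = {}
    for type_, expected in inventory.items():
        present = {item.name for item in catalog if item.type == type_}
        absent = expected - present
        if absent:
            gaps[type_] = absent
    return gaps


def missing_aspects(item):
    """Completeness: required aspects that are absent or empty on one item."""
    required = REQUIRED_ASPECTS.get(item.type, set())
    # An empty value counts as missing here; a real check might need to tell
    # "no sensitive columns" apart from "nobody has tagged this yet".
    return {aspect for aspect in required if not item.aspects.get(aspect)}


# The catalog only knows one table, and that table has no owner or tags.
catalog = [
    MetadataItem("orders", "sql_schema", {"schema": ["id", "email"], "owner": ""}),
]
# What the source systems actually contain (assumed to be obtainable somehow).
inventory = {"sql_schema": {"orders", "customers"}, "kafka_schema": {"clicks"}}

print(missing_items(catalog, inventory))
print(sorted(missing_aspects(catalog[0])))
```

The hard part in practice is the `inventory` argument: the check is only as good as your knowledge of what should be there in the first place.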

Lack of Content Problem

A metadata discovery tool works this way: as long as you (I don’t know who) send the information to me, I will show it, make it searchable, and provide features that let you do whatever you want with it, such as tagging, adding descriptions, links, etc.

But in reality, besides the basic schemas, I hardly ever see other information populated.

In the end, when your data scientists or other users can’t see who really owns a table, they still have to ask around. Not surprisingly, asking for the table owner is the most frequent request I have experienced.

Another Dilemma

I think that for most platform tools, the one who creates/ingests the content is also the one who uses it. The biggest problem for a metadata discovery platform is that the one who creates the content doesn’t really use the tool, or at least is not going to come to the tool to verify the content.

The Audience

Another problem I have seen is that there are just too many potential audiences, and you, the one who develops the metadata discovery tool, hardly know whose problem you are solving.

For example, the users of the metadata discovery tool can be analysts or data scientists. They come here to look for the data catalog (schemas, tables). Those needs can be met. However, they immediately want to see the real data, so they will ask for a data preview feature. Otherwise, they will probably just stop by the query engine (Trino, Starburst, Querybook, etc.) first.

Pulling the data catalog is quite different from pulling the real data.

Then the data governance people come in. They want to see that the data catalog is properly managed: do sensitive fields have a sensitivity tag? If not, more questions follow. How do you fix that? Do you let someone enter the tag through the metadata discovery tool’s UI? If so, who is this someone? The developers who create the table schema don’t use the discovery tool, so they will not come. And if there is no table owner, you don’t know whom to ask.

Some tools have started to use the term Data Steward. To me, the question is more: how does this steward know which fields should be tagged?
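
As a rough illustration of how limited the steward's options are, one naive starting point is to flag columns whose names merely look sensitive and have no tag yet. This is only a sketch under stated assumptions: the name patterns and the `columns_needing_review` helper are made up for this example, and a name-based guess will miss sensitive data stored under innocent-looking column names.

```python
import re

# Assumed name patterns; any real deployment would need its own list.
SENSITIVE_NAME = re.compile(
    r"(email|phone|ssn|passport|birth|address)", re.IGNORECASE
)


def columns_needing_review(columns, tagged):
    """Columns whose names look sensitive but carry no sensitivity tag."""
    return [
        col for col in columns
        if SENSITIVE_NAME.search(col) and col not in tagged
    ]


columns = ["id", "Email", "home_address", "created_at", "phone_number"]
tagged = {"phone_number"}  # columns that already have a sensitivity tag

candidates = columns_needing_review(columns, tagged)
print(candidates)
```

Even when this produces candidates, someone who actually knows the data still has to confirm them, which brings back the question of who that someone is.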

You would also ask: will the metadata discovery tool be the source of truth? Even if the data steward marked a field or added some tags, a change to the table schema could make the tags invalid. In other words, the content generated through the discovery tool can’t be trusted. Only the one who created the content should be the one updating it.
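
The invalidation problem is easy to show with a small sketch. It assumes tags are stored in the discovery tool keyed by column name, separately from the schema itself; `find_stale_tags` and `untagged_columns` are hypothetical helpers, not functions of any real platform.

```python
def find_stale_tags(column_tags, current_columns):
    """Tags attached to columns that no longer exist in the current schema."""
    current = set(current_columns)
    return {col: tags for col, tags in column_tags.items() if col not in current}


def untagged_columns(column_tags, current_columns):
    """Columns in the current schema that carry no tag at all."""
    return [col for col in current_columns if col not in column_tags]


# Tags a steward entered through the discovery tool's UI.
steward_tags = {"email": ["PII"], "ssn": ["PII", "restricted"]}

# Later the table owner renames one column and drops another.
schema_after_change = ["id", "email_address", "created_at"]

stale = find_stale_tags(steward_tags, schema_after_change)
untagged = untagged_columns(steward_tags, schema_after_change)
print(stale)
print(untagged)
```

After the rename, the `PII` tag points at a column that is gone, while the new `email_address` column shows up with no tag. The tool can detect the mismatch, but it cannot know that the two columns hold the same data.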

If more metadata types are introduced, more of the dilemmas stated earlier will surface: for example, dashboards, machine learning models, data jobs and flows, business terms or glossaries, etc.

I think all these problems really come down to two things:

1. the content creator is not the user of the discovery tool

2. who is the source of truth for the content

Data Catalog vs. Metadata Discovery Platform

A data catalog is a subset of the metadata discovery platform, and data catalog tools have been around for a long time.

I think that before we really try to develop features to support other metadata, we need to examine harder whether we can solve the problems of this one type of metadata.

For the data catalog itself, it’s reasonable to ask for schemas, owners (team or individual), sensitivity tags, other tags (domain, business unit, etc.), data preview, and some data profiling. I think that should be good enough. If lineage can be done properly, that would be a bonus.

But how? How do we collect accurate owners and other tags?

I will propose my solution in the second part.
