Design an End-to-End Data Lineage Solution

Liangjun Jiang
6 min read · Apr 12, 2022

Let’s assume you already have some knowledge of data lineage — what it is, what it does, and how it is presented.

What’s Data Lineage

If not, here is a definition of data lineage, borrowed from here:

Data lineage uncovers the life cycle of data — it aims to show the complete data flow, from start to finish. Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. This includes all transformations the data underwent along the way — how the data was transformed, what changed, and why.

A nice (or maybe misleading) graphical illustration can be seen here:

Data Lineage Illustration, source

The Value of Tracking Data Lineage

What’s the value of tracking data lineage? First of all, tracking data lineage is not an easy task. So suppose we had a perfect lineage collection and presentation system: what value could it bring to the company? People who answered a survey conducted by OSS Datahub voted for the following answers:

Lineage Value

Screenshot collected from the OSS Datahub Slack channel survey

Let’s say those four values are what a complete and accurate data lineage can bring.

Naturally, this points to the two aspects of data lineage: collection and presentation. This blog breaks down into those two parts: lineage collection and lineage presentation.

Lineage Collection

First of all, let’s look at the possible use cases that generate lineage, and a lineage collection solution for each case.

Within a Database

In layman’s terms, or maybe the simplest terms, data lineage can come from a single SQL statement, such as:

"insert into db1.table1 select * from db2.table2"
Statements(#): 1
Source Tables:
db2.table2
Target Tables:
db1.table1

Data lineage in this context is self-explanatory: db1.table1 is derived from db2.table2.

The simplest case that generates lineage info is an engineer writing an insert-select SQL statement against a SQL database of their choice, such as:

with customer_total_return as
(select sr_customer_sk as ctr_customer_sk
, sr_store_sk as ctr_store_sk
, sum(SR_RETURN_AMT) as ctr_total_return
from store_returns
, date_dim
where sr_returned_date_sk = d_date_sk
and d_year = 2000
group by sr_customer_sk
, sr_store_sk)
insert overwrite table query01
select c_customer_id
from customer_total_return ctr1
, store
, customer
where ctr1.ctr_total_return > (select avg(ctr_total_return) * 1.2
from customer_total_return ctr2
where ctr1.ctr_store_sk = ctr2.ctr_store_sk)
and s_store_sk = ctr1.ctr_store_sk
and s_state = 'TN'
and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id
limit 100;

Then we will see a visual like this for the table-level lineage:

Table-level lineage , generated by SQL Parser

And its column-level lineage:

Column-level lineage, generated by SQL Parser

In this case, the data aggregation SQL statement happens only within a single SQL database. Both the source tables and the target table live in that database.

Solution

A general solution to this use case is a technique called query parsing, or SQL parsing. In general, the SQL parser engine receives the database’s query log, analyzes it statement by statement, and extracts the lineage info.

For example, this is how the industry-leading data catalog product, Alation, collects lineage.

Since the SQL syntax differs somewhat from one database to another, a one-size-fits-all SQL parser is quite difficult to build.
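To make the idea concrete, here is a minimal sketch using the open-source Python library sqllineage as the parser (the library choice is mine for illustration; it is not what Alation uses internally). It produces the kind of source/target breakdown shown earlier.

# pip install sqllineage   (an open-source SQL parser; the library choice is an assumption)
from sqllineage.runner import LineageRunner

sql = "insert into db1.table1 select * from db2.table2"
runner = LineageRunner(sql)

print(runner.source_tables())  # tables the statement reads from, e.g. db2.table2
print(runner.target_tables())  # tables the statement writes to, e.g. db1.table1

# Printing the runner itself gives a summary in the same format as the earlier example
print(runner)

In a real pipeline, you would feed the database’s query log through such a parser statement by statement and store the resulting edges.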

Across Different Platforms, Databases and Tools

It gets much more complicated when data flows across different platforms, databases, and tools.

Case 1: Engineers move data from a classic database (SQL Server, MySQL, Teradata) using a producer app that publishes to a Kafka stream, while another consumer job pulls that data and writes it to a data lake, illustrated as:

Stream lineage, source

For Kafka lineage, Confluent Kafka provides a built-in lineage tool to answer:

Where did data come from?
Where is it going?
Where, when, and how was it transformed?

Besides the feature in Confluent Kafka, there are other solutions, such as the Cloudera Kafka Atlas hook, that provide similar info. They present information such as who a topic’s producers are, and who the consumers are.

Kafka Stream Lineage, source
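Confluent and Atlas aside, a do-it-yourself option is to have the producer app report the edge itself. Below is a rough sketch assuming the kafka-python client; the lineage endpoint and the report_lineage helper are hypothetical, not part of any Kafka API.

# pip install kafka-python requests   (library choices are assumptions)
import json
import requests
from kafka import KafkaProducer

LINEAGE_API = "http://lineage-service.internal/edges"  # hypothetical in-house lineage store

def report_lineage(source: str, target: str) -> None:
    # Register a single "source -> target" edge with the lineage store (hypothetical API).
    requests.post(LINEAGE_API, json={"source": source, "target": target})

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_table(rows, topic: str, source_table: str) -> None:
    # Record the table -> topic edge once per run, before streaming the rows.
    report_lineage(source=f"mysql://{source_table}", target=f"kafka://{topic}")
    for row in rows:
        producer.send(topic, value=row)
    producer.flush()

# publish_table(read_mysql_table("MySQL_A"), topic="TA", source_table="MySQL_A")

The consumer job can report its own Kafka-topic-to-data-lake edge in the same way.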

Case 2: Engineers move data from classic databases, then transform and load it into a data lake, illustrated as such. To make it repeatable, the ETL process is deployed in an Airflow environment.

ETL with Airflow, source

Solution

As of Airflow 2.2.5, Airflow Lineage is a built-in experimental feature, and it provides a lineage API to third-party backends. We can also use Datahub’s Airflow lineage plugin for our purpose, with the necessary data model transformation.

Airflow Lineage also integrates well with OpenLineage, which supports databases such as BigQuery, Snowflake, and so on.
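Here is a minimal sketch of how the experimental inlets/outlets declaration looks on an Airflow 2.x DAG; the file paths and task commands are made up for illustration, and a lineage backend such as Datahub’s plugin can then pick these declarations up.

from datetime import datetime

from airflow import DAG
from airflow.lineage import AUTO
from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

raw = File(url="s3://lake/raw/orders.csv")        # hypothetical paths
cleaned = File(url="s3://lake/clean/orders.csv")

with DAG(
    dag_id="etl_orders",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo extract",
        outlets=[raw],        # this task produces the raw file
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo transform",
        inlets=AUTO,          # inherit the upstream task's outlets as inlets
        outlets=[cleaned],    # ...and declare what this task writes
    )
    extract >> transform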

Case 3: Engineers use Spark to compute and transform data from S3 and load it into the data lake. The job might run once or be scheduled regularly with Airflow. This case is very similar to the one above.

Spark compute

Spline is one solution for collecting the detailed lineage info. Similarly, OSS Datahub has lately published its own Spark lineage solution.
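With Spline, lineage capture is configured on the Spark session rather than written by hand. Below is a rough PySpark sketch, assuming the Spline agent bundle jar is on the Spark classpath and a Spline server is reachable; the exact config key names vary between agent versions, and the paths are hypothetical.

from pyspark.sql import SparkSession

# Assumes the spark-spline-agent-bundle jar is available to Spark (e.g. via --jars or spark.jars.packages)
spark = (
    SparkSession.builder
    .appName("s3_to_lake")
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    .config("spark.spline.lineageDispatcher.http.producer.url",
            "http://spline-server:8080/producer")  # hypothetical Spline endpoint; key name varies by version
    .getOrCreate()
)

# Any write triggered below is observed by the listener and reported to Spline.
df = spark.read.parquet("s3a://raw-bucket/events/")   # hypothetical source
df.filter("event_type = 'purchase'").write.mode("overwrite").parquet("s3a://lake/purchases/")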

Case 4: Engineers use tools such as Sqoop or dbt to pull data from one source and copy it directly to another destination.

dbt lineage, source

Solution

A direct data copy from one data source to another destination needs a tool-dependent solution. For example, dbt itself provides lineage for you.
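dbt infers lineage from the ref() calls between models. If you want to push that lineage into your own system, one possible approach (a sketch, assuming the default dbt artifact path) is to read the manifest.json that dbt writes after a run and walk each node’s depends_on list:

import json

# dbt writes this artifact after `dbt run` / `dbt compile`
with open("target/manifest.json") as f:
    manifest = json.load(f)

edges = []
for node_id, node in manifest["nodes"].items():
    for upstream_id in node.get("depends_on", {}).get("nodes", []):
        edges.append((upstream_id, node_id))

for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
# e.g. model.my_project.stg_orders -> model.my_project.orders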

Case 5: It’s not uncommon for data analysts to aggregate data and show the result as a dashboard in Looker or Tableau, or post it to an API endpoint.

source

End-to-End Solution

As we can see, schemas can land in databases from different tools and at different times. In my opinion, the key to an end-to-end lineage solution is that a schema can still be connected to the correct nodes even when the lineage is created by different tools at different times.

For example, we use a producer app to pull data from table MySQL_A in a MySQL database and load that data into a Kafka stream with topic TA and schema Kafka_A, then use a consumer app to land the data in the data lake as DL_A_temp. A few days later, we use an ETL job in Airflow to transform DL_A_temp into DL_A.

Lineage across different system, tools and time

In this case, the lineage of DL_A, viewed from the upstream direction, would be:

MySQL_A -> Kafka_A -> DL_A_temp -> DL_A

Most likely, we can get the lineage from MySQL_A to Kafka_A from Producer_A’s implementation.

Similarly, we can get the lineage from Kafka_A to DL_A_temp from Consumer_A’s implementation. DL_A_temp to DL_A happens in another scheduled ETL job, which is not part of Consumer_A’s responsibility.

But how can we connect DL_A all the way back to MySQL_A?

A graph database will help us connect those discrete lineage fragments and piece them together to find the end-to-end lineage of DL_A.
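As a tiny sketch of the idea, here the graph database is stood in for by networkx: each tool writes whatever edges it knows about, whenever it runs, and the full upstream lineage of DL_A falls out of a traversal.

# pip install networkx   (stand-in for a real graph database)
import networkx as nx

g = nx.DiGraph()

# Edges arrive from different tools at different times.
g.add_edge("MySQL_A", "Kafka_A")      # reported by Producer_A
g.add_edge("Kafka_A", "DL_A_temp")    # reported by Consumer_A
g.add_edge("DL_A_temp", "DL_A")       # reported by the scheduled Airflow ETL job

# Everything upstream of DL_A, pieced together across tools and time.
print(nx.ancestors(g, "DL_A"))        # {'MySQL_A', 'Kafka_A', 'DL_A_temp'}

# Or the full chain from the original source.
print(nx.shortest_path(g, "MySQL_A", "DL_A"))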

Summary

This link also summarizes all the possible cases in which data lineage is collected in an organization. In the end, we think our solution needs to:

  1. Require no manual or extra work from developers
  2. Be as automated as possible
