The Sequence Diagram of Apache Airflow

Liangjun Jiang
Jul 29, 2020


Figure: Operating Apache Airflow, the sequence diagram

I have been revisiting some system design questions, such as designing a distributed web crawler. There are plenty of written answers to these questions, but after reading them I was still left wondering: what exactly is system design?

I don't mean to write up my own understanding of system design here. It basically doesn't matter, I guess.

Similarly, I have read another type of question: product design. So what is product design for an engineer?

When designing a system, shouldn't the system's components come first? If I were designing an electronic toy car, wouldn't listing the parts I need be the first step?

Once I have a parts list, the second step, I think, is to put the parts together following some chronological logic, as a starting point, to make sure the design at least works for the simplest and most likely case. So I need to design that logic.

For the third step, I would optimize the system's performance by carefully choosing the right parts for my electronic toy car. Similarly, in system design, this is the time to do some calculations.

Anyway, for designing a distributed web crawler, I personally found that the design of Apache Airflow is worth learning from and might even be directly applicable. Actually, if I were going to crawl websites, I might just use Apache Airflow to do it.

Using Apache Airflow, we start by writing our DAG; this is step 1 in the diagram above.
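
To make step 1 concrete, here is a rough sketch of what such a DAG file could look like. The dag_id, task_id, and crawl_site function are made up for illustration, and the import path assumes Airflow 1.x (in Airflow 2, PythonOperator moved to airflow.operators.python):

```python
# A minimal DAG sketch; the names below are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path


def crawl_site():
    # Placeholder for the real work, e.g. fetching and parsing pages.
    print("crawling...")


dag = DAG(
    dag_id="daily_crawl",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",  # when the DAG should run (see step 3)
    catchup=False,
)

fetch_pages = PythonOperator(
    task_id="fetch_pages",
    python_callable=crawl_site,
    dag=dag,
)
```

The schedule_interval and start_date arguments are the "when" that step 3 below talks about.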

2. We make Airflow aware of our DAG by placing the file where Airflow looks for DAGs, i.e. the dags_folder configured in airflow.cfg.

3. As a user, we set when and how our DAG should run (the schedule_interval and start_date in the sketch above). These settings are persisted into Airflow's metadata database of choice, PostgreSQL, and registered with the Scheduler.

4. At the scheduled time, the Scheduler sends a message to the message queue service. The message contains only brief information about the job (steps 4 through 6 are illustrated by the toy sketch after step 6).

5. Workers constantly pull messages from the message queue service, so a worker will find the new message at the scheduled time. It then queries PostgreSQL for more information if needed. For example, if the DAG is meant to do some ETL with Hive, the worker needs a credential to access Hive, and we persist that credential in PostgreSQL. Also, the DAG itself might be a big file; PostgreSQL most likely only stores the path to the DAG, so the worker has to find the DAG file and execute it accordingly.

6. The worker persists the job's running result or status back to PostgreSQL.
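
To make steps 4 through 6 concrete, here is a toy sketch of the idea, not Airflow's actual internals. A stand-in scheduler pushes a brief message onto an in-memory queue, a stand-in worker pulls it, looks up extra details (a connection URI) in a stand-in metadata store, and writes the status back. All names, tables, and values below are invented; in a real deployment the queue would be a broker such as RabbitMQ or Redis, and the metadata store would be PostgreSQL.

```python
# Toy sketch of steps 4-6, NOT Airflow's real code: scheduler -> queue ->
# worker -> metadata DB. Everything here is a stand-in for illustration.
import json
import queue
import sqlite3

broker = queue.Queue()            # stand-in for the message queue service
db = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL metadata DB

db.execute("CREATE TABLE connection (conn_id TEXT, uri TEXT)")
db.execute("CREATE TABLE task_status (dag_id TEXT, task_id TEXT, state TEXT)")
db.execute("INSERT INTO connection VALUES ('hive_default', 'hive://warehouse:10000')")
db.commit()


def scheduler_enqueue():
    # Step 4: at the scheduled time, send only a brief description of the job.
    message = {"dag_id": "daily_crawl", "task_id": "fetch_pages", "conn_id": "hive_default"}
    broker.put(json.dumps(message))


def worker_run():
    # Step 5: pull the message, then query the metadata DB for anything else
    # the task needs (credentials, DAG file path, and so on).
    message = json.loads(broker.get())
    (uri,) = db.execute(
        "SELECT uri FROM connection WHERE conn_id = ?", (message["conn_id"],)
    ).fetchone()
    print(f"running {message['task_id']} against {uri}")

    # Step 6: persist the result or status back to the metadata DB.
    db.execute(
        "INSERT INTO task_status VALUES (?, ?, ?)",
        (message["dag_id"], message["task_id"], "success"),
    )
    db.commit()


scheduler_enqueue()
worker_run()
print(db.execute("SELECT * FROM task_status").fetchall())
```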

7. The frontend periodically fetches from PostgreSQL, picks up any update, and re-renders the UI.
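
Step 7 can be pictured as a plain polling loop over the metadata database. Again, this is only a sketch of the pattern, not the Airflow web server's actual code, and the table and row below are fabricated:

```python
# Toy illustration of step 7: the UI layer re-reads task state from the
# metadata DB on a timer and re-renders. All data here is fabricated.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_status (dag_id TEXT, task_id TEXT, state TEXT)")
db.execute("INSERT INTO task_status VALUES ('daily_crawl', 'fetch_pages', 'success')")
db.commit()


def render_ui():
    rows = db.execute("SELECT dag_id, task_id, state FROM task_status").fetchall()
    for dag_id, task_id, state in rows:
        print(f"{dag_id}.{task_id}: {state}")


for _ in range(3):  # poll a few times instead of forever
    render_ui()
    time.sleep(1)
```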
