Data and DevOps

Apache Airflow, Which Executor to use in Production?

Apache Airflow has Celery, Kubernetes, CeleryKubernetes, and Dask executors. This post explores what they are and how to use them.

Bhavani Ravi
Jan 18, 2023


Table of contents

  • Celery Executor
  • Kubernetes Executor
  • Kubernetes Celery Executor

Celery Executor

Celery is a framework for running distributed, asynchronous Python tasks.

Celery Executor has been a part of Airflow for a long time, even before Kubernetes. With Celery Executor, you must set a specific number of worker instances up front.

Pros

  1. In Airflow, you can cap the number of tasks that run concurrently on each worker. This works well when you have a predictable number of tasks to run per worker.

  2. Celery manages the worker processes. If one fails, Celery starts a replacement.
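The per-worker task limit corresponds to Airflow's `worker_concurrency` setting in the `[celery]` section. As a minimal sketch, the helper below builds the matching environment variables using Airflow's `AIRFLOW__{SECTION}__{KEY}` override convention. The function name, the broker URL, and the concurrency value are illustrative assumptions, not values from this post, and a real deployment needs further settings such as a result backend.

```python
def celery_executor_env(broker_url: str, worker_concurrency: int) -> dict:
    """Build environment variables that select CeleryExecutor.

    Hypothetical helper: Airflow reads any AIRFLOW__{SECTION}__{KEY}
    variable as an override for the matching airflow.cfg entry.
    """
    if worker_concurrency < 1:
        raise ValueError("worker_concurrency must be at least 1")
    return {
        "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
        "AIRFLOW__CELERY__BROKER_URL": broker_url,
        "AIRFLOW__CELERY__WORKER_CONCURRENCY": str(worker_concurrency),
    }


# Placeholder values; point the broker URL at your own Redis or RabbitMQ.
env = celery_executor_env("redis://redis:6379/0", worker_concurrency=8)
for key, value in sorted(env.items()):
    print(f"{key}={value}")
```

The printed lines can be pasted into a container or shell environment for the scheduler and the workers.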

Cons

  1. Celery needs a message broker (RabbitMQ or Redis) to queue tasks, which is an added dependency.

  2. Multiple tasks run on the same worker, which means one task can starve the others of the resources available on that worker.

  3. Keeping multiple workers running all the time wastes resources when there isn't much to process.
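To put a rough number on the idle capacity, here is a back-of-the-envelope sketch. Total capacity of a Celery deployment is workers multiplied by tasks per worker, and anything above the current load sits unused. The function name and all figures are made up for illustration.

```python
def idle_slots(workers: int, tasks_per_worker: int, running_tasks: int) -> int:
    """Return how many task slots sit unused across always-on Celery workers."""
    total_slots = workers * tasks_per_worker
    # Never report negative idle capacity when the load exceeds the slots.
    return max(total_slots - running_tasks, 0)


# Hypothetical deployment: 4 workers, 8 tasks each, 5 tasks running off-peak.
print(idle_slots(workers=4, tasks_per_worker=8, running_tasks=5))  # 27
```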

Kubernetes Executor

With KubernetesExecutor, Airflow spins up a new pod for each Airflow task it runs.

Pros

  1. Unlike Celery Executor, you don't have a bunch of workers always running. KubernetesExecutor is on-demand, thereby reducing cost.

  2. KubernetesExecutor lets you specify the resources required for each task, giving you more control.
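Per-task resources are passed through a task's `executor_config` with a `pod_override` entry. The sketch below shows one way to build it, assuming the Kubernetes Python client is installed (it comes in as a dependency of Airflow's `cncf.kubernetes` provider). The helper names `resource_spec` and `executor_config_for` and the CPU and memory values are hypothetical.

```python
def resource_spec(cpu: str, memory: str) -> dict:
    """Return matching requests and limits in Kubernetes quantity notation."""
    quantities = {"cpu": cpu, "memory": memory}
    return {"requests": dict(quantities), "limits": dict(quantities)}


def executor_config_for(cpu: str, memory: str) -> dict:
    """Build an executor_config dict carrying a pod_override for one task."""
    # Imported lazily so this module loads even where the Kubernetes
    # client is not installed.
    from kubernetes.client import models as k8s

    return {
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        # Airflow runs the task in the container named "base".
                        name="base",
                        resources=k8s.V1ResourceRequirements(
                            **resource_spec(cpu, memory)
                        ),
                    )
                ]
            )
        )
    }
```

A task would then receive it through the operator's `executor_config` argument, for example `executor_config=executor_config_for("1", "2Gi")`.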

Cons

  1. Each task has to wait for its pod to spin up. Compared to the advantages, though, this overhead is usually negligible.

  2. Setting up the infrastructure can be complicated if your team doesn't have Kubernetes skills.

Kubernetes Celery Executor

CeleryKubernetesExecutor brings the best of both the Celery and Kubernetes worlds, and also the worst. It is a good idea to use it only when necessary, for example:

  1. You have a few resource-hungry tasks that need more resources and isolation so that they don't starve other tasks.

  2. You have a mix of peak-time tasks with longer queues, which can run on Kubernetes, and predictable tasks with predictable resource needs, which Celery can handle.
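With CeleryKubernetesExecutor, the choice is made per task through the task's `queue`: tasks whose queue matches the `kubernetes_queue` setting (default `kubernetes`) go to KubernetesExecutor, and everything else goes to Celery. The function below is only a plain-Python illustration of that rule, not Airflow's own code, and `pick_executor` is a hypothetical name.

```python
# Airflow's default for [celery_kubernetes_executor] kubernetes_queue.
KUBERNETES_QUEUE = "kubernetes"


def pick_executor(task_queue: str, kubernetes_queue: str = KUBERNETES_QUEUE) -> str:
    """Mimic the routing rule: the Kubernetes queue goes to pods, the rest to Celery."""
    if task_queue == kubernetes_queue:
        return "KubernetesExecutor"
    return "CeleryExecutor"


print(pick_executor("kubernetes"))  # a resource-hungry, isolated task
print(pick_executor("default"))     # a regular, predictable task
```

In a DAG, this means setting `queue="kubernetes"` on the operators that need their own pod and leaving the rest on their usual Celery queue.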

In this post, you have seen how to utilize different Airflow executors to improve your tasks' performance while simultaneously optimizing the costs.


