<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Data and DevOps]]></title><description><![CDATA[Deep Dive on Data Engineering and Devops concepts]]></description><link>https://dataanddevops.com</link><generator>RSS for Node</generator><lastBuildDate>Thu, 16 Apr 2026 13:32:04 GMT</lastBuildDate><atom:link href="https://dataanddevops.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Apache's Opensource World Of Data Tools]]></title><description><![CDATA[The Apache Software Foundation, entering its 22nd year, has paved the way for tonnes of data tools that set state-of-the-art methods, models, and guidelines for data engineers and DevOps practitioners to think about their data infrastructure. The goal of thi...]]></description><link>https://dataanddevops.com/apaches-opensource-world-of-data-tools</link><guid isPermaLink="true">https://dataanddevops.com/apaches-opensource-world-of-data-tools</guid><category><![CDATA[Data Science]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[Devops]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Bhavani Ravi]]></dc:creator><pubDate>Wed, 08 Mar 2023 05:47:42 GMT</pubDate><content:encoded><![CDATA[<p>The Apache Software Foundation, entering its 22nd year, has paved the way for tonnes of data tools that set state-of-the-art methods, models, and guidelines for data engineers and DevOps practitioners to think about their data infrastructure. The goal of this post is to highlight some of the famous tools and compare and contrast them.</p>
<h2 id="heading-workflow-orchestration">Workflow Orchestration</h2>
<h3 id="heading-apache-airflow">Apache Airflow</h3>
<p>Apache Airflow is a platform to programmatically author, schedule, and monitor workflows in Python. It is used for managing data pipelines and can integrate with other tools like Spark, Hive, and Kudu.</p>
<p>The open-ended nature of the tool is great for small ETL jobs. The extensive <a target="_blank" href="https://airflow.apache.org/docs/#providers-packagesdocsapache-airflow-providersindexhtml">providers support</a> gives hooks to all major platforms and data tools.</p>
<p>We cannot talk about Airflow without mentioning its community support. The Slack channel is always buzzing with questions and people helping each other out. The documentation gets better every day, there are 9K+ questions on Stack Overflow, and the tool is forever improving. Typical use cases include:</p>
<ul>
<li><p>Workflow management</p>
</li>
<li><p>Small ETL jobs</p>
</li>
<li><p>One-off/schedule-based tasks</p>
</li>
<li><p>Backups</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-meta">@dag(</span>
    start_date=pendulum.datetime(<span class="hljs-number">2021</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, tz=<span class="hljs-string">"UTC"</span>),
    catchup=<span class="hljs-literal">False</span>
)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">example_dag_decorator</span>(<span class="hljs-params">email: str = <span class="hljs-string">"example@example.com"</span></span>):</span>
    get_ip = GetRequestOperator(task_id=<span class="hljs-string">"get_ip"</span>, url=<span class="hljs-string">"http://httpbin.org/get"</span>)

<span class="hljs-meta">    @task(multiple_outputs=True)</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">send_email</span>(<span class="hljs-params">raw_json: dict[str, Any]</span>) -&gt; dict[str, str]:</span>
        <span class="hljs-comment"># write logic here</span>
        <span class="hljs-keyword">pass</span>

    email_info = send_email(get_ip.output)

example_dag = example_dag_decorator()
</code></pre>
<h3 id="heading-apache-oozie">Apache Oozie</h3>
<p>Apache Oozie is a workflow orchestrator similar to Airflow written in Java and particularly meant to manage Hadoop jobs and support MapReduce, Pig, and Hive.</p>
<p>In the workflow automation market, Apache Airflow holds a 33.72% share, compared to Apache Oozie’s 5.93%.</p>
<h3 id="heading-apache-luigi">Apache Luigi</h3>
<p>Luigi, developed by Spotify (not an Apache Software Foundation project, but often grouped with these tools), is a Python-based workflow management system used for scheduling and managing multi-node data pipelines. It is simple to use and is suited for small to medium-sized data workflows.</p>
<p>Though by definition Luigi sounds like Airflow, Luigi is based on pipelines of tasks that share input and output information and is target-based. Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.</p>
<p>Workflows in Luigi are made of Tasks and Targets. Tasks are classes with well-defined input, processing and output. Targets are data sources or destinations.</p>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">NewTask</span>(<span class="hljs-params">luigi.Task</span>):</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">output</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">return</span> luigi.LocalTarget(<span class="hljs-string">"data/output_two.txt"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">requires</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">return</span> OldTask()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">with</span> self.input().open(<span class="hljs-string">"r"</span>) <span class="hljs-keyword">as</span> input_file:
            line = input_file.read()

        <span class="hljs-keyword">with</span> self.output().open(<span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> output_file:
            decorated_line = <span class="hljs-string">"My "</span>+line
            output_file.write(decorated_line)
</code></pre>
<h2 id="heading-data-processinganalytics">Data-Processing/Analytics</h2>
<h3 id="heading-apache-spark">Apache Spark</h3>
<p>Imagine loading 6 lakh (600,000) records into a spreadsheet or a pandas DataFrame to perform some transformation. Impossible, unless you're fine with your system giving up on you. We need a tool that can smartly chunk the data and process the chunks independently.</p>
<p>Apache Spark is a fast and general-purpose engine for large-scale data processing. It is designed to process data in parallel and can handle batch, streaming, and machine-learning workloads. It supports multiple languages, including Python, Java, Scala, and R.</p>
<p>The main advantage of Spark is that it provides a <em>resilient distributed dataset</em> (RDD), loosely analogous to a pandas DataFrame: a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.</p>
<p>Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources.</p>
<pre><code class="lang-python">url = \
  <span class="hljs-string">"jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"</span>
df = sqlContext \
  .read \
  .format(<span class="hljs-string">"jdbc"</span>) \
  .option(<span class="hljs-string">"url"</span>, url) \
  .option(<span class="hljs-string">"dbtable"</span>, <span class="hljs-string">"people"</span>) \
  .load()

<span class="hljs-comment"># Looks the schema of this DataFrame.</span>
df.printSchema()
countsByAge = df.groupBy(<span class="hljs-string">"age"</span>).count()

countsByAge.write.format(<span class="hljs-string">"json"</span>).save(<span class="hljs-string">"s3a://..."</span>)
</code></pre>
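<p>The snippet above shows the batch-style DataFrame API; the streaming side looks similar. Here is a minimal sketch using Structured Streaming (the newer streaming API that supersedes the original DStream-based Spark Streaming). It assumes a local Spark 3.x installation and a socket source on <code>localhost:9999</code>, e.g. started with <code>nc -lk 9999</code>; host and port are illustrative.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read lines from a TCP socket as an unbounded (streaming) DataFrame
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Split lines into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

query = (
    word_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
</code></pre>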
<h3 id="heading-apache-beam">Apache Beam</h3>
<p>Apache Beam is a streaming ETL tool that can handle batch processing as a special case of stream processing.</p>
<p>Beam, like every other orchestration tool, is built around pipelines. What makes it different is that each node in the pipeline is either a PCollection (data) node or a PTransform (function) node. Pipelines work equally well for batch and stream-based data processing.</p>
<p>The best part about Apache Beam is its multi-language support: Java, Python, Go, and Scala.</p>
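<p>To make the PCollection/PTransform split concrete, here is a minimal sketch using the Beam Python SDK and the default DirectRunner (the word-count logic is purely illustrative). Each <code>&gt;&gt;</code> step applies a PTransform, and the data handed between steps is a PCollection.</p>
<pre><code class="lang-python">import apache_beam as beam

with beam.Pipeline() as pipeline:  # DirectRunner by default
    (
        pipeline
        | "Create" &gt;&gt; beam.Create(["apache beam", "apache spark", "apache flink"])
        | "SplitWords" &gt;&gt; beam.FlatMap(str.split)
        | "PairWithOne" &gt;&gt; beam.Map(lambda word: (word, 1))
        | "CountPerWord" &gt;&gt; beam.CombinePerKey(sum)
        | "Print" &gt;&gt; beam.Map(print)
    )
</code></pre>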
<h3 id="heading-apache-flink">Apache Flink</h3>
<p>Apache Flink is an open-source, distributed stream processing framework for real-time, event-driven data processing. It supports batch and stream processing and is designed for scalability, fault tolerance, and low latency.</p>
<p>Flink is great for real-time reporting, fraud detection, and stream analytics.</p>
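<p>A rough PyFlink sketch (assuming the <code>apache-flink</code> Python package; the events and keys are made up) of what the DataStream API flavour of event-driven processing looks like:</p>
<pre><code class="lang-python">from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A tiny bounded stream standing in for a real source such as Kafka
events = env.from_collection([("user_a", 1), ("user_b", 2), ("user_a", 3)])

# Key by user and keep a running sum per key
totals = events.key_by(lambda event: event[0]).reduce(
    lambda a, b: (a[0], a[1] + b[1])
)

totals.print()
env.execute("per_user_totals")
</code></pre>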
<h3 id="heading-apache-storm">Apache Storm</h3>
<p>Apache Storm is a free and open-source distributed real-time computation system. It is great for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.</p>
<p>Apache Storm is fast: a benchmark clocked it at over <strong>a million tuples processed per second per node</strong>. It is scalable and fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.</p>
<p>Apache Storm integrates with the queueing and database technologies you already use. Storm performs real-time computation using topologies: a topology is submitted to a cluster, where the master node distributes the code among worker nodes that execute it. Within a topology, spouts emit data streams as immutable sets of key-value pairs, which are then processed by bolts.</p>
<h3 id="heading-apache-samza">Apache Samza</h3>
<p>Apache Samza, developed by LinkedIn, is a distributed stream processing framework designed for real-time event processing. It supports stateful stream processing and integrates with Apache Kafka for its messaging layer.</p>
<p>While Apache Beam lets us define the pipeline, Samza can be the actual executor (runner) of those transformations.</p>
<p>Samza is a newer ecosystem compared to Storm and is specifically designed to work well with Apache Kafka. Samza processes messages one at a time as they come in. Streams are divided into partitions, ordered sequences in which each message has a unique ID. It supports batching and is typically used with Hadoop's YARN and Apache Kafka.</p>
<h3 id="heading-apache-kudu">Apache Kudu</h3>
<p>Apache Kudu is an open-source storage system for fast analytics on fast data. It supports low latency, random access, and efficient compression and is suitable for use cases such as online serving, real-time analytics, and offline batch processing.</p>
<h2 id="heading-visualization">Visualization</h2>
<h3 id="heading-apache-superset">Apache Superset</h3>
<p>Apache Superset is a modern, enterprise-ready business intelligence web application. It is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data.</p>
<h2 id="heading-messagingqueuing">Messaging/Queuing</h2>
<h3 id="heading-apache-kafka">Apache Kafka</h3>
<p>The moment you hear stream processing in the data world you cannot ignore Apache Kafka. The event streaming platform is used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.</p>
<p>Apache Kafka is scalable, durable, reliable, and highly performant, which makes it a perfect tool of choice for streaming large volumes of data.</p>
<p>Common use cases for Kafka include:</p>
<ul>
<li><p>Internet of Things and stream processing</p>
</li>
<li><p>Log aggregation</p>
</li>
<li><p>Large-scale messaging</p>
</li>
<li><p>Customer activity tracking</p>
</li>
<li><p>Operational alerting and reporting</p>
</li>
</ul>
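<p>A minimal producer/consumer sketch using the <code>kafka-python</code> client (one of several Python clients; the broker address and topic name are assumptions for a local setup):</p>
<pre><code class="lang-python">from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a user-activity event to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-activity", key=b"user_a", value=b'{"event": "page_view"}')
producer.flush()

# Consumer: read the same topic from the beginning
consumer = KafkaConsumer(
    "user-activity",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of silence
)
for message in consumer:
    print(message.key, message.value)
</code></pre>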
<h2 id="heading-storage">Storage</h2>
<h3 id="heading-apache-hive">Apache Hive</h3>
<p>Apache Hive is a data warehousing and SQL-like query language for big data. It provides an interface to analyze structured data stored in the Hadoop file system (HDFS) using SQL-like syntax.</p>
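<p>A quick sketch of what querying Hive-managed data looks like from PySpark (assuming a Spark build with Hive support and an existing <code>web_logs</code> table; the table and columns are made up):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("HiveQueryExample")
    .enableHiveSupport()  # reads table metadata from the Hive metastore
    .getOrCreate()
)

daily_hits = spark.sql(
    "SELECT event_date, COUNT(*) AS hits "
    "FROM web_logs GROUP BY event_date ORDER BY event_date"
)
daily_hits.show()
</code></pre>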
<h3 id="heading-apache-parquet">Apache Parquet</h3>
<p>Apache Parquet is a columnar storage format for Hadoop that provides efficient data compression and encoding algorithms. It is optimized for interactive and batch analytics and supports advanced features like predicate pushdown.</p>
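<p>Writing and reading Parquet is a one-liner from pandas (assuming <code>pandas</code> and <code>pyarrow</code> are installed; the file and columns are illustrative). Because the format is columnar, you can read back only the columns you need:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "country": ["IN", "US", "DE"]})
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Only the requested column is read from disk
countries = pd.read_parquet("users.parquet", columns=["country"])
print(countries)
</code></pre>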
<h3 id="heading-apache-iceberg">Apache Iceberg</h3>
<p>Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.</p>
<p>As of December 2022, these tables can also be used by BigQuery for large-scale data analytics.</p>
<hr />
<p>Yes, Apache has an overwhelming number of tools to pick from for your data infrastructure. Choose the ones that fit you the best based on use case, community support, maintenance and the skill level of your team.</p>
<p>Let's discuss your data infrastructure needs. <a target="_blank" href="https://zcal.co/bhavaniravi/airflow-starterpack"><strong><em>Schedule a free call today</em></strong></a></p>
]]></content:encoded></item><item><title><![CDATA[ETL vs. ELT Data Pipelines]]></title><description><![CDATA[ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are both data integration processes used to move and transform data from one system to another.
The main difference between ETL and ELT is the order in which the data is processed.
In ...]]></description><link>https://dataanddevops.com/etl-vs-elt-data-pipelines</link><guid isPermaLink="true">https://dataanddevops.com/etl-vs-elt-data-pipelines</guid><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Devops]]></category><category><![CDATA[mlops]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Bhavani Ravi]]></dc:creator><pubDate>Fri, 27 Jan 2023 00:33:37 GMT</pubDate><content:encoded><![CDATA[<p>ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are both data integration processes used to move and transform data from one system to another.</p>
<p>The main difference between ETL and ELT is the order in which the data is processed.</p>
<p>In ETL, data is extracted from one or more sources, transformed to fit the target system's schema, and then loaded into the target system.</p>
<p>In ELT, data is extracted from one or more sources and loaded into the target system before it is transformed to fit the target system's schema.</p>
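<p>A small, purely illustrative sketch of the two orders of operations, using pandas and SQLite as stand-ins for a source and a target warehouse (file, table, and column names are hypothetical):</p>
<pre><code class="lang-python">import sqlite3

import pandas as pd

def etl(csv_path: str, conn: sqlite3.Connection) -&gt; None:
    """ETL: transform in the pipeline *before* loading into the target."""
    df = pd.read_csv(csv_path)                      # Extract
    df["amount"] = df["amount"].fillna(0).round(2)  # Transform
    df.to_sql("orders_clean", conn, if_exists="replace", index=False)  # Load

def elt(csv_path: str, conn: sqlite3.Connection) -&gt; None:
    """ELT: load raw data first, then transform inside the target system."""
    raw = pd.read_csv(csv_path)                                        # Extract
    raw.to_sql("orders_raw", conn, if_exists="replace", index=False)   # Load
    conn.execute(
        "CREATE TABLE orders_transformed AS "
        "SELECT *, ROUND(COALESCE(amount, 0), 2) AS amount_clean "
        "FROM orders_raw"
    )  # Transform, using the target's own SQL engine
</code></pre>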
<h2 id="heading-when-to-use-etl-vs-elt">When to use ETL vs. ELT?</h2>
<p>It's worth noting that ETL and ELT are not mutually exclusive and can be used together in a hybrid approach.</p>
<p>A hybrid approach to ETL and ELT can be implemented in a few different ways, depending on the specific requirements of the organization and the data integration project.</p>
<h3 id="heading-real-time-vs-batch-processing">Real-time vs. batch processing</h3>
<p>If real-time data processing is a priority, ELT may be a better option as it allows for near-instant data loading and processing. ETL, on the other hand, is better suited for batch processing of data on a schedule.</p>
<h3 id="heading-data-volume-and-complexity">Data volume and complexity</h3>
<p>ELT is well suited for handling large volumes of data, such as data from social media or IoT devices, and can handle data in various formats and structures. ETL is better suited for handling structured data from traditional sources, such as relational databases.</p>
<h3 id="heading-data-transformation">Data transformation</h3>
<p>ELT is designed to handle simple data transformations, such as data cleaning and validation, while ETL is designed to handle more complex data transformations, such as data mapping and aggregation.</p>
<h3 id="heading-data-quality-and-governance">Data quality and governance</h3>
<p>ETL is typically used when data quality and governance is a top priority as it allows for the preprocessing of data before it is loaded into the target system. ELT, on the other hand, may not have the same level of data quality and governance capabilities.</p>
<h3 id="heading-cost-and-resources">Cost and resources</h3>
<p>ELT is generally less expensive and requires fewer resources than ETL. ELT uses the processing power of the target system to perform data transformations, while ETL requires a separate system for data processing.</p>
<h3 id="heading-hybrid-approach">Hybrid approach</h3>
<ol>
<li><p><strong>ELT for real-time data and ETL for historical data</strong>: In this approach, ELT is used to handle real-time data streams and load them into the target system in near real time. ETL is then used to process historical data in batches and load it into the target system on a schedule. This allows organizations to take advantage of the real-time capabilities of ELT while still maintaining a historical record of their data.</p>
</li>
<li><p><strong>ELT for simple transformations and ETL for complex transformations</strong>: In this approach, ELT is used to handle simple data transformations, such as data cleaning and validation, while ETL is used to handle more complex data transformations, such as data mapping and aggregation. This allows organizations to take advantage of the performance benefits of ELT while still being able to handle more complex data transformation needs.</p>
</li>
<li><p><strong>ELT for big data and ETL for structured data</strong>: In this approach, ELT is used to handle big data, such as data from social media or IoT devices, and load it into the target system. ETL is then used to handle structured data, such as data from a relational database, and load it into the target system. This allows organizations to take advantage of the scalability and flexibility of ELT for big data while still being able to handle structured data more traditionally.</p>
</li>
</ol>
<h2 id="heading-tools">Tools</h2>
<h4 id="heading-etl">ETL</h4>
<ul>
<li><p>Informatica PowerCenter</p>
</li>
<li><p>IBM DataStage</p>
</li>
<li><p>Microsoft SQL Server Integration Services (SSIS)</p>
</li>
</ul>
<h4 id="heading-elt">ELT</h4>
<ul>
<li><p>Apache NiFi</p>
</li>
<li><p>Apache Kafka</p>
</li>
<li><p>AWS Glue</p>
</li>
</ul>
<p>The tools mentioned here are pure ELT or ETL tools. You can also use general-purpose tools like Airflow/Prefect/Spark to achieve ELT/ETL operations.</p>
]]></content:encoded></item><item><title><![CDATA[Apache Airflow Bad vs. Best Practices In Production - 2023]]></title><description><![CDATA[Apache Airflow - The famous Opensource Pythonic general-purpose data orchestration tool. It lets you do a variety of things. The open-ended nature of the tool gives room for a variety of customization.
While this is a good thing, there are no bounds ...]]></description><link>https://dataanddevops.com/apache-airflow-bad-vs-best-practices-in-production-2023</link><guid isPermaLink="true">https://dataanddevops.com/apache-airflow-bad-vs-best-practices-in-production-2023</guid><category><![CDATA[Data Science]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Bhavani Ravi]]></dc:creator><pubDate>Mon, 23 Jan 2023 06:41:49 GMT</pubDate><content:encoded><![CDATA[<p><strong>Apache Airflow</strong> - The famous Opensource Pythonic general-purpose data orchestration tool. It lets you do a variety of things. The open-ended nature of the tool gives room for a variety of customization.</p>
<p>While this is a good thing, there are no bounds on how the system can or cannot be used, resulting in a lot of time wasted on scaling, testing, and debugging when things aren't set up properly.</p>
<p><em>You can also use this post to convince your management</em> <strong><em>why it is taking so much time</em></strong> <em>and why you have to set standards and ground rules for your data projects. Jarek approved and recognized this blog post in Airflow's Slack channel.</em></p>
<p><img src="https://pbs.twimg.com/media/FnN1dYWaEAUNq1j?format=png&amp;name=900x900" alt="Airflow Contributor Jarek Approved" class="image--center mx-auto" /></p>
<h2 id="heading-the-dev-amp-ops-team"><strong>The Dev &amp; Ops Team</strong></h2>
<p>Using Airflow for your Data team means you need a good mix of Python and DevOps skills.</p>
<p><strong><em>The Python Devs</em></strong> would take care of pipelining and debugging the code that runs on the pipelines whereas <strong><em>the Ops team</em></strong> ensures the infrastructure stays intact, is easy to use and debug as per need.</p>
<p>Badly written pipeline code can be resource-hungry, and the Ops team will have no control over it. Similarly, a resource-constrained, breaking infrastructure doesn't give DAG authors a smooth development experience.</p>
<p>It is important for these teams to work hand in hand but also to know where their lines of responsibility lie.</p>
<h2 id="heading-infrastructure"><strong>Infrastructure</strong></h2>
<h3 id="heading-iac"><strong>IaC</strong></h3>
<p><em>You will always spin up a new Airflow environment</em>. By default, any data team will need at least 3 Airflow environments - Dev, Test, and Prod. Factoring in the teams and projects, you are going to need more environments. If not today, you might need them 6 months down the line.</p>
<blockquote>
<p><strong><em>Code is the ultimate truth.</em></strong></p>
</blockquote>
<p>The most automation I've seen is bash scripts stitched together. That only gives you info about Airflow configs, not the infra configs. EC2 or EKS? Load balancer? DNS mapping?</p>
<p>There is a good chance that the DevOps engineer who set this up might move companies or to a different project. All that the teams are left with are a bunch of bash scripts and commands to run.</p>
<p>Instead of investing time in documenting the commands, you can spend that time automating them with Terraform or similar IaC tools.</p>
<h3 id="heading-cicd-pipeline"><strong>CI/CD Pipeline</strong></h3>
<p>Often you will be working with multiple DAG authors. It is important to set up a system that provides them with a quick turnaround time to test and deploy their features.</p>
<p>Environments where DAG authors do a git pull on the server directly aren't going to cut it. There are also cases where DAG authors fight over the package versions they want for their PythonOperator to run.</p>
<p>There are three ways to get your DAGs into the Airflow instance</p>
<ul>
<li><p><strong>Building them into the instance</strong></p>
</li>
<li><p><strong>Mounting from S3</strong></p>
</li>
<li><p><strong>Using gitsync</strong></p>
</li>
</ul>
<p>No matter which method you choose, the underlying idea is to have it update your Airflow instance when the developer is ready to do so.</p>
<h3 id="heading-not-customizing"><strong>Not Customizing</strong></h3>
<p>Airflow comes with a tonne of tools, tweaks and configs. One pattern I see commonly is following an infra guideline blindly and not customizing it to your needs.</p>
<p>Here are a few things to take into account</p>
<ul>
<li><p>How many environments would you need?</p>
</li>
<li><p>How many DAGs are going to run?</p>
</li>
<li><p>What infra tools work the best for our team?</p>
</li>
<li><p>What are our peak schedules, will our resource allocation withstand it?</p>
</li>
<li><p>What is it going to cost?</p>
</li>
</ul>
<p>Even if you are building a POC, keep all of these things in mind. Some teams are expected to go from POC to prod in a matter of weeks.</p>
<h3 id="heading-over-optimizing"><strong>Over Optimizing</strong></h3>
<ul>
<li><p>You don't need a complicated infrastructure unless you're running 100s of dags with 100s of tasks. Pick a pattern and scale on demand.</p>
</li>
<li><p>Kubernetes lets you scale the best. You can run 3 pods, one for each component, and scale them as you need.</p>
</li>
<li><p>But there are also cases where Airflow instances run happily on a single consolidated EC2 instance for years.</p>
</li>
<li><p>Pick your poison based on your current and projected needs.</p>
</li>
</ul>
<h2 id="heading-versions"><strong>Versions</strong></h2>
<h3 id="heading-using-old-airflow-version"><strong>Using old Airflow Version</strong></h3>
<p>Airflow made 12 releases in 2022 (one release per month). That's crazy, I know; it's hard to keep up. But if you are starting fresh:</p>
<ul>
<li><p>Start with the latest available version.</p>
</li>
<li><p>Keep in mind you might need to upgrade to a newer version sooner than you expect</p>
</li>
</ul>
<p><strong>Perks of using the latest airflow version</strong></p>
<ul>
<li><p>It gets you the support you require</p>
</li>
<li><p>Incorporate the bug fixes and new features that might fit your use case</p>
</li>
</ul>
<p><strong>What about new bugs?</strong></p>
<ul>
<li><p>That's a part of using opensource technologies. You get to jump in and help the community fix them.</p>
</li>
<li><p>But if you're cautious, maybe try 2 versions below the latest. That way, any show-stopping bugs will already have surfaced in GitHub issues.</p>
</li>
</ul>
<h3 id="heading-following-old-docsblogs"><strong>Following old Docs/Blogs</strong></h3>
<p>Airflow has been around for years now. There are a variety of guides, tools, and blogs out there. Some of them are so old that they don't do justice to the overhaul Airflow got in 2.0+.</p>
<p>One developer created a whole plugin following a blog, when what they should've used is a provider. Support for shipping operators and hooks as plugins went away after the 1.10.x versions. The result? 2 more weeks of development and testing.</p>
<p>Even if you're using the Airflow docs page, ensure you are on the right version. The same goes for providers. Providers are also constantly updated and optimized to support the latest features of the integrating system.</p>
<h3 id="heading-not-freezing-library-requirements"><strong>Not Freezing Library Requirements</strong></h3>
<p>Those pip libraries with no specific version pinned are time bombs waiting to go off when you need the environment up and running 6 months down the line. Always require developers to pin every requirement to a specific version.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1674151509827/34424b03-ebe4-4b62-9fd9-944fbfe08495.png" alt /></p>
<h2 id="heading-airflow-components">Airflow Components</h2>
<h3 id="heading-running-scheduler-as-a-part-of-the-webserver">Running scheduler as a part of the webserver</h3>
<p>The scheduler is the heart and brain of Airflow. The practice to run Scheduler as a part of the webserver was propagated by <a target="_blank" href="https://github.com/puckel/docker-airflow/blob/master/script/entrypoint.sh#L110">Puckel Docker images</a> and still sticks around in most code bases.</p>
<p>The script suggests you run both the webserver and the scheduler on the same machine for <code>LocalExecutor</code> and <code>SequentialExecutor</code>. This might still work well at a small scale, but as you add more DAGs, the scheduler will need more memory and computation to spin up your tasks, irrespective of the executor you use.</p>
<h3 id="heading-right-executor-for-the-job">Right Executor for the Job</h3>
<p>Airflow provides a variety of executors. Choosing the right one depends on a lot of factors: scale, team size, and DAG use case.</p>
<p>I have a separate post on this; check out <a target="_blank" href="https://hashnode.com/post/cld1twgtz000408l48nqo083b">Pros and cons of using different Airflow executors</a>.</p>
<h3 id="heading-database">Database</h3>
<ul>
<li><p><strong>Using sqlite.db</strong>. It's not production-friendly. The first thing I do with any Airflow installation is to change it to Postgres (or MySQL).</p>
</li>
<li><p><strong>Running the database in a Docker container or Kubernetes pod</strong> is not advised. Tutorials do it to get things up quickly and give the reader consistent results. A quick alternative is to set up RDS or another managed database and use it. Services like Supabase give you a readily available Postgres.</p>
</li>
</ul>
<h3 id="heading-not-persisting-logging">Not Persisting Logging</h3>
<p>You need to track two kinds of logs for Airflow: the logs of the Airflow components and the logs from running your pipeline code. Not having access to either of them will create problems while debugging down the line.</p>
<p><em>"Our nightly run failed and we have no idea why"</em> is how it starts; on investigating, we find that the scheduler pod responsible for those spin-ups was evicted along with all the logs associated with it.</p>
<h3 id="heading-large-data-processing">Large-Data processing</h3>
<blockquote>
<p>Apache Airflow being the famous Opensource general-purpose data orchestration tool, lets you do a variety of things and gives room for lots of customization.</p>
</blockquote>
<p>Yes, it is general-purpose. Yes, you can do a variety of things. But processing GBs or TBs of data is not Airflow's game. Often teams use Pandas to write small transformations or run queries, but that practice can propagate to large datasets as well.</p>
<p>The workaround:</p>
<ul>
<li><p>Use the right tool for the job, like Spark or Hadoop</p>
</li>
<li><p>Remember Airflow is not an ETL tool, it is a Trigger -&gt; Track -&gt; Notify tool</p>
</li>
</ul>
<h3 id="heading-top-level-python-code">Top-Level Python Code</h3>
<p>Airflow DAGs are Python-first; you can write any kind of Python code to generate DAGs, which invites patterns like:</p>
<ul>
<li><p>Reading a DB or a large JSON file to load the config</p>
</li>
<li><p>Large operations outside the context of Airflow dags</p>
</li>
</ul>
<p>While these may look like fair things to do, they pop up as huge blockers as the infrastructure scales.</p>
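<p>A minimal sketch of the difference (with a hypothetical config path, assuming Airflow 2.4+): module-level work runs on every DAG-file parse, while work inside a task runs only when the task executes.</p>
<pre><code class="lang-python">import json

import pendulum
from airflow.decorators import dag, task

# Bad: this would run on every DAG-file parse, even when nothing is executing
# config = json.load(open("/path/to/large_config.json"))

@dag(start_date=pendulum.datetime(2023, 1, 1, tz="UTC"), schedule=None, catchup=False)
def config_driven_dag():
    @task
    def load_config() -&gt; dict:
        # Good: the heavy read happens only when the task actually runs
        with open("/path/to/large_config.json") as f:
            return json.load(f)

    @task
    def process(config: dict):
        print(f"Processing {len(config)} config entries")

    process(load_config())

dag_instance = config_driven_dag()
</code></pre>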
<hr />
]]></content:encoded></item><item><title><![CDATA[Apache Airflow, Which Executor to use in Production?]]></title><description><![CDATA[Celery Executor
Celery is used for running distributed asynchronous python tasks.
Hence, Celery Executor has been a part of Airflow for a long time, even before Kubernetes. With Celery Executors, you must set a specific number of worker instances.
Pr...]]></description><link>https://dataanddevops.com/apache-airflow-which-executor-to-use-in-production</link><guid isPermaLink="true">https://dataanddevops.com/apache-airflow-which-executor-to-use-in-production</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><category><![CDATA[Redis]]></category><category><![CDATA[airflow]]></category><dc:creator><![CDATA[Bhavani Ravi]]></dc:creator><pubDate>Wed, 18 Jan 2023 15:38:26 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-celery-executor">Celery Executor</h2>
<p>Celery is used for running distributed asynchronous Python tasks.</p>
<p>Hence, the Celery Executor has been a part of Airflow for a long time, even before Kubernetes. With the Celery Executor, you must set a specific number of worker instances.</p>
<h4 id="heading-pros"><strong>Pros</strong></h4>
<ol>
<li><p>In Airflow, you can specify the number of tasks that can run in a given worker. It is a good idea if you have a predictable number of tasks to run on a given worker.</p>
</li>
<li><p>Celery manages the workers. In case of a failure, Celery spins up a new one.</p>
</li>
</ol>
<h4 id="heading-cons"><strong>Cons</strong></h4>
<ol>
<li><p>Celery needs RabbitMQ/Redis for queuing the task, an added dependency.</p>
</li>
<li><p>Multiple tasks run on the same worker, which means one task can clog all the resources available for another.</p>
</li>
<li><p>Running multiple workers all the time might lead to wasting resources when there isn't much to process.</p>
</li>
</ol>
<h2 id="heading-kubernetes-executor">Kubernetes Executor</h2>
<p>KubernetesExecutor is where Airflow spins up a new pod to run an Airflow task.</p>
<h4 id="heading-pros-1">Pros</h4>
<p>Unlike the Celery executor, you don't have a bunch of workers always running. The KubernetesExecutor is on-demand, thereby reducing cost.</p>
<p>KubernetesExecutor lets you specify the resources required for each task, giving you more control.</p>
<h4 id="heading-cons-1">Cons</h4>
<p>One downside of the Kubernetes executor is the time it takes to spin up a pod, but compared to the advantages, that overhead is close to negligible.</p>
<p>Setting up the infrastructure can be complicated if you don't have the Kubernetes skillset in your team.</p>
<h2 id="heading-kubernetes-celery-executor">Kubernetes Celery Executor</h2>
<p>KubernetesCeleryExecutor brings the best of both Celery and Kubernetes worlds and also the worst. It is a good idea to use them only when they are necessary.</p>
<ol>
<li><p>You have a few resource-hungry tasks that need high resources and isolation so that they don't clog other tasks.</p>
</li>
<li><p>You have a mixture of peak-time tasks with longer queues that can be run using Kubernetes, and other predictable tasks with predictable resource needs that Celery can handle, as in the sketch after this list.</p>
</li>
</ol>
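<p>As a rough sketch of how that mixture looks in practice (assuming Airflow 2.4+ with <code>executor = CeleryKubernetesExecutor</code> in <code>airflow.cfg</code>): tasks go to Celery by default, and tasks whose <code>queue</code> matches the configured Kubernetes queue (default <code>kubernetes</code>) are run by the Kubernetes executor instead. DAG and task names here are made up.</p>
<pre><code class="lang-python">import pendulum
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

@dag(start_date=pendulum.datetime(2023, 1, 1, tz="UTC"), schedule=None, catchup=False)
def mixed_executor_dag():
    light_task = BashOperator(
        task_id="light_task",
        bash_command="echo 'runs on an always-on Celery worker'",
    )
    heavy_task = BashOperator(
        task_id="heavy_task",
        bash_command="echo 'runs in its own Kubernetes pod'",
        queue="kubernetes",  # routes this task to the Kubernetes side
    )
    light_task &gt;&gt; heavy_task

dag_instance = mixed_executor_dag()
</code></pre>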
<p>In this post, you have seen how to utilize different Airflow executors to improve your tasks' performance while simultaneously optimizing the costs.</p>
<hr />
<p>Got Airflow issues? I would be happy to assist you.</p>
<p><a target="_blank" href="https://zcal.co/bhavaniravi/airflow-starterpack"><strong>Schedule a free Discovery call today</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[How to Setup and Run Apache Airflow Locally?]]></title><description><![CDATA[tl;dr get the bash script


Have Python installed in your system, 3.8+

Create a folder


mkdir -p "/Users/$(whoami)/projects/airflow-local"
export AIRFLOW_HOME="/Users/$(whoami)/projects/airflow-local"


cd airflow-local

Set airflow version AIRFLOW...]]></description><link>https://dataanddevops.com/how-to-setup-and-run-apache-airflow-locally</link><guid isPermaLink="true">https://dataanddevops.com/how-to-setup-and-run-apache-airflow-locally</guid><category><![CDATA[airflow]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Devops]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Bhavani Ravi]]></dc:creator><pubDate>Tue, 17 Jan 2023 14:53:34 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p>tl;dr <a target="_blank" href="https://gist.github.com/bhavaniravi/bca59c263fff1e9e2924326c0139c421#file-airflow_local_setup-bash">get the bash script</a></p>
</blockquote>
<ol>
<li><p>Have Python installed in your system, 3.8+</p>
</li>
<li><p>Create a folder</p>
</li>
</ol>
<pre><code class="lang-bash">mkdir -p <span class="hljs-string">"/Users/<span class="hljs-subst">$(whoami)</span>/projects/airflow-local"</span>
<span class="hljs-built_in">export</span> AIRFLOW_HOME=<span class="hljs-string">"/Users/<span class="hljs-subst">$(whoami)</span>/projects/airflow-local"</span>
</code></pre>
<ol>
<li><p><code>cd airflow-local</code></p>
</li>
<li><p>Set airflow version <code>AIRFLOW_VERSION=2.4.3</code></p>
</li>
<li><p>Set Python version <code>PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"</code></p>
</li>
<li><p>Set constraints version <code>CONSTRAINT_URL="</code><a target="_blank" href="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"><code>https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt</code></a><code>"</code></p>
</li>
<li><p>Create virtualenv and install airflow</p>
</li>
</ol>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> <span class="hljs-string">"/Users/<span class="hljs-subst">$(whoami)</span>/projects/airflow-local"</span>
python -m venv venv
<span class="hljs-built_in">source</span> venv/bin/activate
pip install <span class="hljs-string">"apache-airflow==<span class="hljs-variable">${AIRFLOW_VERSION}</span>"</span> --constraint <span class="hljs-string">"<span class="hljs-variable">${CONSTRAINT_URL}</span>"</span>
</code></pre>
<ol>
<li><p><code>mkdir -p "${AIRFLOW_HOME}/dags"</code></p>
</li>
<li><p>Run airflow with <code>airflow standalone</code></p>
</li>
</ol>
<p>This will run Airflow, creating a SQLite DB. This is not really production-friendly, but you can use it to run basic pipelines and playgrounds. We can also make our local environment more production-friendly; we will see that in a bit.</p>
<h2 id="heading-the-airflow-project">The Airflow Project</h2>
<ol>
<li><p><code>standalone_admin_password.txt</code> has the password for your local Airflow instance; the username is <code>admin</code></p>
</li>
<li><p>the <code>dags</code> folder is where you add your DAGs, a.k.a. pipelines</p>
</li>
<li><p>the <code>logs</code> folder will have your logs</p>
</li>
</ol>
<h2 id="heading-using-localexecutor">Using LocalExecutor</h2>
<p>Airflow, by default, uses <code>SequentialExecutor</code>, which is not great for production-level systems. When exploring Airflow, it is a good idea to use <code>LocalExecutor</code> along with Postgres or MySQL right away.</p>
<p>Stop the Airflow instance and update <code>airflow.cfg</code> with the following configs. Replace <code>sql_alchemy_conn</code> with your DB credentials.</p>
<pre><code class="lang-apache">[<span class="hljs-attribute">core</span>]
<span class="hljs-attribute">load_examples</span> = False
<span class="hljs-attribute">executor</span>=LocalExecutor

[<span class="hljs-attribute">database</span>]
<span class="hljs-attribute">sql_alchemy_conn</span> = postgresql://&lt;pg-user&gt;:&lt;pg-password&gt;@&lt;host&gt;:&lt;port:<span class="hljs-number">5423</span>&gt;/&lt;db-name&gt;
</code></pre>
<p>If you use Postgres, you will need the <code>psycopg2</code> library, which you can install using the following command</p>
<p><code>pip install "apache-airflow[postgres]"</code></p>
<h2 id="heading-sample-dag">Sample Dag</h2>
<p>Copy the example dag from the <a target="_blank" href="https://github.com/apache/airflow/blob/2.4.3/airflow/example_dags/example_bash_operator.py">Airflow repo</a> and place it under the dags folder</p>
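<p>If you'd rather not copy from the repo, here is a minimal hand-written DAG you can drop into the <code>dags</code> folder instead (a sketch assuming Airflow 2.4.x; the file and task names are arbitrary):</p>
<pre><code class="lang-python"># Save as dags/hello_dag.py; it should show up in the UI shortly after a restart
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,  # trigger manually from the UI
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo 'Hello, Airflow!'")
</code></pre>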
<h2 id="heading-restart-airflow">Restart Airflow</h2>
<p><code>airflow standalone</code></p>
<hr />
<p>Airflow has a steep learning curve. I can help you adopt Airflow for your Data engineering pipeline and your team's ecosystem. <a target="_blank" href="https://zcal.co/bhavaniravi/airflow-starterpack">Schedule a free call today.</a></p>
]]></content:encoded></item></channel></rss>