Apache's Opensource World Of Data Tools

Apache is the opensource hub for various datatools. Let's explore them and understand which tool fits your usecase

Apache software foundation entering it's 22nd year has paved way for tonnes of data tools that sets state-of-the-art methods, models, and guidelines for Data engineers and DevOps practitioners to think about their Data infrastructure. The goal of this post is to highlight some of the famous tools and compare and contrast them

Workflow Orchestration

Apache Airflow

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows in Python. It is used for managing data pipelines and can integrate with other tools like Spark, Hive, and Kudu.

The open-ended nature of the tool is great for small ETL jobs. The extensive providers support gives hooks to all major platforms and data tools.

We cannot talk about Airflow without mentioning its community support. The slack channel is always buzzing with questions and people helping each other out. The documentation gets better every day. 9K+ questions of Stackoverflow and a tool that's forever improving.

  • Workflow management

  • Small ETL jobs

  • One-off/schedule-based tasks

  • Backups

@dag(
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False
)
def example_dag_decorator(email: str = "example@example.com"):
    get_ip = GetRequestOperator(task_id="get_ip", url="http://httpbin.org/get")

    @task(multiple_outputs=True)
    def send_email(raw_json: dict[str, Any]) -> dict[str, str]:
        # write logic here
        pass

    email_info = send_email(get_ip.output)

example_dag = example_dag_decorator()

Apache Oozie

Apache Oozie is a workflow orchestrator similar to Airflow written in Java and particularly meant to manage Hadoop jobs and support MapReduce, Pig, and Hive.

In Comparision with Airflow the Workflow Automation market, Apache Airflow has a 33.72% market share in comparison to Apache Oozie’s 5.93%

Apache Luigi

Apache Luigi developed by Spotify is a Python-based workflow management system that is used for scheduling and managing multi-node data pipelines. It is simple to use and is suited for small to medium-sized data workflows.

Though by definition Luigi sounds like Airflow, Luigi is based on pipelines of tasks that share input and output information and is target-based. Luigi does not come with built-in triggering, and you still need to rely on something like crontab to trigger workflows periodically.

Workflows in Luigi are made of Tasks and Targets. Tasks are classes with well-defined input, processing and output. Targets are data sources or destinations.


class NewTask(luigi.Task):

    def output(self):
        return luigi.LocalTarget("data/output_two.txt")

    def requires(self):
        return OldTask()

    def run(self):
        with self.input().open("r") as input_file:
            line = input_file.read()

        with self.output().open("w") as output_file:
            decorated_line = "My "+line
            output_file.write(decorated_line)

Data-Processing/Analytics

Apache Spark

6 Lakh records imagine loading that into a spreadsheet or pandas dataframe to perform some transformation. Impossible unless you're fine with your system giving up on you. We need a tool that can smartly chunk these data and process them independently

Apache Spark is a fast and general-purpose engine for large-scale data processing. It is designed to process data in parallel and can handle batch, streaming, and machine-learning workloads. It supports multiple languages like Python, Java, R.

The main advantage of Spark is that it provides a resilient distributed dataset (RDD)(More like a Pandas dataframe), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel

Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources

url = \
  "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword"
df = sqlContext \
  .read \
  .format("jdbc") \
  .option("url", url) \
  .option("dbtable", "people") \
  .load()

# Looks the schema of this DataFrame.
df.printSchema()
countsByAge = df.groupBy("age").count()

countsByAge.write.format("json").save("s3a://...")

Apache Beam

Apache beam is a streaming ETL tool that can handle batches as a special case of stream processing

A beam like every other orchestrator tool consists of Pipelines. But what makes it different is it each node in the pipeline can be a PCollection(Data) node or PTransform(function) node. They are also great for both batch and stream-based data processing.

The best part about Apache Beam is the multi-language support of Java, Python, Go, and Scala.

Apache Flink is an open-source, distributed stream processing framework for real-time, event-driven data processing. It supports batch and stream processing and is designed for scalability, fault tolerance, and low latency.

Flink is great for real-time reporting, fraud detection, stream analytics

Apache Storm

Apache Storm is a free and open-source distributed real-time computation system. is great for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant guarantees your data will be processed, and is easy to set up and operate.

Apache Storm integrates with the queueing and database technologies you already use. Apache Storm conducts real-time computation using topology and it gets feed into a cluster where the master node distributes the code among worker nodes that execute it. In topology data is passed in between spouts that spit out data streams as immutable sets of key-value pairs.

Apache Samza

Apache Samza developed by Linkedin is a distributed stream processing framework that is designed for real-time event processing. It supports stateful stream processing and integrates with Apache Kafka for its messaging layer.

While Apache Beam lets us define the workflow definition, Samza can be the actual executor of those transformations.

Samza is a newer ecosystem compared to Storm and is specifically designed to work well for Apache Kafka. Apache Samza streams by processing messages as they come in one at a time. The streams get divided into partitions that are ordered sequences where each has a unique ID. It supports batching and is typically used with Hadoop's YARN and Apache Kafka.

Apache Kudu

Apache Kudu is an open-source storage system for fast analytics on fast data. It supports low latency, random access, and efficient compression and is suitable for use cases such as online serving, real-time analytics, and offline batch processing.

Visualization

Apache Superset

Apache Superset is a modern, enterprise-ready business intelligence web application. It is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data.

Messaging/Queuing

Apache Kafka

The moment you hear stream processing in the data world you cannot ignore Apache Kafka. The event streaming platform is used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Apache Kafka is scalable, durable, reliable and high performant which makes it a perfect tool of choice for large volumes of data streaming.

Common usecases for Kafka includes

  • Internet of Things and stream processing

  • Log aggregation

  • Large-scale messaging

  • Customer activity tracking

  • Operational alerting and reporting

Storage

Apache Hive

Apache Hive is a data warehousing and SQL-like query language for big data. It provides an interface to analyze structured data stored in the Hadoop file system (HDFS) using SQL-like syntax.

Apache Parquet

Apache Parquet is a columnar storage format for Hadoop that provides efficient data compression and encoding algorithms. It is optimized for interactive and batch analytics and supports advanced features like predicate pushdown.

Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

As of Dec 2022, these tables can also be used by BigQuery for large data analytics


Yes, Apache has an overwhelming number of tools to pick from for your data infrastructure. Choose the ones that fit you the best based on use case, community support, maintenance and the skill level of your team.

Let's discuss your data infrastructure needs. Schedule a free call today

Did you find this article valuable?

Support Data and DevOps by becoming a sponsor. Any amount is appreciated!