Apache's Open-Source World of Data Tools
Apache is the open-source hub for a wide range of data tools. Let's explore them and understand which tool fits your use case.
The Apache Software Foundation, entering its 22nd year, has paved the way for a ton of data tools that set state-of-the-art methods, models, and guidelines for data engineers and DevOps practitioners thinking about their data infrastructure. The goal of this post is to highlight some of the best-known tools and compare and contrast them.
Workflow Orchestration
Apache Airflow
Apache Airflow is a platform to programmatically author, schedule, and monitor workflows in Python. It is used for managing data pipelines and can integrate with other tools like Spark, Hive, and Kudu.
The open-ended nature of the tool is great for small ETL jobs, and its extensive provider packages give you hooks into all the major platforms and data tools.
We cannot talk about Airflow without mentioning its community support. The Slack channel is always buzzing with questions and people helping each other out, the documentation gets better every day, and with 9K+ questions on Stack Overflow it's a tool that's forever improving. Airflow is a good fit for:
Workflow management
Small ETL jobs
One-off/schedule-based tasks
Backups
For example, a minimal DAG using the TaskFlow API (GetRequestOperator here stands in for a custom HTTP operator defined elsewhere, as in the Airflow docs example):

from typing import Any

import pendulum
from airflow.decorators import dag, task


@dag(
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)
def example_dag_decorator(email: str = "example@example.com"):
    get_ip = GetRequestOperator(task_id="get_ip", url="http://httpbin.org/get")

    @task(multiple_outputs=True)
    def send_email(raw_json: dict[str, Any]) -> dict[str, str]:
        # Extract the origin IP from the response and send it to `email`.
        pass

    email_info = send_email(get_ip.output)


example_dag = example_dag_decorator()
Apache Oozie
Apache Oozie is a workflow orchestrator similar to Airflow, written in Java and particularly meant to manage Hadoop jobs, with support for MapReduce, Pig, and Hive.
For comparison, in the workflow automation market Apache Airflow holds a 33.72% share versus Apache Oozie's 5.93%.
Luigi
Luigi, developed by Spotify (not an Apache project, but a frequent Airflow alternative), is a Python-based workflow management system used for scheduling and managing multi-node data pipelines. It is simple to use and suited for small to medium-sized data workflows.
Though by definition Luigi sounds like Airflow, Luigi is based on pipelines of tasks that share input and output information and is target-based. Luigi does not come with built-in triggering, so you still need to rely on something like crontab to trigger workflows periodically.
Workflows in Luigi are made of Tasks and Targets. Tasks are classes with well-defined input, processing, and output; Targets are data sources or destinations. For example:
import luigi


class NewTask(luigi.Task):
    def requires(self):
        # Upstream dependency; OldTask is assumed to be defined elsewhere.
        return OldTask()

    def output(self):
        return luigi.LocalTarget("data/output_two.txt")

    def run(self):
        # Read the upstream task's target, decorate the line, write our own target.
        with self.input().open("r") as input_file:
            line = input_file.read()
        with self.output().open("w") as output_file:
            decorated_line = "My " + line
            output_file.write(decorated_line)
Data-Processing/Analytics
Apache Spark
Imagine loading 600,000 records into a spreadsheet or a pandas DataFrame to perform some transformation. Impossible, unless you're fine with your system giving up on you. We need a tool that can smartly chunk that data and process the chunks independently.
Apache Spark is a fast and general-purpose engine for large-scale data processing. It is designed to process data in parallel and can handle batch, streaming, and machine-learning workloads. It supports multiple languages, including Python, Java, Scala, and R.
The main advantage of Spark is that it provides the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel (think of a DataFrame whose rows are spread across many machines).
Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources.
First, a basic batch example: reading a MySQL table over JDBC and writing aggregates to S3. Note that MySQL JDBC URLs separate query parameters with "&", not ";", and the MySQL JDBC driver must be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword"

df = (
    spark.read
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "people")
    .load()
)

# Look at the schema of this DataFrame.
df.printSchema()

countsByAge = df.groupBy("age").count()
countsByAge.write.format("json").save("s3a://...")
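On the streaming side, here is a minimal sketch using the newer Structured Streaming API; the socket source and console sink are illustrative stand-ins for real sources and sinks such as Kafka:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Read a stream of lines from a local socket (demo purposes only).
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Maintain a running count of each distinct line.
counts = lines.groupBy("value").count()

# Print the running totals to the console as new data arrives.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()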
Apache Beam
Apache Beam is a streaming ETL tool that handles batch as a special case of stream processing.
Beam, like every other orchestration tool, is built around pipelines. What makes it different is that each node in a pipeline is either a PCollection (data) or a PTransform (a function applied to that data), and the same model works equally well for batch and stream-based data processing.
The best part about Apache Beam is its multi-language support: Java, Python, and Go, with Scala available through Spotify's Scio library.
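To make the PCollection/PTransform idea concrete, here is a minimal sketch with the Beam Python SDK; the element values and output path are made up for illustration:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["apache", "beam", "pipeline"])  # PCollection of strings
        | "Upper" >> beam.Map(str.upper)                           # PTransform applied per element
        | "Write" >> beam.io.WriteToText("out.txt")                # sink transform
    )

The same pipeline can run on different runners (Direct, Flink, Spark, Dataflow) without changing the code, which is the point of Beam's portable model.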
Apache Flink
Apache Flink is an open-source, distributed stream processing framework for real-time, event-driven data processing. It supports batch and stream processing and is designed for scalability, fault tolerance, and low latency.
Flink is great for real-time reporting, fraud detection, and streaming analytics.
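As a hedged sketch, here is a tiny job with PyFlink's DataStream API, assuming the pyflink package is installed; the in-memory collection stands in for a real event source such as Kafka:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A small bounded collection substitutes for a real stream here.
ds = env.from_collection([1, 2, 3, 4, 5])

# Transform each event and print the results.
ds.map(lambda x: x * 2).print()

env.execute("flink_example")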
Apache Storm
Apache Storm is a free and open-source distributed real-time computation system. It is great for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable and fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Apache Storm integrates with the queueing and database technologies you already use. Computation in Storm is described as a topology, which is submitted to a cluster where the master node distributes the code among worker nodes that execute it. Within a topology, spouts emit data streams as immutable sets of key-value pairs, and downstream bolts consume and transform those streams.
Apache Samza
Apache Samza, developed by LinkedIn, is a distributed stream processing framework designed for real-time event processing. It supports stateful stream processing and integrates with Apache Kafka as its messaging layer.
While Apache Beam lets us define the workflow, Samza can be the actual executor of those transformations.
Samza is a newer ecosystem than Storm and is specifically designed to work well with Apache Kafka. It processes messages one at a time as they arrive; streams are divided into partitions, each an ordered sequence with a unique ID. It supports batching and is typically used with Hadoop's YARN and Apache Kafka.
Apache Kudu
Apache Kudu is an open-source storage system for fast analytics on fast data. It supports low-latency random access and efficient compression, and is suitable for use cases such as online serving, real-time analytics, and offline batch processing.
Visualization
Apache Superset
Apache Superset is a modern, enterprise-ready business intelligence web application. It is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data.
Messaging/Queuing
Apache Kafka
The moment you hear stream processing in the data world you cannot ignore Apache Kafka. The event streaming platform is used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Kafka is scalable, durable, reliable, and highly performant, which makes it a perfect tool of choice for streaming large volumes of data.
Common use cases for Kafka include (see the producer sketch after this list):
Internet of Things and stream processing
Log aggregation
Large-scale messaging
Customer activity tracking
Operational alerting and reporting
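Here is a minimal sketch of the producer side using the kafka-python client; the broker address, topic name, and payload are assumptions for illustration:

from kafka import KafkaProducer

# Connect to a local broker; adjust bootstrap_servers for your cluster.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish a message to a topic; Kafka persists and replicates it,
# and any number of consumer groups can read it independently.
producer.send("user-activity", b'{"user": 42, "action": "click"}')
producer.flush()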
Storage
Apache Hive
Apache Hive is a data warehouse system for big data that provides a SQL-like query language (HiveQL). It offers an interface to analyze structured data stored in the Hadoop file system (HDFS) using familiar SQL syntax.
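For a taste of how that looks from Python, here is a hedged sketch using the PyHive client; the host, port, and table name are assumptions:

from pyhive import hive

# Connect to HiveServer2; host and port are placeholders.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL reads like ordinary SQL even though it runs over HDFS data.
cursor.execute("SELECT age, COUNT(*) FROM people GROUP BY age")
for row in cursor.fetchall():
    print(row)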
Apache Parquet
Apache Parquet is a columnar storage format for Hadoop that provides efficient data compression and encoding algorithms. It is optimized for interactive and batch analytics and supports advanced features like predicate pushdown.
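A quick sketch with pandas and PyArrow shows why the columnar layout pays off; the file name and columns are made up for illustration:

import pandas as pd
import pyarrow.parquet as pq

# Write a small DataFrame to a Parquet file.
df = pd.DataFrame({"name": ["a", "b", "c"], "age": [23, 35, 41]})
df.to_parquet("people.parquet")

# Columnar read: load only the columns you actually need...
ages = pd.read_parquet("people.parquet", columns=["age"])

# ...and predicate pushdown: filter rows during the scan itself.
adults = pq.read_table("people.parquet", filters=[("age", ">", 30)])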
Apache Iceberg
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
As of December 2022, these tables can also be used from BigQuery for large-scale data analytics.
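Here is a hedged sketch of using Iceberg from Spark; the catalog name, namespace, and warehouse path are assumptions, and the session must be launched with the Iceberg runtime jar available:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-example")
    # Register a Hadoop-backed Iceberg catalog named "local".
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate()
)

# Iceberg tables are created and queried with plain SQL.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("SELECT * FROM local.db.events").show()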
Yes, Apache has an overwhelming number of tools to pick from for your data infrastructure. Choose the ones that fit you best based on use case, community support, maintenance, and the skill level of your team.
Let's discuss your data infrastructure needs. Schedule a free call today.