Apache Spark is currently one of the leading platforms for large-scale batch processing, machine learning, stream processing, and SQL workloads. Since it began as a research project in 2009, the platform has seen steady growth in adoption and community support.
Built by over 1200 developers, with contributions from 300+ organizations, and hosted by the vendor-neutral Apache Software Foundation, Apache Spark has one of the largest communities in the big data processing segment.
What is Apache Spark?
Apache Spark is a unified analytics engine for big data. Written in Scala, it is an open-source, distributed cluster-computing framework. Spark provides the ability to program an entire cluster, a network of distributed computers, with implicit data parallelism and fault tolerance.
Although Apache Spark is freely available as a standalone distribution, those interested in a comprehensive, managed enterprise service (usually on a pay-as-you-go basis) can get it with:
- Amazon EMR (Elastic MapReduce)
- Google Cloud Dataproc
- Microsoft Azure HDInsight
- Unified Data Analytics Platform from Databricks
The framework can rapidly process very large data sets. It distributes data processing tasks across a cluster, either on its own or in tandem with other distributed computing tools.
These qualities make Apache Spark not only an ideal tool for churning through big data but an excellent machine learning tool as well. Organizations typically employ it to crunch through enormous data sets and extract profitable insights, as well as to build robust applications that do the same.
The intuitive Spark API abstracts away the complexities of big data processing and distributed computing, so developers can use the framework for their own requirements without digging too deep into its mechanics.
Due to its general-purpose nature, the cluster-computing framework is employed by organizations across a diverse range of market segments. Apple, eBay, Facebook, IBM, Netflix, and Yahoo are some of the big names leveraging Apache Spark.
So, now that we’ve built a brief understanding of Apache Spark, it’s time to press forward and discuss the Apache Spark architecture.
The Apache Spark Architecture - What and How?
Any Apache Spark application is composed of two primary components:
- A Driver - Converts the user code into several tasks that are distributed among the various worker nodes.
- Executor Processes - Run on the worker nodes and execute tasks that are assigned to them.
Following the principle of distributed computing, Apache Spark relies on a core driver process that splits an application into several tasks and distributes them among several executor processes, which carry out the work.
A useful property of these executors is that they can be scaled up or down to match the requirements of the application to which they belong.
To keep the driver and the executors in sync, and to allocate worker nodes efficiently, a resource or cluster management system is required.
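The driver and executor split described above can be imitated in a few lines of plain Python. This is only a toy sketch, not Spark code: the thread pool stands in for the executor processes, and the names `split_into_tasks`, `run_task`, and `run_job` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(records, num_tasks):
    """Driver side: cut the input into roughly equal slices, one per task."""
    size = max(1, -(-len(records) // num_tasks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_task(partition):
    """Executor side: each task only sees its own slice of the data."""
    return sum(x * x for x in partition)


def run_job(records, num_executors=4):
    tasks = split_into_tasks(records, num_executors)
    # The thread pool plays the role of the executor processes (an assumption
    # of this sketch; real executors are separate JVM processes on worker nodes).
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, tasks))
    # The driver combines the partial results into the final answer.
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)  # sum of squares of 1..10 -> 385
```

Changing `num_executors` changes how many slices are processed side by side, which loosely mirrors scaling executors up or down.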
By default, Apache Spark runs in the standalone cluster mode. This requires every machine in the cluster to have the Apache Spark framework and a JVM installed. The standalone manager is basic, however, and more capable options are available.
This is why, in enterprise scenarios, Hadoop YARN is often used as the cluster management system for Apache Spark. Spark can also run on top of Apache Mesos and Kubernetes, and standalone clusters are sometimes deployed in containers on Docker Swarm.
Apache Spark features a DAG (Directed Acyclic Graph) scheduler, the scheduling layer of the cluster-computing framework. Spark builds the data processing commands into a DAG. In simple terms, the DAG determines which tasks run on which worker nodes and in what order.
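To make the idea of DAG-based scheduling concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). It is not Spark's scheduler; the stage names and the `schedule_in_waves` helper are hypothetical, and the point is only to show how a dependency graph yields an execution order in which independent stages can run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each stage maps to the set of stages it depends on.
stage_dependencies = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
}


def schedule_in_waves(dependencies):
    """Group stages into waves; stages in the same wave could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule_in_waves(stage_dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this toy graph, the two read stages land in the first wave because neither depends on the other, while the join has to wait for both of its inputs.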
The Apache Spark Ecosystem
The Apache Spark ecosystem includes many modules, which we’ll discuss one by one:
1. Spark Core
Everything in Apache Spark is built on top of the Spark Core, the underlying execution engine of the cluster-computing framework. It features:
- Easy development with Java, Python, and Scala APIs
- In-memory computing capability for speedier execution, and
- Support for a wide variety of applications with a generalized execution model
Most of the Apache Spark Core API is built around the concept of the RDD (Resilient Distributed Dataset). While Spark Core provides the classic map and reduce functionality, it also offers built-in support for other operations, such as joining datasets.
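The map, reduce, and join operations mentioned above can be illustrated on a single machine with plain Python. This sketch does not use Spark; the sample data and the `inner_join` helper are invented for illustration, and a real cluster would spread each step across many workers.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark makes big data simple", "big data needs spark"]

# Map: emit a (word, 1) pair for every word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: fold each group's values into a single count.
word_counts = {
    word: reduce(lambda a, b: a + b, counts) for word, counts in grouped.items()
}


def inner_join(left, right):
    """Join two lists of (key, value) pairs on their keys."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


users = [(1, "ana"), (2, "bo")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
joined = inner_join(users, orders)

print(word_counts["spark"])  # 2
print(joined)  # [(1, ('ana', 'book')), (1, ('ana', 'pen'))]
```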
2. Spark SQL and DataFrames
Intended for structured data processing, the Spark SQL module, the successor to the earlier Shark project, acts as a distributed SQL query engine and offers a programming abstraction called DataFrames.
Spark SQL allows unmodified Hadoop Hive queries to run faster on existing data and deployments. It also offers robust integration options with the rest of the Spark ecosystem, such as integrating SQL query processing with machine learning.
This structured data processing module provides support for a SQL 2003-like interface, making it equally usable by developers as well as analysts. Spark SQL offers an out-of-the-box standard interface for reading from as well as writing to:
- Apache Hive
- Apache ORC
- Apache Parquet
Support for other popular data stores, such as Apache Cassandra, Apache HBase, and MongoDB, is available through separate connectors from Spark Packages. Once a DataFrame is registered as a temporary table, SQL queries can be run on top of it.
Under the hood, Apache Spark uses Catalyst, a query optimizer that inspects the data and the queries in order to devise an efficient query plan, taking data locality and computation into account, so that the required calculations can be performed across the entire cluster.
3. Spark RDDs
RDD stands for Resilient Distributed Dataset. It is a programming abstraction representing an immutable collection of objects that is split across a computing cluster.
Operations on an RDD can likewise be split across the cluster and executed as a parallel batch process, which makes processing fast and scalable. RDDs can be created from a range of sources, including:
- Amazon S3 buckets
- NoSQL data stores like Apache Cassandra and MongoDB
- Plain text files
- SQL databases
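The two defining traits of an RDD, immutability and partitioning, can be mimicked with a small toy class. This is a teaching sketch in plain Python, not the Spark RDD API: `ToyRDD` and its methods are hypothetical, although the method names are chosen to resemble the familiar `map`, `filter`, and `collect` operations.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable partitions plus lazy transformations."""

    def __init__(self, partitions, transforms=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._transforms = tuple(transforms)

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # Transformations return a new object; the original is never modified.
        return ToyRDD(self._partitions, self._transforms + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._transforms + (("filter", predicate),))

    def _compute(self, partition):
        items = list(partition)
        for kind, fn in self._transforms:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # Action: only now are the recorded transformations applied,
        # one partition at a time (a cluster would do this in parallel).
        result = []
        for partition in self._partitions:
            result.extend(self._compute(partition))
        return result


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(even_squares.collect())  # [4, 16, 36]
print(numbers.collect())       # [1, 2, 3, 4, 5, 6], the source is unchanged
```

Because each transformation only records what should happen, nothing is computed until `collect` is called, which is a simplified picture of lazy evaluation.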
Note: Although the RDD interface is still supported in Spark 2 and higher, using the Spark SQL module is officially recommended. You should use the RDD interface only in scenarios where the Spark SQL module cannot meet all the data processing requirements of a project.
4. Spark GraphX
Spark GraphX is a graph computation engine that allows users to interactively build, manage, and transform graph-structured data. It features an array of distributed algorithms for processing graph structures, including an implementation of Google’s PageRank.
The algorithms available in GraphX use Spark Core’s RDD approach to modeling data. The GraphFrames package allows graph operations on DataFrames while benefiting from the Catalyst optimizer for graph queries.
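For readers unfamiliar with PageRank, the algorithm mentioned above, the following single-machine sketch shows the basic iteration. It is plain Python rather than GraphX code; the `pagerank` function, the sample graph, and the damping factor of 0.85 are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively rank pages; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph, page `c` receives links from both other pages and ends up with the highest rank. A distributed engine performs the same kind of iteration, but with the rank contributions exchanged between partitions of the graph.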
5. Spark Streaming
During the early days of Apache Hadoop, batch and stream processing were two separate concepts. For batch processing, writing MapReduce code was required, while for real-time streaming, something like Apache Storm was employed.
Modern applications, however, are not limited to analyzing and processing batch data. Many also need to process streams of real-time data.
A major downside of the classic Hadoop approach was that disparate codebases had to be kept in sync. This was difficult, if not outright frustrating, because the underlying frameworks, the resources required, and the operational approaches were all different.
To address this, Apache Spark introduced the Spark Streaming module. It keeps to Spark’s ease-of-use and fault tolerance standards while offering a powerful interface for applications that need to process both stored and live data.
6. Micro batching
The Spark Streaming module integrates readily with a range of popular data sources, such as Flume, HDFS, Kafka, and even Twitter. It breaks the stream of data into a continuous series of micro-batches, which can then be manipulated using the Spark API.
This allows batch and streaming applications to share most of their code while running on the same framework. Micro-batching, however, was criticized for being slow in cases where a low-latency response to incoming data was required.
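The micro-batching idea can be sketched with a Python generator. This is a simplified illustration and not the Spark Streaming API: the helper names are hypothetical, and batches here are cut by record count, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Ordinary batch logic; the same function would work on stored data."""
    return sum(batch)


events = range(1, 8)  # stands in for an incoming stream of records
totals = [process_batch(batch) for batch in micro_batches(events, batch_size=3)]
print(totals)  # [6, 15, 7]
```

The sketch also hints at the latency criticism: a record is only processed once its batch has been cut, so it may wait for the rest of the batch to arrive.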
7. Structured Streaming
Other streaming-oriented frameworks, like Apache Apex and Apache Storm, offered better performance in such scenarios owing to their use of a pure streaming method. Spark’s response was Structured Streaming, a feature made available in the Apache Spark 2.x versions.
This higher-level API allows developers to create unbounded streaming DataFrames and Datasets. It also addresses several long-standing pain points, such as:
- Event-time aggregations, and
- Late delivery of messages
Each query on a structured stream passes through the Catalyst query optimizer and can also be run interactively. This makes it possible to run SQL queries against live streaming data.
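Event-time aggregation and late-arriving messages are easier to grasp with a small example. The sketch below is plain Python and not the Structured Streaming API; the `windowed_counts` function, the window size, and the allowed lateness are assumptions for illustration, and Spark’s own watermark handling differs in its details.

```python
from collections import defaultdict


def windowed_counts(events, window_size, allowed_lateness):
    """Count events per tumbling event-time window, dropping data that is too late.

    events: iterable of (event_time, key) pairs in the order they arrive.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # The watermark trails the newest event time seen so far.
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            # The window is already considered final, so the event is discarded.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order: 8 and 9 arrive late.
arrivals = [(1, "a"), (12, "a"), (8, "a"), (27, "a"), (9, "a")]
counts, dropped = windowed_counts(arrivals, window_size=10, allowed_lateness=5)

print(counts)   # {(0, 'a'): 2, (10, 'a'): 1, (20, 'a'): 1}
print(dropped)  # [(9, 'a')]
```

The event with time 8 arrives late but is still counted in its original window, while the event with time 9 arrives after the watermark has passed that window and is dropped.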
8. Continuous Processing
Before Apache Spark 2.3, Structured Streaming relied on the micro-batching scheme. Spark 2.3 supplemented this with Continuous Processing, a low-latency processing mode that can handle responses with latencies as low as 1 ms.
Continuous Processing mode supports only a restricted set of queries and is still labeled experimental in Spark 2.4.
9. Spark MLlib
The MLlib module is a scalable machine learning library in the Apache Spark ecosystem that provides fast, high-quality ML algorithms. It can be included in complete workflows, as the library is usable from Java, Python, and Scala.
With the Spark MLlib module, developers can create efficient machine learning pipelines and easily implement feature extraction, selection, and transformation on structured datasets.
MLlib features distributed implementations of clustering and classification algorithms such as k-means clustering and random forests. These can be easily swapped in and out of custom ML pipelines.
Data scientists can use R or Python to train machine learning models in Apache Spark, which can then be imported into a Java or Scala pipeline for near-immediate production use.
MLlib, however, covers only basic ML tasks, namely classification, clustering, filtering, and regression. For modeling and training deep neural networks, there are Deep Learning Pipelines (still in development).
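As a reminder of what k-means clustering does, here is a minimal one-dimensional version in plain Python. It is a toy and not the distributed library implementation; the `kmeans_1d` function, the sample points, and the starting centroids are chosen purely for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster numbers by repeatedly assigning points and re-averaging centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(point - centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])

print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

A distributed implementation follows the same assign-then-update loop, with the assignment step spread across the partitions of the dataset.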
10. Deep Learning Pipelines
Support for deep learning is offered in Apache Spark through Deep Learning Pipelines. These leverage the existing MLlib pipeline structure for:
- Applying custom Keras models/TensorFlow graphs to incoming data
- Calling into lower-level deep learning libraries to construct classifiers
Deep Learning Pipelines even allows applying deep learning models to available data as part of SQL statements. It does so by registering the Keras models and/or TensorFlow graphs as custom Spark SQL user-defined functions (UDFs).
Why Spark and Not Hadoop?
Apache Spark and Apache Hadoop are sister technologies of a sort. Although they share several similarities and are both big data processing platforms, a number of differences set them apart.
Interestingly, Spark is included in many recent Apache Hadoop distributions. Hadoop rose to prominence thanks to its MapReduce paradigm. Apache Spark, however, offers a faster and more flexible alternative to MapReduce.
In this comparison, we’ll limit ourselves to two major parameters: speed and developer-friendliness.
Apache Spark features an in-memory data engine. This allows the big data processing framework to run workloads faster. It can be up to 100 times faster than the MapReduce paradigm (Apache Hadoop) in some specific scenarios.
The gain is most pronounced for multi-stage jobs, where MapReduce has to write state to disk between stages. Even Apache Spark jobs whose data can’t be fully held in memory tend to be about ten times faster than the same jobs running with the MapReduce technique.
The MapReduce processing technique is limited to a two-stage execution graph involving data mapping and reducing. Spark’s DAG (Directed Acyclic Graph) scheduler, on the other hand, supports multi-stage execution graphs that can be distributed much more efficiently.
Apache Spark is built with developer-friendliness in mind. In addition to offering bindings for popular data analysis programming languages like R and Python, the cluster-computing framework also provides bindings for versatile programming languages such as Java and Scala.
Due to its ease-of-use, the advantages offered by Apache Spark can be leveraged by anyone ranging from seasoned application developers and dedicated data scientists to ardent learners. Applications can be written quickly in Java, Python, R, Scala, and SQL using Apache Spark.
The cluster-computing framework features 80+ high-level operators facilitating the development of parallel apps. Moreover, Apache Spark can be used interactively via Python, R, Scala, and SQL shells.
To sum up, Apache Hadoop is the older big data processing platform, whereas Apache Spark is a modern, more efficient one that extends the abilities of its predecessor.
Want to go in-depth about the differences between Hadoop and Spark? Check out this detailed Hadoop vs. Spark comparison.
Apache Spark is one of the most popular distributed big data processing frameworks at the moment. Even though it has a complex underlying architecture, the way that complexity is abstracted away to make the framework easy to work with is a remarkable feat.
For anyone fascinated by big data processing and distributed computing, learning Spark is undoubtedly advantageous and a big yes!
We prepared a list of the Best Apache Spark Courses if you want to start learning now.
The innate user-friendliness of the distributed cluster-computing framework paired with humongous, active community support makes understanding and working with Apache Spark a rewarding and digitally enlightening undertaking. So, all the very best!