Apache Spark is one of the most popular distributed, general-purpose cluster-computing frameworks. The open-source tool offers an interface for programming an entire computer cluster with implicit data parallelism and fault tolerance features.
Apache Spark Interview Questions
Here we have compiled a list of the 20 most important Spark interview questions. These will help you gauge your Apache Spark preparation for cracking that upcoming interview. Do you think you can get the answers right? Well, you’ll only know once you’ve gone through it!
Q: Can you explain the key features of Apache Spark?
- Support for Several Programming Languages – Spark code can be written in any of the four programming languages, namely Java, Python, R, and Scala. It also provides high-level APIs in these programming languages. Additionally, Apache Spark provides shells in Python and Scala. The Python shell is accessed through the ./bin/pyspark directory, while for accessing the Scala shell one needs to go to the .bin/spark-shell directory.
- Lazy Evaluation – Apache Spark makes use of the concept of lazy evaluation, which is to delay the evaluation up until the point it becomes absolutely compulsory.
- Machine Learning – For big data processing, Apache Spark’s MLib machine learning component is useful. It eliminates the need for using separate engines for processing and machine learning.
- Multiple Format Support – Apache Spark provides support for multiple data sources, including Cassandra, Hive, JSON, and Parquet. The Data Sources API offers a pluggable mechanism for accessing structured data via Spark SQL. These data sources can be much more than just simple pipes able to convert data and pulling the same into Spark.
- Real-Time Computation – Spark is designed especially for meeting massive scalability requirements. Thanks to its in-memory computation, Spark’s computation is real-time and has less latency.
- Speed – For large-scale data processing, Spark can be up to 100 times faster than Hadoop MapReduce. Apache Spark is able to achieve this tremendous speed via controlled portioning. The distributed, general-purpose cluster-computing framework manages data by means of partitions that help in parallelizing distributed data processing with minimal network traffic.
- Hadoop Integration – Spark offers smooth connectivity with Hadoop. In addition to being a potential replacement for the Hadoop MapReduce functions, Spark is able to run on top of an extant Hadoop cluster by means of YARN for resource scheduling.
Q: What advantages does Spark offer over Hadoop MapReduce?
- Enhanced Speed – MapReduce makes use of persistent storage for carrying out any of the data processing tasks. On the contrary, Spark uses in-memory processing that offers about 10 to 100 times faster processing than the Hadoop MapReduce.
- Multitasking – Hadoop only supports batch processing via inbuilt libraries. Apache Spark, on the other end, comes with built-in libraries for performing multiple tasks from the same core, including batch processing, interactive SQL queries, machine learning, and streaming.
- No Disk-Dependency – While Hadoop MapReduce is highly disk-dependent, Spark mostly uses caching and in-memory data storage.
- Iterative Computation – Performing computations several times on the same dataset is termed as iterative computation. Spark is capable of iterative computation while Hadoop MapReduce isn’t.
Q: Please explain the concept of RDD (Resilient Distributed Dataset). Also, state how you can create RDDs in Apache Spark.
A: An RDD or Resilient Distribution Dataset is a fault-tolerant collection of operational elements that are capable to run in parallel. Any partitioned data in an RDD is distributed and immutable.
Fundamentally, RDDs are portions of data that are stored in the memory distributed over many nodes. These RDDs are lazily evaluated in Spark, which is the main factor contributing to the hastier speed achieved by Apache Spark. RDDs are of two types:
- Hadoop Datasets – Perform functions on each file record in HDFS (Hadoop Distributed File System) or other types of storage systems
- Parallelized Collections – Extant RDDs running parallel with one another
There are two ways of creating an RDD in Apache Spark:
- By parallelizing a collection in the Driver program. It makes use of SparkContext’s parallelize() method. For instance:
method val DataArray = Array(22,24,46,81,101) val DataRDD = sc.parallelize(DataArray)
- By means of loading an external dataset from some external storage, including HBase, HDFS, and shared file system
Q: What are the various functions of Spark Core?
A: Spark Core acts as the base engine for large-scale parallel and distributed data processing. It is the distributed execution engine used in conjunction with the Java, Python, and Scala APIs that offer a platform for distributed ETL (Extract, Transform, Load) application development.
Various functions of Spark Core are:
- Distributing, monitoring, and scheduling jobs on a cluster
- Interacting with storage systems
- Memory management and fault recovery
Furthermore, additional libraries built on top of the Spark Core allow it to diverse workloads for machine learning, streaming, and SQL query processing.
Q: Please enumerate the various components of the Spark Ecosystem.
- GraphX – Implements graphs and graph-parallel computation
- MLib – Used for machine learning
- Spark Core – Base engine used for large-scale parallel and distributed data processing
- Spark Streaming – Responsible for processing real-time streaming data
- Spark SQL – Integrates Spark’s functional programming API with relational processing
Q: Is there any API available for implementing graphs in Spark?
A: GraphX is the API used for implementing graphs and graph-parallel computing in Apache Spark. It extends the Spark RDD with a Resilient Distributed Property Graph. It is a directed multi-graph that can have several edges in parallel.
Each edge and vertex of the Resilient Distributed Property Graph has user-defined properties associated with it. The parallel edges allow for multiple relationships between the same vertices.
In order to support graph computation, GraphX exposes a set of fundamental operators, such as joinVertices, mapReduceTriplets, and subgraph, and an optimized variant of the Pregel API.
The GraphX component also includes an increasing collection of graph algorithms and builders for simplifying graph analytics tasks.
Q: Tell us how will you implement SQL in Spark?
A: Spark SQL modules help in integrating relational processing with Spark’s functional programming API. It supports querying data via SQL or HiveQL (Hive Query Language).
Also, Spark SQL supports a galore of data sources and allows for weaving SQL queries with code transformations. DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service are the four libraries contained by the Spark SQL.
Q: What do you understand by the Parquet file?
A: Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read as well as write operations. Having columnar storage has the following advantages:
- Able to fetch specific columns for access
- Consumes less space
- Follows type-specific encoding
- Limited I/O operations
- Offers better-summarized data
Q: Can you explain how you can use Apache Spark along with Hadoop?
A: Having compatibility with Hadoop is one of the leading advantages of Apache Spark. The duo makes up for a powerful tech pair. Using Apache Spark and Hadoop allows for making use of Spark’s unparalleled processing power in line with the best of Hadoop’s HDFS and YARN abilities.
Following are the ways of using Hadoop Components with Apache Spark:
- Batch & Real-Time Processing – MapReduce and Spark can be used together where the former handles the batch processing and the latter is responsible for real-time processing
- HDFS – Spark is able to run on top of the HDFS for leveraging the distributed replicated storage
- MapReduce – It is possible to use Apache Spark along with MapReduce in the same Hadoop cluster or independently as a processing framework
- YARN – Spark applications can run on YARN
Q: Name various types of Cluster Managers in Spark.
- Apache Mesos – Commonly used cluster manager
- Standalone – A basic cluster manager for setting up a cluster
- YARN – Used for resource management
Q: Is it possible to use Apache Spark for accessing and analyzing data stored in Cassandra databases?
A: Yes, it is possible to use Apache Spark for accessing as well as analyzing data stored in Cassandra databases using the Spark Cassandra Connector. It needs to be added to the Spark project during which a Spark executor talks to a local Cassandra node and will query only local data.
Connecting Cassandra with Apache Spark allows making queries faster by means of reducing the usage of network for sending data between Spark executors and Cassandra nodes.
Q: What do you mean by the worker node?
A: Any node that is capable of running the code in a cluster can be said to be a worker node. The driver program needs to listen for incoming connections and then accept the same from its executors. Additionally, the driver program must be network addressable from the worker nodes.
A worker node is basically a slave node. The master node assigns work which the worker node then performs. Worker nodes process data stored on the node and report the resources to the master node. The master node schedule tasks based on resource availability.
Q: Please explain the sparse vector in Spark.
A: A sparse vector is used for storing non-zero entries for saving space. It has two parallel arrays:
- One for indices
- The other for values
An example of a sparse vector is as follows:
Q: How will you connect Apache Spark with Apache Mesos?
A: Step by step procedure for connecting Apache Spark with Apache Mesos is:
- Configure the Spark driver program to connect with Apache Mesos
- Put the Spark binary package in a location accessible by Mesos
- Install Apache Spark in the same location as that of the Apache Mesos
- Configure the spark.mesos.executor.home property for pointing to the location where the Apache Spark is installed
Q: Can you explain how to minimize data transfers while working with Spark?
A: Minimizing data transfers as well as avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways for minimizing data transfers while working with Apache Spark are:
- Avoiding – ByKey operations, repartition, and other operations responsible for triggering shuffles
- Using Accumulators – Accumulators provide a way for updating the values of variables while executing the same in parallel
- Using Broadcast Variables – A broadcast variable helps in enhancing the efficiency of joins between small and large RDDs
Q: What are broadcast variables in Apache Spark? Why do we need them?
A: Rather than shipping a copy of a variable with tasks, a broadcast variable helps in keeping a read-only cached version of the variable on each machine.
Broadcast variables are also used to provide every node with a copy of a large input dataset. Apache Spark tries to distribute broadcast variable by using effectual broadcast algorithms for reducing communication cost.
Using broadcast variables eradicates the need of shipping copies of a variable for each task. Hence, data can be processed quickly. Compared to an RDD lookup(), broadcast variables assist in storing a lookup table inside the memory that enhances the retrieval efficiency.
Q: Please provide an explanation on DStream in Spark.
A: DStream is a contraction for Discretized Stream. It is the basic abstraction offered by Spark Streaming and is a continuous stream of data. DStream is received from either a processed data stream generated by transforming the input stream or directly from a data source.
A DStream is represented by a continuous series of RDDs, where each RDD contains data from some certain interval. An operation applied to a DStream is analogous to applying the same operation on the underlying RDDs. A DStream has two operations:
- Output operations responsible for writing data to an external system
- Transformations resulting in the production of a new DStream
It is possible to create DStream from various sources, including Apache Kafka, Apache Flume, and HDFS. Also, Spark Streaming provides support for several DStream transformations.
Q: Does Apache Spark provide checkpoints?
A: Yes, Apache Spark provides checkpoints. They allow for a program to run all around the clock in addition to making it resilient towards failures not related to application logic. Lineage graphs are used for recovering RDDs from a failure.
Apache Spark comes with an API for adding and managing checkpoints. The user then decides which data to the checkpoint. Checkpoints are preferred over lineage graphs when the latter are long and have wider dependencies.
Q: What are the different levels of persistence in Spark?
A: Although the intermediary data from different shuffle operations automatically persists in Spark, it is recommended to use the persist () method on the RDD if the data is to be reused.
Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels. These various persistence levels are:
- DISK_ONLY – Stores the RDD partitions only on the disk.
- MEMORY_AND_DISK – Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
- MEMORY_ONLY_SER – Stores RDD as serialized Java objects with one byte array per partition.
- MEMORY_AND_DISK_SER – Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
- MEMORY_ONLY – The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting into recomputing the same on the fly every time they are required.
- OFF_HEAP – Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
Q: Can you list down the limitations of using Apache Spark?
- Doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefitting from a file management system
- Higher latency but consequently, lower throughput
- No support for true real-time data stream processing. The live data stream is partitioned into batches in Apache Spark and after processing are again converted into batches. Hence, Spark Streaming is a micro-batch processing and not truly real-time data processing
- Lesser number of algorithms available
- Spark streaming doesn’t support record-based window criteria
- The work needs to be distributed over multiple clusters instead of running everything on a single node
- While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck
That completes the list of the 20 important Spark interview questions. Going through these questions will allow you to check your Spark knowledge as well as help prepare for an upcoming Apache Spark interview. All the best.
How many of the aforementioned questions did you already know the answers to? Which questions should or shouldn’t have made it to the list? Let us know via comments! Consider checking out these best Spark tutorials to further refine your Apache Spark skills.
You Might be Interested in Other Interview Questions: