Apache Hadoop is one of the most popular open-source projects for storing and processing Big Data. It is a powerful technology that allows organizations and individuals to make sense of huge volumes of data, especially unstructured data, efficiently and cost-effectively.
Several job profiles in the IT sector concerning Big Data require a good understanding of Apache Hadoop.
Top Hadoop Interview Questions and Answers
If you’re preparing for such an interview, here are the best Hadoop interview questions to prepare for the same or gauge your progress up until now:
Question: What is Hadoop? Name its components.
Answer: Apache Hadoop is an open-source software framework that offers a wide range of tools and services to store and process Big Data. Decision-makers leverage Hadoop to analyze Big Data and arrive at well-informed business decisions. Hadoop has the following components:
- Processing framework (YARN)
- Storage unit (HDFS)
Question: Compare relational database management systems with HDFS (Hadoop Distributed File System)?
Answer: Following are the various differences between HDFS and RDBMS:
- Data Storage - In an RDBMS, the schema of the data is always known and only structured data is stored. On the contrary, Hadoop can store structured, semi-structured, and even unstructured data.
- Processing Ability - An RDBMS has comparatively limited capabilities for processing very large datasets. Hadoop, on the other hand, processes data in parallel across the nodes of the Hadoop cluster.
- Schema Approach - Another major distinction between HDFS and an RDBMS is the schema approach. While RDBMS follows the traditional schema-on-write approach, where the schema is validated prior to loading the data, the HDFS follows the modern schema-on-read approach.
- Read/Write Speeds - Reads are fast in RDBMSs as the schema is already known. Hadoop fosters faster writes as there is no schema validation during an HDFS write.
- Pricing - Many enterprise RDBMSs are paid software. Hadoop, by contrast, is an open-source framework with a wide community and a plethora of additional tools and libraries.
- Ideal Usage - An RDBMS is best suited to OLTP systems. Hadoop, however, can be employed for data discovery, analytics, OLAP systems, etc.
Question: Please explain HDFS and YARN?
Answer: HDFS or Hadoop Distributed File System is the storage unit of Apache Hadoop. Following the master/slave topology, the HDFS stores several forms of data as blocks in a distributed environment. It has two components:
- NameNode - It is the master node that maintains metadata pertaining to the stored data blocks.
- DataNodes - Slave nodes that store data in the HDFS. All DataNodes are managed by the NameNode.
Yet Another Resource Negotiator or YARN is the processing framework of Apache Hadoop, introduced in Hadoop 2. It is responsible for managing resources along with offering an execution environment for the processes. YARN has 2 components:
- ResourceManager - Receives the processing requests and passes them, in parts, to the relevant NodeManagers. It also allocates resources to applications depending on their needs.
- NodeManager - Installed on each DataNode, responsible for executing tasks.
Question: Explain the various Hadoop daemons?
Answer: There are a total of 6 Hadoop daemons in a Hadoop cluster:
- NameNode - This is the master node that stores metadata of all the directories and files of a Hadoop cluster. Contains information about blocks and their location in the cluster.
- DataNode(s) - The slave node that stores the actual data. Multiple in number.
- Secondary NameNode - Merges changes - edit log - with the FsImage in the NameNode at regular intervals of time. The modified FsImage stored in the persistent storage by the Secondary NameNode can be used in scenarios involving the failure of the NameNode.
- ResourceManager - Responsible for managing resources as well as scheduling apps running on top of YARN.
- NodeManager - It is responsible for:
  - Launching the application’s containers
  - Monitoring the resource usage of those containers
  - Reporting its status and monitoring details to the ResourceManager
- JobHistoryServer - Maintains information about the MapReduce jobs post-termination of the Application Master
NameNode, DataNode(s), and Secondary NameNode are HDFS daemons, while ResourceManager and NodeManager are YARN daemons.
Question: Briefly explain Hadoop architecture?
Answer: The storage layer of Apache Hadoop, the Hadoop Distributed File System or HDFS, follows a master/slave architecture. Here, a cluster comprises a single NameNode, the master node, and all the remaining nodes are DataNodes, or slave nodes.
While the NameNode contains information about the data stored i.e. metadata, DataNodes are where the data is actually stored in a Hadoop cluster.
Question: What are the differences between HDFS and NAS (Network Attached Storage)?
Answer: Following are the important points of distinction between HDFS and NAS:
1. Definition
Network-attached Storage is a file-level computer data storage server connected to a computer network. Simply put, NAS can be some software or hardware that offers services for storing as well as accessing data files.
Hadoop Distributed File System, on the other hand, is a distributed file system that stores data by means of commodity hardware.
2. Data Storage
While data is stored on dedicated hardware in NAS, HDFS stores data in the form of data blocks that are distributed across all the machines comprising a Hadoop cluster.
3. MapReduce Compatibility
HDFS is designed in such a way that it facilitates working with the MapReduce paradigm. Here, computation is shifted to the data. NAS is poorly suited to the MapReduce paradigm because its data is stored separately from where the computation actually takes place.
4. Cost
Since HDFS uses commodity hardware, using HDFS is a cost-effective solution compared to the pricey, dedicated, high-end storage devices required by NAS.
Question: What is the major difference between Hadoop 1 and 2?
Answer: Hadoop was originally released in April of 2006. The first full-blown Hadoop release, Hadoop 1.0.0, was released in December 2011, and Hadoop 2 became generally available in October 2013. Hadoop 2 added YARN as a replacement for the MapReduce engine (MRv1) in Hadoop 1.
The central resource manager, YARN, enables running several applications in Hadoop, all of which share a common pool of cluster resources. Hadoop 2 uses MRv2 - a distinct kind of distributed application - to execute the MapReduce framework on top of YARN.
Question: Please compare Hadoop 2 and 3?
Answer: Hadoop 3 was released on 13th December 2017. Following are the important differences between the Hadoop 2.x.x and Hadoop 3.x.x releases:
1. Point of Failure
In Hadoop 1, the NameNode is a single point of failure, which posed a significant problem for achieving high availability. Hadoop 2 addressed this by introducing an active NameNode and a single passive (standby) NameNode. Hadoop 3 goes further and supports multiple passive NameNodes. When the active NameNode fails, one of the passive NameNodes can take control.
2. Application Development
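YARN in Hadoop 3 can run applications inside Docker containers. Packaging an application together with its dependencies helps in reducing the total time required for application development.
3. Storage Overhead
The implementation of erasure coding in Hadoop 3 results in a decreased storage overhead compared with plain replication.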
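To put numbers on the storage savings, here is a small Python sketch comparing 3-way replication with a Reed-Solomon layout of 6 data blocks and 3 parity blocks. The RS(6, 3) layout is an assumption picked for illustration, and the function names are hypothetical rather than part of any Hadoop API.

```python
def replication_overhead(replication_factor):
    """Extra storage, as a fraction of the raw data size, under replication."""
    return float(replication_factor - 1)


def erasure_coding_overhead(data_blocks, parity_blocks):
    """Extra storage, as a fraction of the raw data size, under erasure coding."""
    return parity_blocks / data_blocks


# Assumed layouts: 3 replicas versus 6 data blocks + 3 parity blocks.
print(f"3x replication: {replication_overhead(3):.0%} overhead")
print(f"RS(6, 3):       {erasure_coding_overhead(6, 3):.0%} overhead")
```

Under these assumptions, replication stores two extra copies of every block, while the erasure-coded layout stores only half a block of parity per data block.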
4. GPU Hardware Usage
YARN in Hadoop 2 cannot schedule GPUs as a cluster resource, which makes it hard to execute DL (deep learning) algorithms on a cluster. Hadoop 3 adds the ability to use GPU hardware within a Hadoop cluster.
Question: Briefly explain active and passive NameNodes.
Answer: The active NameNode is the one that serves client requests in a cluster. The passive NameNode holds the same metadata as the active NameNode and takes over only when the active NameNode fails. Hence, its purpose is to achieve a high degree of availability.
Question: Why are DataNodes frequently added to or removed from a Hadoop cluster?
Answer: There are two reasons for adding (commissioning) and/or removing (decommissioning) DataNodes frequently:
- Utilizing commodity hardware, which makes DataNode failures relatively common
- Scaling i.e. accommodating rapid growth in data volume
Question: What will happen if two users try to access the same file in HDFS?
Answer: HDFS allows only one writer per file at a time. Upon receiving the first request to open the file for writing, the NameNode grants a lease to the first user. When the other user tries to do the same, the NameNode notices that the lease is already granted and rejects the request. Reading is not affected: multiple users can read the same file concurrently.
Question: Please explain how NameNode manages DataNode failures?
Answer: The NameNode receives a periodical heartbeat message from each of the DataNodes in a Hadoop cluster, implying the proper functioning of the same. When a DataNode fails to send a heartbeat message, it is marked dead by the NameNode after a set period of time.
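The heartbeat check can be sketched in a few lines of Python. The 3-second heartbeat interval, the 5-minute recheck interval, and the timeout formula below are assumptions based on commonly described HDFS defaults; check the configuration of your own release before relying on them. The function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3    # assumed default interval between heartbeats
RECHECK_INTERVAL_S = 300    # assumed default interval between liveness checks


def dead_node_timeout(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Seconds of silence after which a DataNode is considered dead (assumed formula)."""
    return 2 * recheck_s + 10 * heartbeat_s


def dead_nodes(last_heartbeat, now, timeout_s):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout_s)


timeout = dead_node_timeout()
# Timestamps are seconds on an arbitrary clock; dn1 has been silent since t=0.
print(timeout, dead_nodes({"dn1": 0, "dn2": 600}, now=700, timeout_s=timeout))
```

With these assumed values a DataNode is declared dead after 630 seconds of silence, so a few missed heartbeats alone do not trigger re-replication.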
Question: What do you understand by Checkpointing?
Answer: Performed by the Secondary NameNode, Checkpointing reduces NameNode startup time. The process, in essence, involves merging the FsImage with the edit log and compacting the two into a new FsImage.
Checkpointing allows the NameNode to load the final in-memory state directly from the FsImage.
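As a rough illustration of the merge, the sketch below models the FsImage as a dictionary from path to block IDs and the edit log as a list of operations. This data model and the three operation names are simplifying assumptions, not the real on-disk formats.

```python
def apply_edit_log(fsimage, edit_log):
    """Return a new namespace snapshot with every logged edit applied in order."""
    namespace = dict(fsimage)  # copy, so the old FsImage is left untouched
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = args[0]              # args[0]: list of block IDs
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt"),
]
new_fsimage = apply_edit_log(fsimage, edit_log)
print(new_fsimage)
```

After the merge the edit log can be discarded, which is why a restarting NameNode that loads the new FsImage has far fewer edits to replay.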
Question: Please explain how fault tolerance is achieved in HDFS?
Answer: For achieving fault tolerance, HDFS has something called Replication Factor. It is the number of copies (replicas) of each data block that HDFS keeps on different DataNodes.
By default, the Replication Factor is 3 i.e. every block is stored on 3 different DataNodes. In case of a DataNode failure, the NameNode arranges for the affected blocks to be copied from one of the surviving replicas to another DataNode, thus keeping the data readily available.
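The recovery step can be sketched as follows, assuming a replication factor of 3 means three replicas of each block in total. The block map and the function names are hypothetical, and real HDFS chooses target nodes with far more care than this alphabetical pick.

```python
def under_replicated(block_map, live_nodes, replication_factor=3):
    """Return {block: live replica count} for blocks that have lost replicas."""
    result = {}
    for block, nodes in block_map.items():
        live = len(nodes & live_nodes)
        if live < replication_factor:
            result[block] = live
    return result


def re_replicate(block_map, live_nodes, replication_factor=3):
    """Top each block back up to the replication factor using spare live nodes."""
    repaired = {}
    for block, nodes in block_map.items():
        holders = nodes & live_nodes
        spare = sorted(live_nodes - holders)
        missing = max(0, replication_factor - len(holders))
        repaired[block] = holders | set(spare[:missing])
    return repaired


block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
live_nodes = {"dn2", "dn3", "dn4", "dn5"}  # dn1 has failed
print(under_replicated(block_map, live_nodes))
print(re_replicate(block_map, live_nodes))
```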
Question: How does Apache Hadoop differ from Apache Spark?
Answer: There are several capable cluster computing frameworks for meeting Big Data challenges. Apache Hadoop is an apt solution for analyzing Big Data when efficiently handling batch processing is the priority.
When the priority, however, is to effectively handle real-time data then we have Apache Spark. Unlike Hadoop, Spark is a low latency computing framework capable of interactively processing data.
Both Apache Hadoop and Apache Spark are popular cluster computing frameworks. That, however, doesn’t mean that they are identical. In fact, they cater to different analysis requirements of Big Data. Following are the various differences between the two:
- Engine Type - While Hadoop is just a basic data processing engine, Spark is a specialized data analytics engine.
- Intended For - Hadoop is designed for batch processing of very large volumes of data. Spark, on the other hand, serves the purpose of processing real-time data generated by real-time events, such as social media.
- Latency - In computing, latency represents the difference between the time when the instruction of the data transfer is given and the time when the data transfer actually starts. Hadoop is a high-latency computing framework, whereas Spark is a low-latency computing framework.
- Data Processing - Spark processes data interactively, while Hadoop can’t. Data is processed in the batch mode in Hadoop.
- Complexity/The Ease of Use - Spark is easier to use thanks to an abstraction model. Users can easily process data with high-level operators. Hadoop’s MapReduce model is complex.
- Job Scheduler Requirement - Spark features in-memory computation. Unlike Hadoop, Spark doesn’t require an external job scheduler.
- Security Level - Both Hadoop and Spark can be secured, but Hadoop offers the more extensive set of security features.
- Cost - Since the MapReduce model works from disk and needs less memory, Hadoop is typically less costly than Spark, whose in-memory computing calls for more RAM.
Question: What are the five V’s of Big Data?
Answer: The five V’s of Big Data are Value, Variety, Velocity, Veracity, and Volume. Each of them is explained as follows:
- Value - Unless working on Big Data yields results to improve the business process or revenue or in some other way, it is useless. Value refers to the amount of productivity that Big Data brings.
- Variety - Refers to the heterogeneity of data types. Big Data is available in a number of formats, such as audio files, CSVs, and videos. These formats represent the variety of Big Data.
- Velocity - Refers to the rate at which Big Data is generated and grows.
- Veracity - Refers to the uncertainty of data, i.e. how far it can be trusted given inconsistency and incompleteness.
- Volume - Refers to the amount of Big Data, which is typically measured in Petabytes and Exabytes.
Question: What is the ideal storage for NameNode and DataNodes?
Answer: Dealing with Big Data requires a lot of storage space. Hence, inexpensive commodity hardware with plenty of disk capacity is ideal for DataNodes.
As NameNode is the master node that stores metadata of all data blocks in a Hadoop cluster, it requires high memory space i.e. RAM. So, a high-end machine with good RAM is ideal for the NameNode.
Question: Please explain the NameNode recovery process?
Answer: The NameNode recovery process involves the following two steps:
- Step 1 - Start a new NameNode using the file system metadata replica i.e. FsImage.
- Step 2 - Configure the DataNodes and clients so that they acknowledge the new NameNode.
As soon as the new NameNode completes loading the last checkpoint FsImage and receives enough block reports from the DataNodes, it will start serving the client.
Question: Why shouldn’t we use HDFS for storing a lot of small-size files?
Answer: HDFS is better suited to storing a large amount of data in a single file than a small amount of data spread across many files.
If you use HDFS for storing a lot of small files, the metadata of these files becomes significant in comparison to the data they actually hold. Since the NameNode keeps this metadata in memory, it requires a disproportionate amount of RAM, making the whole setup inefficient.
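A back-of-the-envelope estimate shows the effect. The sketch assumes the often-quoted rule of thumb of roughly 150 bytes of NameNode memory per namespace object (one per file plus one per block); that figure, like the function name, is an assumption for illustration and not a measured value.

```python
import math

KB = 1024
MB = 1024 * KB
GB = 1024 * MB
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure


def namenode_objects(file_sizes, block_size):
    """Count namespace objects: one per file plus one per block of that file."""
    return sum(1 + math.ceil(size / block_size) for size in file_sizes)


one_big_file = namenode_objects([1 * GB], block_size=128 * MB)
many_small_files = namenode_objects([128 * KB] * 8192, block_size=128 * MB)

# Both layouts hold 1 GB of data, yet the metadata footprint differs widely.
print(one_big_file * BYTES_PER_OBJECT, "bytes of metadata for one 1 GB file")
print(many_small_files * BYTES_PER_OBJECT, "bytes of metadata for 8192 small files")
```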
Question: What are the default block sizes in Hadoop 1, 2, and 3? How can we change it?
Answer: The default block size in Hadoop 1 is 64MB, while in Hadoop 2 and Hadoop 3 it is 128MB. To set the block size as per the requirements, use the dfs.blocksize parameter (named dfs.block.size in older releases) in the hdfs-site.xml file.
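To see what a given block size means for a particular file, the following sketch splits a file size into block lengths. The helper name is hypothetical; the point it illustrates is that the last block only occupies as much space as the remaining data needs.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the length of each block a file of the given size would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(300 * MB, 128 * MB)
print([size // MB for size in blocks])  # block lengths in MB
```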
Question: How can we check whether the Hadoop daemons are running or not?
Answer: In order to check whether the Hadoop daemons are running or not, we use the jps (Java Virtual Machine Process Status Tool) command. It displays a list of all the up and running Hadoop daemons.
Question: What is Rack Awareness in Hadoop?
Answer: Rack Awareness is the algorithm by which the NameNode decides how blocks and their replicas are placed. The NameNode decides on the basis of rack definitions, with the intent of minimizing network traffic between racks by preferring DataNodes in the same rack.
The default Replication Factor for a Hadoop cluster is 3. This means for every block of data, three copies will be available. Two copies will exist in one rack and the other one in some other rack. It is called the Replica Placement Policy.
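A simplified placement routine might look like the sketch below. It assumes the first replica goes to the writer's own node and the other two go to two different nodes of a single remote rack; real HDFS picks nodes randomly and considers load, whereas this hypothetical function simply takes the first candidates alphabetically.

```python
def place_replicas(topology, writer_node):
    """topology maps rack name -> list of DataNodes. Returns three target nodes."""
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # Assumed policy: choose one remote rack that can host the two remaining replicas.
    remote_rack = next(
        rack for rack, nodes in sorted(topology.items())
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, "dn1"))
```

Keeping two replicas in one rack limits cross-rack traffic during the write, while the replica in the other rack protects against the loss of a whole rack.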
Question: Please explain Speculative Execution in Hadoop?
Answer: Upon finding a node that is executing a task slower, the master node executes another instance of the same task on some other node. Out of the two, the task that first finishes is accepted while the other one is killed. This is called Speculative Execution in Hadoop.
Question: Please explain the difference between HDFS Block and an Input Split?
Answer: An HDFS Block is the physical division of the stored data in a Hadoop cluster. On the contrary, the Input Split is the logical division of the same.
While HDFS divides the stored data into blocks for storing them in an efficient way, MapReduce divides the data into Input Splits and assigns each one to a mapper function for further processing.
Question: What are the various modes in which Apache Hadoop can run?
Answer: Apache Hadoop runs in three modes:
- Standalone/local mode - It is the default mode in Hadoop. All the Hadoop components run as a single Java process in this mode and use the local filesystem.
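The difference shows up when a record straddles a block boundary. The sketch below cuts some sample data into fixed-size physical blocks, then reads logical splits using a convention similar to the one line-based readers follow: a split that does not start at offset 0 skips its first (possibly partial) line, and every split reads on past its end to finish the last line. The 8-byte block size and the function names are assumptions for illustration.

```python
def physical_blocks(data, block_size):
    """Cut the raw bytes into fixed-size blocks, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def read_split(data, start, length):
    """Return the whole lines that belong to the split covering [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first line: the previous split reads it to completion.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
blocks = physical_blocks(data, 8)
splits = [read_split(data, start, 8) for start in range(0, len(data), 8)]
print(blocks)
print(splits)
```

The blocks cut words in half, but every record appears exactly once, and whole, in the splits.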
- Pseudo-distributed mode - A single-node Hadoop deployment runs in the pseudo-distributed mode. All the Hadoop services are executed on a single compute node in this mode.
- Fully distributed mode - In fully distributed mode, the Hadoop master and slave services run separately on distinct nodes.
Question: How will you restart NameNode or all the Hadoop daemons?
Answer: For restarting the NameNode:
- Step 1 - First, from the Hadoop installation directory, enter the ./sbin/hadoop-daemon.sh stop namenode command to stop the NameNode.
- Step 2 - Now, enter the ./sbin/hadoop-daemon.sh start namenode command to start the NameNode.
For restarting all the Hadoop daemons:
- Step 1 - To stop all the Hadoop daemons, use the ./sbin/stop-all.sh command from the Hadoop installation directory.
- Step 2 - To start all the Hadoop daemons once again, use the ./sbin/start-all.sh command.
Question: Define MapReduce. What is the syntax for running a MapReduce program?
Answer: MapReduce is a programming model, together with an associated implementation, for processing and generating Big Data sets with a parallel, distributed algorithm on a Hadoop cluster. A MapReduce program comprises:
- Map Procedure - Performs filtering and sorting
- Reduce Method - Performs a summary operation
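The two procedures can be illustrated with a word count run entirely in memory. This is a plain Python simulation of the model, not Hadoop code, and the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key and sort the keys, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: summarize all values of one key."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["big data big cluster", "data node"])
print(counts)
```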
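The syntax for running a MapReduce program is: hadoop jar <jar_file> [main_class] <arguments>, where the arguments typically include the input and output paths.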
Question: Enumerate the various configuration parameters that need to be specified in a MapReduce program?
Answer: Following are the various configuration parameters that users need to specify in a MapReduce program:
- The input format of data
- Job’s input location in the distributed file system
- Job’s output location in the distributed file system
- The output format of data
- The class containing the map function
- The class containing the reduce function
- The JAR file containing the mapper, reducer, and driver classes
Question: Why is it not possible to perform aggregation in the mapper? Why do we need a reducer for the same?
Answer: Following are the various reasons why it is not possible to perform aggregation in the mapper:
- Aggregation requires the output of all the mapper functions, which cannot be collected in the map phase because each mapper sees only its own input split and the mappers may be running on different machines.
- Aggregation can’t be done without sorting, and sorting by key across all map outputs doesn’t occur in the mapper function.
- If we try to aggregate data at the mapper, all mapper functions would need to communicate with each other. As different mapper functions might be running on different machines, this requires high network bandwidth and might lead to network bottlenecks.
As sorting only occurs on the reducer side, we require the reducer function to accomplish aggregation.
Question: Why do we need RecordReader in Hadoop? Where is it defined?
Answer: The Input Split only describes a slice of the input; it does not specify how the data is to be accessed. The RecordReader class is responsible for loading the data from its source and converting it into K,V (Key, Value) pairs suitable for reading by the Mapper task. The Input Format defines an instance of the RecordReader.
Question: Please explain the Distributed Cache in a MapReduce framework?
Answer: The Distributed Cache is a utility offered by the MapReduce framework for caching files required by applications. Once the user caches a file for a job, the Hadoop framework makes it available on all data nodes where the map/reduce tasks are running. The cache file can be accessed as a local file in the Mapper or Reducer job.
Question: Does the MapReduce programming model allow reducers to communicate with one another?
Answer: Reducers run in isolation in the MapReduce framework. There is no way for them to communicate with one another.
Question: Please explain a MapReduce Partitioner?
Answer: The MapReduce Partitioner helps in evenly distributing the map output over the reducers. It does so by ensuring that all the values of a single key go to the same reducer.
The MapReduce Partitioner redirects the mapper output to the reducer by determining which reducer is responsible for a particular key.
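A hash-based partitioner can be sketched as below. It mirrors the usual "hash of the key, modulo the number of reducers" scheme; zlib.crc32 is used here only as a stand-in for a stable hash function, because Python's built-in hash() of a string changes between processes. The function name is hypothetical.

```python
import zlib


def get_partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # crc32 is an assumed stand-in for the key's hash code; the mask keeps it non-negative.
    stable_hash = zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF
    return stable_hash % num_reducers


keys = ["apple", "banana", "cherry", "apple"]
assignments = [(key, get_partition(key, 3)) for key in keys]
print(assignments)
```

Because the result depends only on the key, both "apple" records land on the same reducer no matter which mapper emitted them.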
Question: Can you explain the steps to write a custom partitioner in Apache Hadoop?
Answer: Following is the step-by-step procedure for writing a custom partitioner in Hadoop:
- Step 1 - Create a new class that extends the Partitioner class
- Step 2 - Next, override the getPartition method in that class
- Step 3 - Now, add the custom partitioner to the job, either through the job configuration or by calling the setPartitionerClass method on the job.
Question: What do you understand by Combiner in Hadoop?
Answer: Combiners enhance the efficiency of the MapReduce framework by reducing the amount of data that needs to be sent to the reducers. A combiner is a mini reducer that is responsible for performing the local reduce task.
A combiner receives the input from the mapper on a particular node, and sends the output to the reducer.
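The saving is easy to see in a small simulation that counts how many records would cross the network with and without a combiner. This is a sketch in plain Python with hypothetical function names; it relies on the aggregate (a sum) being commutative and associative, which is what makes local pre-aggregation safe.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the output of a single mapper."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_outputs = [map_words("a b a a"), map_words("b b c")]
combined_outputs = [combine(output) for output in mapper_outputs]

records_without_combiner = sum(len(output) for output in mapper_outputs)
records_with_combiner = sum(len(output) for output in combined_outputs)

# The reducer produces the same totals either way.
final_counts = Counter()
for output in combined_outputs:
    for key, value in output:
        final_counts[key] += value

print(records_without_combiner, records_with_combiner, dict(final_counts))
```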
Question: Can you explain SequenceFileInputFormat?
Answer: Sequence files are a binary file format, optionally compressed, optimized for passing data from the output of one MapReduce job to the input of another, which makes them an efficient intermediate representation. They can be generated as the output of other MapReduce tasks.
SequenceFileInputFormat is the input format for reading sequence files.
Question: List some of the most notable applications of Apache Hadoop?
Answer: Apache Hadoop is an open-source platform for accomplishing scalable and distributed computing of large volumes of data. It offers a rapid, performant, and cost-effective way of analyzing structured, semi-structured, and unstructured data. Following are some of the best use cases of Apache Hadoop:
- Analyzing customer data in real-time
- Archiving emails
- Capturing and analyzing clickstream, social media, transaction, and video data
- Content management
- Fraud detection and prevention
- Traffic management
- Making sense out of unstructured data
- Managing content and media on social media platforms
- Scientific research
- Streaming processing
Question: What are the benefits of using Distributed Cache?
Answer: Using a Distributed Cache has the following perks:
- It can distribute anything ranging from simple, read-only text files to complex files like archives.
- It tracks the modification timestamps of cache files.
Question: What is a Backup Node and a Checkpoint NameNode?
Answer: The Checkpoint NameNode creates checkpoints of the namespace at regular intervals. It does so by downloading the FsImage and edits files and merging them within its local directory. Post merging, the new FsImage is uploaded to the NameNode. The Checkpoint NameNode has the same directory structure as that of the NameNode.
The Backup Node is similar to the Checkpoint NameNode in terms of functionality. However, because it maintains an up-to-date in-memory copy of the file system namespace, it doesn’t need to download changes at regular intervals of time. In simple terms, the Backup Node saves its current in-memory state to an image file for creating a new Checkpoint.
Question: What are the common input formats in Apache Hadoop?
Answer: Apache Hadoop has three common input formats:
- Key-Value Input Format - Intended for plain text files where each line is split into a key and a value
- Sequence File Input Format - Intended for reading sequence files
- Text Input Format - This is the default input format in Hadoop
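The two text-based formats differ in what they hand to the mapper as key and value. The sketch below imitates that behavior on a small string, assuming the text format uses the byte offset of each line as the key and the key-value format splits each line at the first tab; the function names are hypothetical.

```python
def text_input_records(data):
    """Text-style records: key is the line's byte offset, value is the line."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def key_value_records(data, separator="\t"):
    """Key-value-style records: split each line at the first separator."""
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        yield key, value


data = "id1\talice\nid2\tbob\n"
text_records = list(text_input_records(data))
kv_records = list(key_value_records(data))
print(text_records)
print(kv_records)
```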
Question: Explain the core methods of a Reducer?
Answer: There are three core methods of a Reducer, explained as follows:
- cleanup() - Called only once at the end of a task, e.g. for cleaning up temporary files.
- reduce() - Called once per key with the values associated with that key.
- setup() - Called only once at the start of a task, for configuring various parameters, such as distributed cache and input data size.
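The order in which these methods are invoked can be sketched with a toy driver. The class and the run_reducer function are hypothetical Python stand-ins for the framework; they only illustrate the lifecycle of setup once, reduce once per key, and cleanup once at the end.

```python
class SumReducer:
    """Hypothetical reducer that sums values and records the calls it receives."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")

    def reduce(self, key, values):
        self.calls.append(f"reduce:{key}")
        return key, sum(values)

    def cleanup(self):
        self.calls.append("cleanup")


def run_reducer(reducer, grouped_input):
    """Drive one reduce task: setup, one reduce call per sorted key, then cleanup."""
    reducer.setup()
    try:
        return [reducer.reduce(key, values) for key, values in sorted(grouped_input.items())]
    finally:
        reducer.cleanup()  # runs even if a reduce call fails


reducer = SumReducer()
results = run_reducer(reducer, {"b": [2, 3], "a": [1, 2]})
print(results, reducer.calls)
```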
Question: Please explain the role of a JobTracker in Hadoop?
Answer: A JobTracker in a Hadoop 1 (MRv1) cluster is responsible for:
- Resource management i.e. managing TaskTrackers
- Task lifecycle management i.e. tracking task progress and tasks’ fault tolerance
- Tracking resource availability
Question: How is the Map-side Join different from the Reduce-side Join?
Answer: The Map-side Join requires a strict structure. It is performed when data reaches the Map and the input datasets must be structured. The Reduce-side Join is simpler as there is no requirement for the input datasets to be structured. The Reduce-side Join is less efficient than the Map-side Join as it needs to go through sorting and shuffling phases.
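The contrast can be sketched in plain Python. The map-side variant shown here is a replicated join, where a small dataset is held in memory by every mapper; that is only one way of doing a map-side join and is chosen as an assumption for brevity. The reduce-side variant groups both datasets by key first, which stands in for the shuffle and sort phases. Dataset contents and function names are made up for the example.

```python
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (2, "pen"), (1, "lamp")]


def map_side_join(small, large):
    """Join while streaming the large dataset; no shuffle is needed."""
    lookup = dict(small)  # small dataset kept in memory by each mapper
    return [(key, lookup[key], value) for key, value in large if key in lookup]


def reduce_side_join(left, right):
    """Group both datasets by key (the shuffle), then pair them up per key."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return [
        (key, l_value, r_value)
        for key in sorted(grouped)
        for l_value in grouped[key][0]
        for r_value in grouped[key][1]
    ]


print(map_side_join(users, orders))
print(reduce_side_join(users, orders))
```

Both functions produce the same joined rows; the difference lies in whether the data has to be regrouped by key before the join can happen.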
Question: Do you know how to debug Hadoop code?
Answer: Start by checking the list of MapReduce jobs that are currently running. Thereafter, check whether any orphaned jobs are running. If there are, determine the location of the RM (ResourceManager) logs. This can be done as follows:
- Step 1 - Use the ps -ef | grep -i ResourceManager command and look for the log directory in the result. Find out the job ID and check whether there is an error message associated with the orphaned job.
- Step 2 - Use the RM logs to identify the worker node involved in the execution of the task concerning the orphaned job.
- Step 3 - Log in to the affected node and run the following command:
ps -ef | grep -i NodeManager
- Step 4 - Examine the Node Manager log. Most of the errors are from the user-level logs for each MapReduce job.
That sums up our list of the top Hadoop interview questions. Hope you found these helpful for preparing for your upcoming interview or just checking your progress in learning Hadoop. And, don’t forget to check out these best Hadoop tutorials to learn Hadoop.