15 Top Hadoop Ecosystem Components
Big Data refers to huge collections of data sets accumulated over time that are so variable in form, size, and structure that a traditional RDBMS cannot process them efficiently. Hadoop is a framework that helps in processing these data sets. It is made up of several modules that are supported by a large ecosystem of technologies.
What is the Hadoop Ecosystem?
Hadoop was developed based on Google's MapReduce system and is implemented on the principles of functional programming. Hadoop resolves the following main issues:
- Data Storage
- Data Structure
- Data Processing
The Hadoop Ecosystem is a software suite that provides support to resolve various Big Data problems. Its core components are services that have been developed and deployed by various organizations, and each component delivers a specific function.
Hadoop Ecosystem Components
The different components of the Hadoop Ecosystem are as follows:
1. The Hadoop Distributed File System: HDFS
The Hadoop Distributed File System is the most important part of the Hadoop Ecosystem. It stores structured and unstructured data sets across various nodes and maintains metadata in the form of log files. The main components of HDFS are the NameNode and the DataNode.
1.1. NameNode
- It is the master daemon that manages and maintains the DataNodes (slave nodes).
- It records the metadata (location, size, hierarchy, permissions) of all the blocks stored in the cluster.
- It records every change made to the file system metadata. For example, when a file is deleted, the deletion is immediately recorded in the EditLog.
- It receives regular heartbeats from the DataNodes to confirm that they are still alive.
- It keeps a record of all the blocks in HDFS and of the DataNodes on which they are stored.
1.2. DataNode
- It is the slave daemon that runs on each slave machine.
- DataNodes store the actual data. HDFS divides input files of any format into blocks, and the DataNodes store these blocks.
- It is responsible for serving read and write requests from the clients.
- It is also responsible for creating, deleting, and replicating blocks based on the decisions made by the NameNode.
- It sends a heartbeat to the NameNode every 3 seconds (by default) to report the overall health of HDFS.
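The division of labour between the NameNode (metadata) and the DataNodes (blocks) can be illustrated with a small toy model. This is only a sketch, not HDFS code: the 128 MB block size and replication factor of 3 are assumed as typical defaults, the round-robin placement stands in for the real rack-aware policy, and the function names and the heartbeat timeout are made up for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a toy block map: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    # Round-robin placement; real HDFS uses a rack-aware placement policy.
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def live_datanodes(last_heartbeat, now, timeout):
    """Return the DataNodes whose latest heartbeat is at most `timeout` seconds old."""
    return sorted(dn for dn, seen in last_heartbeat.items() if now - seen <= timeout)


nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = plan_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file
alive = live_datanodes({"dn1": 100.0, "dn2": 70.0}, now=103.0, timeout=10.0)
```

Here `block_map` plays the role of the NameNode's metadata, while `alive` shows how missed heartbeats would let the NameNode notice a dead DataNode.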
2. MapReduce
MapReduce is the core data processing component of Hadoop. It is a software framework for writing applications that process massive datasets using parallel and distributed algorithms within the Hadoop environment. The MapReduce framework also takes care of failures: if one node goes down, it recovers the data from another node.
MapReduce is built around two functions, Map() and Reduce().
- Map() – This function sorts and filters the data and organizes it into groups. It takes key-value pairs as input and produces key-value pairs as output.
- Reduce() – This function aggregates the mapped data. It takes the output generated by Map() as its input and combines it into a smaller set of tuples.
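The Map() and Reduce() roles described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop API: the function names are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single tuple."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big cluster"])
```

In a real cluster, the map calls run in parallel on the nodes that hold the input blocks, and the reduce calls run in parallel per key group.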
3. YARN
Yet Another Resource Negotiator (YARN) manages resources across the cluster. It performs scheduling and resource allocation for the Hadoop system. YARN consists of three major components:
- Resource Manager: Allocates resources to the applications in the system and schedules MapReduce jobs.
- Node Manager: Manages the resources of a single machine, such as CPU, memory, and bandwidth, and monitors their usage.
- Application Master: Acts as an interface between the Resource Manager and the Node Managers and negotiates resources as required. It also works with the Node Managers to execute and monitor the sub-tasks.
The Scheduler inside the Resource Manager allocates resources to the running applications. However, it does not monitor the status of an application, so it does not restart an application after a failure.
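To make the idea of resource allocation concrete, here is a deliberately simplified sketch of a scheduler handing out memory on the machines it knows about. It uses a first-fit strategy chosen only for illustration; the actual YARN schedulers are far more elaborate, and the function name, node names, and request format are assumptions.

```python
def allocate(requests, node_memory):
    """First-fit placement of (application, memory_mb) requests onto nodes.

    Returns the placements and the memory left on each node. A request
    that fits nowhere is recorded with a node of None.
    """
    free = dict(node_memory)  # copy, so the caller's view is not modified
    placements = []
    for app, memory_mb in requests:
        for node, available in free.items():
            if available >= memory_mb:
                free[node] = available - memory_mb
                placements.append((app, node))
                break
        else:
            placements.append((app, None))  # no node has enough free memory
    return placements, free


placements, remaining = allocate(
    [("app-a", 3072), ("app-b", 2048), ("app-c", 2048)],
    {"nm1": 4096, "nm2": 2048},
)
```

Like the Scheduler described above, this function only allocates: it has no notion of whether an application later succeeds or fails.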
4. Hive
Hive is a data analysis tool based on SQL methodology and interface; its query language is called HQL. It supports the common SQL data types, which makes query processing easier. Like other query processing frameworks, Hive comes with two components: the JDBC drivers and the Hive command line. The JDBC and ODBC drivers establish the data storage permissions and connections, whereas the Hive command line helps in the processing of queries. Hive reads and writes large datasets, and it allows both real-time and batch processing.
The main components of Hive are:
- MetaStore – Stores the metadata.
- Driver – Manages the lifecycle of an HQL statement.
- Query Compiler – Compiles HQL into a DAG (Directed Acyclic Graph).
- Hive Server – Provides an interface for JDBC/ODBC connections.
5. Pig
Developed by Yahoo, Pig is a platform for querying and analyzing data stored in HDFS. Pig has two components: Pig Latin, the language, and the Pig Runtime, the execution environment. Pig Latin has a command structure similar to SQL. At the back end, a Pig job is executed as MapReduce jobs.
The main features of Pig are as follows:
- Extensibility: Allows users to create their own custom functions.
- Optimization opportunities: Automatically optimizes the query, allowing users to focus on semantics rather than efficiency.
- Handles all kinds of data: Analyzes both structured and unstructured data.
The load command in Pig loads the data. At the back end, the compiler converts the Pig Latin into a sequence of MapReduce jobs. Various operations, such as joining, sorting, grouping, and filtering, can be performed on the data. The output can be dumped on the screen or stored in HDFS.
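The load, filter, group, and sort flow described above can be mimicked in plain Python to show the shape of such a data pipeline. This is not Pig Latin and does not use Pig; the sample records, field layout, and function names are invented for the sketch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: user, category, amount
RAW = "alice,books,12\nbob,games,30\nalice,games,8\n"


def load(text):
    """LOAD: parse comma-separated text into (user, category, amount) tuples."""
    return [(user, cat, int(amount)) for user, cat, amount in csv.reader(io.StringIO(text))]


def pipeline(text, min_amount):
    records = load(text)
    kept = [r for r in records if r[2] >= min_amount]        # FILTER
    groups = defaultdict(list)                               # GROUP by user
    for user, _category, amount in kept:
        groups[user].append(amount)
    totals = [(user, sum(amounts)) for user, amounts in groups.items()]
    return sorted(totals, key=lambda t: t[1], reverse=True)  # ORDER by total


result = pipeline(RAW, min_amount=10)
```

In Pig, each of these steps would be one statement, and the compiler would turn the whole script into MapReduce jobs.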
6. HBase
HBase is an open-source, non-relational, distributed NoSQL database built on top of HDFS. It supports all kinds of data. It is modeled on Google's Bigtable and is thus able to work on Big Data sets effectively. It is a column-oriented database management system that provides real-time read/write access to large datasets, and it is suitable for the sparse datasets that are very common in Big Data use cases. HBase offers low-latency storage, and enterprises use it for real-time analysis. HBase is designed to contain many tables, and each of these tables must have a primary key (the row key).
The various components of HBase are as follows:
6.1. HBase Master
- Maintains and monitors the Region Servers in the cluster.
- Performs administration of the database.
- Controls failover.
- Handles DDL operations.
6.2. Region Server
It is a process that handles read, write, update, and delete requests from the clients. It runs on every node of the Hadoop cluster that is an HDFS DataNode.
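The column-oriented, sparse data model can be pictured as a map from row key to only those columns that actually hold a value. The class below is a toy in-memory sketch under that assumption; `SparseTable` and its methods are hypothetical names and not the HBase client API.

```python
class SparseTable:
    """Toy model of a table: row key -> {"family:qualifier": value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = {}

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start, stop):
        """Yield rows in row-key order with start <= key < stop."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])


table = SparseTable(["info", "stats"])
table.put("user#2", "info:name", "Bob")
table.put("user#1", "info:name", "Alice")
table.put("user#1", "stats:logins", 7)
```

Note that `user#2` simply has no `stats:logins` entry; an absent cell costs no storage, which is why this layout suits sparse data.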
7. Mahout
Mahout provides a platform for creating scalable machine learning applications. It performs collaborative filtering, clustering, classification, and frequent itemset mining.
- Collaborative Filtering: Determines user behavior patterns and makes recommendations based on them.
- Clustering: Groups together similar types of data, such as articles, blogs, research papers, and news.
- Classification: Categorizes data into various sub-categories.
- Frequent Itemset Mining: Looks for items that are bought together and makes suggestions accordingly.
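As a small illustration of the frequent itemset idea, the sketch below counts how often pairs of items appear in the same basket and keeps the pairs that reach a minimum support. It is a naive pair-counting approach in plain Python, not Mahout's implementation; the function name and sample baskets are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["eggs", "milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

A recommender could then suggest "milk" to a customer who has put "bread" in the basket, because that pair is frequent.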
8. Zookeeper
Zookeeper coordinates the various services in the Hadoop ecosystem and, more generally, in a distributed environment. It saves a lot of time by handling synchronization, configuration maintenance, grouping, and naming. The main features of Zookeeper are as follows:
- Speed: It is fast in workloads where reads are more common than writes.
- Organization: It maintains a record of all transactions.
- Simple: It maintains a single, hierarchical namespace, similar to directories and files.
- Reliable: Zookeeper can be replicated over a set of hosts, and all instances are aware of each other. As long as a majority of the servers are available, Zookeeper is available.
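Two of the features above, the hierarchical namespace and the majority rule for availability, can be sketched in a few lines. This is a toy model, not the Zookeeper client API; `ZNodeTree`, `has_quorum`, and the paths used are illustrative names.

```python
def has_quorum(ensemble_size, live_servers):
    """The service stays available while a strict majority of servers is up."""
    return live_servers > ensemble_size // 2


class ZNodeTree:
    """Toy hierarchical namespace: every node is addressed by a slash path."""

    def __init__(self):
        self.nodes = {"/": b""}

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent does not exist: {parent}")
        if path in self.nodes:
            raise ValueError(f"node already exists: {path}")
        self.nodes[path] = data

    def get_children(self, path):
        prefix = "/" if path == "/" else path + "/"
        names = []
        for node in self.nodes:
            rest = node[len(prefix):]
            if node != path and node.startswith(prefix) and "/" not in rest:
                names.append(rest)
        return sorted(names)


tree = ZNodeTree()
tree.create("/config")
tree.create("/config/db", b"host=db1")
tree.create("/locks")
```

With five servers, the ensemble in this model tolerates two failures; with four servers it tolerates only one, which is why odd ensemble sizes are the usual choice.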
9. Oozie
Apache Oozie is an open-source web application written in Java. It acts as a clock and alarm service inside the Hadoop Ecosystem, much like a job scheduler. It schedules Hadoop jobs and binds multiple jobs together into a single logical unit of work. It can manage thousands of workflows in a Hadoop cluster, and it works by creating a Directed Acyclic Graph of the workflow. It is very flexible, as it can start, stop, suspend, and rerun failed jobs.
There are three kinds of Oozie jobs:
- Oozie Workflow: A sequential set of actions to be performed.
- Oozie Coordinator: Oozie jobs that are triggered when the data becomes available. A coordinator responds only to the availability of data and rests otherwise.
- Oozie Bundle: A package of many coordinator and workflow jobs.
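Representing a workflow as a Directed Acyclic Graph means that every action runs only after the actions it depends on. The sketch below shows that ordering with Python's standard `graphlib` module (Python 3.9+); it is not Oozie's workflow format, and the action names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "publish_report": {"clean_logs", "aggregate"},
}


def execution_order(workflow):
    """Return an order in which every action runs after its dependencies."""
    try:
        return list(TopologicalSorter(workflow).static_order())
    except CycleError as exc:
        # A cycle means the workflow is not a DAG and can never finish.
        raise ValueError("workflow is not a DAG") from exc


order = execution_order(WORKFLOW)
```

A scheduler built this way can also rerun a failed job by restarting from the failed action, since everything before it in the order has already completed.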
10. Sqoop
Sqoop imports data from external sources into compatible Hadoop Ecosystem components such as HDFS, Hive, and HBase. It also exports data from Hadoop to external sources. It works with RDBMSs such as Teradata, Oracle, and MySQL, and is designed for structured data. When a Sqoop command is submitted, it is divided into several sub-tasks at the back end. These sub-tasks are map tasks. Each map task imports a part of the data into Hadoop, so together the map tasks import the whole data set. Sqoop Export works in the same way: each map task exports a part of the data from Hadoop to the destination database.
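One way to picture how an import is divided among map tasks is to split the range of a numeric key column into contiguous slices, one per task. The sketch below assumes such an integer key and an assumed default of four map tasks; the function name is hypothetical and the code does not call Sqoop.

```python
def split_ranges(min_id, max_id, num_mappers=4):
    """Split the inclusive key range [min_id, max_id] into per-task slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges, start = [], min_id
    for task in range(num_mappers):
        size = base + (1 if task < extra else 0)  # spread the remainder
        if size == 0:
            continue  # fewer rows than tasks: skip empty slices
        ranges.append((start, start + size - 1))
        start += size
    return ranges


slices = split_ranges(1, 10, num_mappers=4)
```

Each slice would correspond to one map task reading only its own rows, so the tasks can run in parallel without overlapping.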
11. Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving massive amounts of streaming data from various web servers into HDFS. Flume has three components:
- Source: Accepts the data from the incoming stream and stores it in the channel.
- Channel: Acts as temporary storage between the source of the data and the persistent storage in HDFS.
- Sink: Collects the data from the channel and writes it permanently to HDFS.
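The source, channel, and sink roles can be sketched as a bounded in-memory buffer sitting between a producer and a consumer. This is a single-threaded toy, not Flume's agent configuration; the class, the function, and the list standing in for HDFS are assumptions made for the example.

```python
from collections import deque


class Channel:
    """Bounded temporary buffer between the source and the sink."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            return False  # channel is full; the source must wait
        self.events.append(event)
        return True

    def take(self, batch_size):
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


def run_agent(lines, channel, store, batch_size=2):
    """Source pushes lines into the channel; the sink drains them into `store`."""
    for line in lines:
        while not channel.put(line):
            store.extend(channel.take(batch_size))  # sink frees up space
    while channel.events:
        store.extend(channel.take(batch_size))      # final drain
    return store


stored = run_agent(["a", "b", "c", "d", "e"], Channel(capacity=2), [])
```

The channel decouples the two sides: the source is never blocked by a slow write as long as the buffer has room, and no event is dropped when it fills up.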
12. Ambari
Ambari is responsible for provisioning, managing, monitoring, and securing the Hadoop cluster. The different features of Ambari are as follows:
- Simplified cluster configuration, management, and installation.
- Reduced complexity in configuring and administering Hadoop cluster security.
- Defines a step-by-step procedure for installing Hadoop services on the Hadoop cluster.
- Handles configuration of services across the Hadoop cluster.
- Provides a dashboard for cluster monitoring.
- The Ambari Alert framework generates an alert when a node goes down or runs low on disk space.
13. Apache Drill
Apache Drill is a schema-free, distributed SQL query engine. It works on Hadoop, NoSQL, and cloud storage. Its primary purpose is large-scale processing of data with low latency. The main features of Apache Drill are as follows:
- Ability to scale to thousands of nodes.
- Supports NoSQL databases and cloud storage such as Azure Blob Storage, Google Cloud Storage, Amazon S3, HBase, MongoDB, and so on.
- A single query can draw on a variety of data stores.
- Supports millions of users and serves their queries over large data sets.
- Gives faster insights without ETL overheads such as loading, schema creation, maintenance, and transformation.
- Analyzes multi-structured and nested data without transforming or filtering.
14. Apache Spark
Apache Spark unifies all kinds of Big Data processing. It has built-in libraries for streaming, SQL, machine learning, and graph processing. Spark delivers fast performance for both batch and stream processing with the help of a DAG scheduler, a query optimizer, and a physical execution engine.
- Spark can run in standalone cluster mode or on Hadoop, Mesos, or Kubernetes.
- Spark applications can be written using SQL, R, Python, Scala, and Java.
- Spark offers over 80 high-level operators, which makes it easy to build parallel applications.
- It has various libraries, such as MLlib for machine learning, GraphX for graph processing, and SQL, DataFrames, and Spark Streaming.
- Spark performs in-memory processing, which makes it faster than Hadoop MapReduce.
15. Solr & Lucene
Apache Solr and Apache Lucene are two services that search and index data in the Hadoop Ecosystem. Apache Solr is built around Apache Lucene. Apache Lucene is written in Java and uses Java libraries for searching and indexing. Apache Solr is an open-source search platform. The different features of Apache Solr are as follows:
- Solr is highly scalable, reliable, and fault-tolerant.
- It provides distributed indexing, automated failover and recovery, load-balanced querying, and centralized configuration.
- A query can be sent using HTTP GET, and the results can be received in JSON, binary, CSV, or XML.
- It provides matching capabilities such as phrases, wildcards, grouping, joining, and much more.
- It has a built-in administrative interface for managing Solr instances.
- Solr takes advantage of Lucene's real-time indexing, so users can see new content as soon as they need it.
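Since queries are sent over HTTP GET, a client mostly needs to build a URL with properly encoded parameters. The sketch below only constructs such a URL and sends nothing over the network; the host, port, and collection name are placeholders, and the `q`, `rows`, and `wt` parameters are used on the assumption that the target is Solr's standard select handler.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10, fmt="json"):
    """Build a GET URL for a search query (placeholder host and collection)."""
    params = {"q": query, "rows": rows, "wt": fmt}
    return f"{base_url.rstrip('/')}/solr/{collection}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983", "articles", 'title:"big data"')
```

Changing `fmt` is how a client would ask for one of the other response formats listed above, such as CSV or XML.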
All the elements of the Hadoop Ecosystem are open-source Apache projects.
- At the core are HDFS for data storage, MapReduce for data processing, and YARN as the resource manager.
- Hive – A data analysis tool
- Pig – A SQL-like scripting language
- HBase – A NoSQL database
- Mahout – A machine learning tool
- Zookeeper – A synchronization tool
- Oozie – A workflow scheduler system
- Sqoop – A structured data importing and exporting utility
- Flume – A data transfer tool for unstructured and semi-structured data
- Ambari – A tool for managing and securing Hadoop clusters
Once you are clear on the above concepts, you are ready to explore this field further.