What is MapReduce? - Introduction
MapReduce often confused as a tool – is actually a programming model or a framework designed for parallel processing. With the advent of big data, it became necessary to process large chunks of data in the least amount of time and yet give accurate results. MapReduce can quickly summarize, classify, and analyze complex datasets.
1. The traditional way of doing things
Earlier, there used to be centralized servers to store and process all the data. This mechanism had one problem – the server would always be overloaded and sometimes crash because the data from different sources would go for processing to the centralized system.
2. MapReduce way of processing
MapReduce was developed as an algorithm by Google to solve this issue. Now, it is extensively used in the Apache Hadoop framework, which is one of the most popular frameworks for handling big data. In this approach, rather than data being sent to a central server for processing, the processing and computation would happen at the local data source itself, and the results would then be aggregated and reduced to produce the final output. The sequence is always ‘Map’ and then ‘Reduce.’ The map is first performed, and once it is 100% complete, Reduce is executed. We will discuss this with a detailed diagram in a while, but before that, just keep the following features handy –
Features of MapReduce
- Written in Java; language independent
- Large scale distributed and parallel processing
- Local processing, with in-built redundancy and fault tolerance
- Map performs filtering and sorting while reducing aggregations
- Works on Linux based operating systems
- Comes by default with Hadoop framework
- Local processing rather than centralized processing
- Highly scalable
MapReduce has many design patterns and algorithms. In many articles on the web, you must have seen the basic counting, summing, and sorting algorithms. There are other algorithms like collation, grepping, parsing, validation (based on some conditions). More complex patterns include processing of graphs or iterative message passing, counting unique (distinct) values, data organization (for further processing), cross-correlation, Relational patterns like selection, projection, intersection, union, difference, aggregation and joins can also be implemented in MapReduce terms.
How MapReduce works
Now comes the exciting part, where we will see how the entire process of Map and Reduce works in detail.
There are 3 steps in the MapReduce algorithm which get executed sequentially –
1. Map function
The map function gets the input dataset (the huge one) and splits it into smaller datasets. Each dataset is then processed parallelly, and required computations are done. The map function converts the input into a set of key-value pairs.
As we see in the diagram, the input data set is present in the HDFS (Hadoop Distributed File System). From there, it is split into smaller datasets upon which sub-tasks are performed parallelly. Then, the data is mapped as key-value pairs, which is the output of this step.
Data are shuffling consists of merging and sorting. Shuffle is also called a combined function. The input of this stage is the key-value pairs obtained in the previous step.
The first step is merging, where values with same keys are combined, thus returning a key-value pair where value is a list and not a single value – (Key, List[values])
The results are then sorted based on the key in the right order.
This is now the input to the Reduce function.
The reduce function performs some aggregation operations on the input and returns a consolidated output, again as a key-value pair.
Note that the final output is also key-value pairs and not a list, but aggregated one. For example, if you have to count the number of times the word ‘the,’ ‘MapReduce’ or ‘Key’ has been used in this article, you can select the entire article and store it as an input file. The input file will be picked by the MapReduce libraries and executed. Suppose this is the input text,
‘MapReduce is the future of big data; MapReduce works on key-value pairs. Key is the most important part of the entire framework as all the processing in MapReduce is based on the value and uniqueness of the key.’
The output will be something like this –
MapReduce = 3
Key = 3
The = 6
amongst the other words, like future, big, data, most, the values of which will be 1. As you might have guessed, words are the keys here, and the count is the value.
The entire sentence will be split into 3 sub-tasks or inputs, and parallel processed, so let us say,
Input 1 = ‘MapReduce is the future of big data; MapReduce works on key-value pairs. Key is the most important part of the entire framework
Input 2 = as all the processing in MapReduce is based on the value and uniqueness of the key.
In the first step, of mapping, we will get something like this,
|MapReduce = 1|
|The = 1|
|MapReduce = 1|
|Key = 1|
|Key = 1|
|The = 1|
|The = 1|
Amongst other values.
Same way, we would get these values from Input 2 –
|The = 1|
|MapReduce = 1|
|The = 1|
|The = 1|
|Key = 1|
Among other results.
The next step is to merge and sort these results. Remember that this step gives out a list of values as the output.
The next step is Reduce, which performs the aggregation, in this case – sum. We get the final output that we saw above,
|MapReduce = 3|
|Key = 3|
|The = 6|
Advantages of MapReduce
We have already seen that the MapReduce approach to handle data is faster because of its parallel processing. Here are some more advantages of using MapReduce –
- The ability to store data in a distributed environment across multiple servers makes it scalable. If more and more servers are added, the efficiency of the system further improves.
- Both structured and unstructured data can be quickly processed irrespective of the data source and type of data, thus giving the necessary flexibility that was not there with traditional relational database systems.
- Cost-effective solution to analyze huge chunks of business data as the servers deployed for parallel processing is cheap.
- Data redundancy and fault tolerance is high because even if one node fails, the data can be recovered from other nodes, and is never lost. The model also recognizes and corrects faults very quickly.
- Easy to learn. Since it is written in Java, it is easy to learn and apply. Developers can easily use the model to write their own implementation for specific business purposes. The package provides many samples, like word counter, sudoku solver, multifile word counter, tera sort, word meaning, word median.
- Security features are in-built to provide only authorized access to critical data of the organization.
Constraints in MapReduce
Rather than categorizing certain situations as disadvantages of MapReduce, we would instead prefer saying that MapReduce may not be the best solution in some scenarios. This is true for any programming model. Some cases where MapReduce falls short –
- Real-time data processing – While the MR model works on vast chunks of data stored somewhere, it cannot work on streaming data.
- Processing graphs
- If you have to process your data again and again for many iterations, this model may not be a great choice
- If you can get the same results on a standalone system and do not have multiple threads, it is not required to install multiple servers or do parallel processing.
Some real-world examples
We have already seen how MapReduce can be used for getting the count of each word in a file. Let us take some more practical problems and see how MapReduce can make analysis easy.
- Identifying potential clicks for conversion – Suppose you want to design a system to identify a number of clicks that can convert, out of all the clicks. Out of the vast data received from publishers or ad networks (like Google), some clicks might be fraudulent or non-billable. From this vast data set (let’s say of 50 million clicks), we need to fetch the relevant information, for example, the IP address, day, city. Using MapReduce, we can create a summary of the data by dividing this vast data set into smaller subsets. Once we get the summary subsets, we can sort and merge them. This will generate a final single summary set. On this set, we can apply the required rules and other analyses to find the conversion clicks.
- Recommendation engines – The concept of recommendation engines is prevalent nowadays. Online shopping giants like Amazon, Flipkart, and others offer to recommend ‘similar’ products or products a user may like. Netflix offers movie recommendations.
One way is through movie ratings. The MapReduce function first maps users, movies, and ratings and creates key-value pairs. For example, (movie, ) where movie name is the key and the value can be a tuple containing the user name and their respective ratings. Then through correlation (mutual relation), we can find the similarity between the two movies; for example, movies of the same genre or finding users who have seen both movies and shared their ratings can give us the information on how closely the two movies are related.
- Storing and processing health records of patients: Patient health records can be digitally stored using the MapReduce programming model. The data can be stored on the cloud using Hadoop or Hive. In the same manner, massive sets of big data can be processed, including clinical, biometrics, and biomedical data, with promising results for analysis. MapReduce helps in making the biomedical data mining process faster.
- Identifying potential long-term customers based on their activities: based on the customer details and their transaction details, the MapReduce framework can determine the frequency of a user’s transactions and the total time he spends on them. This information will then be shuffled and sorted and then iterate through each customer’s transactions to know their number of visits and the total amount spent by them to date. This will give a fair idea of long-term customers, and companies can send offers and promotions to keep the customers happy.
- Building user profiles for sending targeted content – It is effortless to build user profiles using MapReduce. The algorithms of sorting join, correlation, are used to analyze and group users based on their interests so that relevant content can be sent to a specific set of users.
- Data tracking and logistics – Many companies use Hadoop to store sensor data from the shipment vehicles. The intelligence that is derived from this data enables companies to save lots of money on fuel cost, workforce, and other logistics. HDFS can store geodata as well as multiple data points. The data is then divided into subsets and using various MapReduce algorithms, metrics like risk factors for drivers, mileage calculation, tracking, and a real-time estimate of delivery can be calculated. Each of the above metrics will be a separate MapReduce job.
Some more examples
What we have seen above are some of the most common applications of MapReduce. MapReduce is used in many more scenarios. The algorithm is extensively used in data mining and machine learning algorithms using HDFS as storage. With the introduction of YARN, the processing has moved to YARN; the storage still lies with HDFS. Some more applications of MapReduce are –
- Analyzing and indexing text information
- Crawl blog posts to process them later
- Face and image recognition from large datasets
- Processing log analysis
- Statistical analysis and report generation
With this article, we have understood the basics of MapReduce and how it is useful for big data processing. There are many samples provided along with their distribution package, and developers can write their own algorithms to suit their business needs. MapReduce is a useful framework. At Hackr, we have some of the best tutorials for Hadoop and MapReduce. Do check them out and also let us know if you found this article useful or want any more information to be added to the article.
People are also reading: