Introduction
There is a lot of buzz in the industry around Big Data, and naturally a lot of questions and confusion. In this series of articles, I will try to make the concepts easier to understand.
Big Data
Big Data is a set of technologies that lets users store and process data across multiple machines working as a single entity. Think of it as a poor man's supercomputer.
One of the earliest examples of this idea is SETI@home, which built a virtual supercomputer out of hundreds of thousands of individual personal computers to analyze captured radio signals in search of intelligent life in our galaxy. Later, similar technology was used to analyze human DNA for cancer research.
In the early 2000s, Google developed its own version of this technology to analyze the world's websites and power its search engine. Today, it sits behind most of Google's major products, such as Google Maps.
Google later published a series of papers that were picked up by engineers at Yahoo, who eventually turned the technology into open source, and that is how the Big Data movement began.
Let's run through an example to see how this technology works; hopefully it will answer some questions.
Let's calculate an average of numbers.
Today's Technology
Suppose we have 1,000 numbers, all random positive integers.
- Formula for Average = Sum(Numbers) / Count(Numbers)
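On a single machine, this is trivial. Below is a minimal Python sketch; the list name numbers and the random data are just placeholders for illustration, not anything prescribed by the formula above:

```python
import random

# 1,000 random positive integers, all held on a single machine.
numbers = [random.randint(1, 1000) for _ in range(1000)]

# Average = Sum(Numbers) / Count(Numbers)
average = sum(numbers) / len(numbers)
print(average)
```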
In Hadoop World
Suppose we have 1 billion numbers, all random positive integers; call this data set X.
- Formula for Average = Sum(Numbers) / Count(Numbers)
Since we know that an average of averages is not, in general, the same as the overall average, we need to do something different. For example, the average of {1, 2, 3} is 2 and the average of {10, 20} is 15; the average of those two averages is 8.5, but the true average of all five numbers is 36 / 5 = 7.2. Let's see what we can do instead.
We want Hadoop to solve it, so we need multiple machines; let's call them worker machines M1 - M5.
Step 1
Store 1/5 of the data on each machine; let's call the chunks of data X1 - X5.
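A minimal sketch of this split, assuming for illustration that the whole data set fits in a single Python list (in a real cluster, HDFS would split and distribute the blocks for us; the names X and chunks are just examples):

```python
import random

# Simulate the full data set X with a list of random positive integers.
X = [random.randint(1, 1000) for _ in range(1_000_000)]

# Split X into 5 chunks X1..X5, one for each worker machine M1..M5.
num_workers = 5
chunks = [X[i::num_workers] for i in range(num_workers)]

# chunks[0] plays the role of X1 on machine M1, chunks[1] is X2 on M2, and so on.
```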
Step 2
On each machine, calculate the Sum (S) and the Count (C) of its own chunk, as in the following:
- On Machine M1, chunk X1 gives us S1, C1
- On Machine M2, chunk X2 gives us S2, C2
- On Machine M3, chunk X3 gives us S3, C3
- On Machine M4, chunk X4 gives us S4, C4
- On Machine M5, chunk X5 gives us S5, C5
This step is also called the Map Step.
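Here is a minimal sketch of what each worker does in the Map step. This is a simplified single-process simulation, not the actual Hadoop API; the function name map_step and the tiny sample chunks are purely illustrative:

```python
def map_step(chunk):
    # Each Map worker emits an intermediate (Sum, Count) pair for its own chunk.
    return sum(chunk), len(chunk)

# Simulate the five workers M1..M5, each holding its own chunk X1..X5.
chunks = [[1, 2, 3], [10, 20], [4, 5, 6], [7, 8], [9, 10, 11]]
pairs = [map_step(chunk) for chunk in chunks]
print(pairs)  # [(6, 3), (30, 2), (15, 3), (15, 2), (30, 3)]
```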
Step 3
Copy all the {S, C} pairs to a single machine and combine them:
- Sum_Master = Sum(S1, S2, S3, S4, S5)
- Count_Master = Sum(C1, C2, C3, C4, C5)
- Average = Sum_Master / Count_Master
This step is also called the Reduce Step.
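And a matching sketch of the Reduce step, again as a simplified simulation: it assumes the intermediate (S, C) pairs from the Map step have already been copied to one machine, and the names reduce_step and pairs are illustrative only:

```python
def reduce_step(pairs):
    # Collect all the (S, C) pairs from the Map workers on one machine
    # and reduce them to the final answer.
    sum_master = sum(s for s, c in pairs)
    count_master = sum(c for s, c in pairs)
    return sum_master / count_master

# Intermediate pairs produced by the five Map workers in the previous step.
pairs = [(6, 3), (30, 2), (15, 3), (15, 2), (30, 3)]
print(reduce_step(pairs))  # 96 / 13 = 7.3846...
```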
Observations
- Data needs to be stored on worker machines.
- Each worker machine works on its own chunk of data.
- A worker machine can work in Map mode or Reduce mode.
- In Map mode, a worker produces an intermediate result (here, the {S, C} pair) from its own chunk.
- In Reduce mode, a worker collects the intermediate results from the Map workers and reduces them further to get the final result.
- As you can see, this is a divide and conquer approach, working in batch mode.
- Highly parallel workloads fit this approach very well, as long as we can divide the task into smaller chunks of work and process them as a batch.
- Highly recursive tasks, such as graph search and other recursion-based algorithms, do not perform well; we need something different, and I will talk about it later.
- Most important, we had to restructure the simple Average formula so it could work in parallel, and not all algorithms can be broken up in this fashion.
- However, with large quantities of data we generally do not need very complex algorithms; simple algorithms work well.
I hope this example was helpful. I will post a new article with more details on the framework and the inner workings of data storage, coordination, and algorithms.
Resources