Introduction
This article will help you understand the need for a distributed computing environment, the role Hadoop plays in such a setup, and the core technologies that work with Hadoop.
As a kick start, let's look at the amount of data that giants like Facebook, Google, and the NSA deal with.
Note
All data specified below is sourced from Google.
Facebook
| Metric | Value |
| --- | --- |
| Data that Facebook holds | 300 Petabytes |
| Data processed by Facebook in a day | 600 Terabytes |
| Users in a month on Facebook | 1 Billion |
| Likes in a day on Facebook | 3 Billion |
| Photos uploaded in a day on Facebook | 300 Million |
NSA
| Metric | Value |
| --- | --- |
| Data stored by the NSA | 5 Exabytes |
| Data processed per day | 30 Petabytes |
It is said that the NSA touches about 1.6% of internet traffic per day. As we know, the NSA monitors web searches, websites visited, phone calls, financial transactions, health information, and so on.
Note - These figures are only estimates, as the NSA does not release its statistics to third parties.
Google
Google beats the other two giants by a wide margin in the amount of data it handles!
| Metric | Value |
| --- | --- |
| Data that Google holds | 15 Exabytes |
| Data processed in a day | 100 Petabytes |
| Number of pages indexed | 60 Trillion |
| Unique searches per month | 1 Billion plus |
| Searches per second | 2.3 Million |
With data sets this huge, a single machine, however powerful it may be, cannot store and process all of this data at such a scale. Hence, we move towards a Big Data system.
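To see why, here is a rough back-of-the-envelope calculation. The per-disk throughput (about 200 MB/s) and the disk count are assumptions chosen only for illustration, not figures from any vendor:

```java
// Back-of-the-envelope check: how long does it take just to READ 100 Petabytes?
// The disk throughput and disk count below are assumptions for illustration only.
public class ScanTime {
    public static void main(String[] args) {
        double bytes = 100e15;            // 100 Petabytes, the daily volume from the table above
        double diskThroughput = 200e6;    // assume ~200 MB/s sequential read from one disk

        double singleDiskSeconds = bytes / diskThroughput;
        System.out.printf("One disk     : %.0f days%n", singleDiskSeconds / 86400);

        int disks = 10_000;               // assume a cluster with 10,000 disks reading in parallel
        double clusterSeconds = singleDiskSeconds / disks;
        System.out.printf("10,000 disks : %.1f hours%n", clusterSeconds / 3600);
    }
}
```

Even before any computation happens, a single disk would need roughly 16 years just to read one day's worth of Google-scale data, while ten thousand disks reading in parallel finish in about half a day. Parallelism across many machines is not an optimization here; it is the only way the job gets done at all.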
Big Data System Requirements
Store - Raw Data
It should be able to store massive amounts of data, as the raw data is always huge.
Process
We should be able to extract just the useful information from it in a timely manner.
Scale
Finally, we need a scaling infrastructure that can keep up with the ever-growing data, so that we can continue to store it as the volume increases and still process it.
Note - Storage and processing are useless without a scaling infrastructure that can accommodate needs that keep increasing with the data.
Thus, to work with such an environment and draw insights from it, we need a distributed computing framework, and that is where Hadoop comes in.
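As a small preview of what working with such a framework feels like, here is the classic word-count job written against Hadoop's MapReduce API. This follows the standard example from the Apache Hadoop tutorial; the input and output paths are supplied as placeholder command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on every block of the input, on whichever
    // node holds that block, and emits (word, 1) for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: receives all the counts for one word and adds them up.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on each node to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The point to notice is that the code says nothing about which machines run the work, how the input is split, or what happens when a node dies; the framework handles all of that, which is exactly the role described in the rest of this article.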
Building a System for Computation
Any system that performs computations for us can be built in one of two ways.
- Monolithic System
Everything runs on a single machine, essentially a supercomputer, which processes all the data and instructions.
- Distributed System
There are multiple machines with multiple processors, organized as a set of nodes called a cluster, where no single node is a supercomputer, yet the cluster acts as a single entity. Such a system scales roughly linearly: if you double the number of nodes, you get double the storage and roughly double the processing speed, as the sketch below illustrates.
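Here is that sketch, putting rough numbers on linear scaling. The per-node figures (8 TB of disk and 1 GB/s of scan throughput per node) are assumed values for illustration only:

```java
// Illustrative only: assumed capacity and throughput for one commodity node.
public class LinearScaling {
    static final double TB_PER_NODE = 8;     // assumed disk capacity per node
    static final double GBPS_PER_NODE = 1;   // assumed scan throughput per node

    static void report(int nodes) {
        System.out.printf("%3d nodes -> %4.0f TB of storage, %3.0f GB/s of aggregate throughput%n",
                nodes, nodes * TB_PER_NODE, nodes * GBPS_PER_NODE);
    }

    public static void main(String[] args) {
        report(10);   // baseline cluster
        report(20);   // double the nodes: double the storage and throughput
        report(40);   // double again
    }
}
```

In practice the speed-up is not perfectly linear because of coordination and network overhead, but the storage really does grow with every node you add.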
This distributed approach is how companies like Facebook, Google, and Amazon operate, with clusters of machines in vast server environments, and this is where the actual data processing takes place.
Now, all the machines in this distributed environment (commodity machines, not high-end servers) need to be coordinated by a single piece of software. This software takes care of the needs of the distributed system: partitioning the data across the multiple nodes, replicating the data on the nodes, and coordinating the computing tasks that run on the machines in parallel. It also handles fault tolerance and recovery, such as disk failures and node failures, and allocates work according to each node's capacity in terms of memory and disk space.
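To make the partitioning and replication part concrete, here is a toy sketch of what such a coordinator has to decide. This is not HDFS code; the block size, replication factor, node names, and round-robin placement are all invented for the example (HDFS defaults happen to use a 128 MB block size and 3 replicas, but its real placement policy is rack-aware and more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: split one file into blocks and decide which nodes hold each block.
public class BlockPlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;  // 128 MB per block (assumed)
    static final int REPLICATION = 3;                    // keep 3 copies of every block (assumed)

    public static void main(String[] args) {
        long fileSize = 1L * 1024 * 1024 * 1024;         // a 1 GB file to distribute
        List<String> nodes = List.of("node1", "node2", "node3", "node4", "node5");

        long blocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;  // ceiling division
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                // naive round-robin placement across the cluster's nodes
                replicas.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            System.out.println("block " + b + " -> " + replicas);
        }
        // If one node fails, every block it held still has two live copies on
        // other nodes, and the coordinator can copy them again to restore three.
    }
}
```

The real coordinator also has to keep track of which replicas exist, re-balance them as nodes come and go, and schedule compute tasks close to the data; that is the job Hadoop takes on.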
My future articles will be all about Big Data and Hadoop, so keep an eye out for them.