A distributed system is the core concept behind Hadoop.
Why do we need a distributed system?
Companies like Facebook, NASA, and Google have huge amounts of data (hundreds of petabytes or even exabytes). Every like, click, comment, review, update, search, call, visit, piece of financial and health information, transaction, photo upload, etc. gets stored somewhere, usually in a database, and together these form enormous data sets.
The amount of data has grown considerably in recent years due to the growth of social networking, education, surveillance cameras, health care, business, satellite images, manufacturing, online purchasing, research analysis, banking, bioinformatics, the Internet of Things, criminal investigation, media, and information technology. This huge volume of data has created a new field in data processing called Big Data.
When the amount of data is small, a system with limited resources (hardware, software, etc.) can handle it. As the data grows, that same system can no longer cope, so one idea is to scale up the single system with bigger hardware and software. But this approach has serious drawbacks:
- Expensive hardware costs.
- Expensive software costs.
- High risk of failure.
But the data keeps increasing, and this is where Hadoop's distributed computing concepts come to the rescue.
Instead of a single powerful machine, the task is distributed across a cluster of machines.
Advantages
- Commodity hardware.
- Free, open-source software (no license costs).
- Reduced risk of failure: no single point of failure (SPOF).
- Roughly 10 times the processing power at 1/10th of the cost.
Big Data
Big data refers to data sets that are so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications.
Big data brings challenges that include capturing data, data storage, querying, data analysis, search, sharing, transfer, visualization, updating, and information privacy.
There are three dimensions to big data, known as the three Vs: Volume, Variety, and Velocity.
Big Data System Requirements
A single system cannot meet big data requirements (compute resources, memory, storage, speed, time), and traditional data technologies are not able to meet them either.
Distributed computing frameworks like Hadoop were developed for exactly this reason: we need a distributed system.
There are two ways to build such a system: monolithic or distributed.
Monolithic
- Like a single powerful player.
- A single powerful server.
- Spending 2x the money yields less than 2x the performance.
Distributed
- Like a team of good players who know how to pass.
- A cluster of decent machines that know how to parallelize.
Distributed System
Many small, cheap computers come together and act as a single entity, and such a system can scale linearly:
- 2x the nodes
- 2x the storage
- 2x the speed
Server Farms
Companies like Facebook, Google, and Amazon are building vast server farms. These farms have thousands of servers working in tandem to process complex data.
All these servers need to be coordinated by a single piece of software.
This single coordinating piece of software must:
- Partition the data.
- Coordinate the computing tasks.
- Handle fault tolerance and recovery.
- Allocate capacity to processes.
Introducing Hadoop
- Google developed software to run on such distributed systems, and Hadoop is the open-source framework built on those ideas.
- It stores millions of records across multiple machines.
- It runs processes on all these machines to crunch the data.
Hadoop is that single coordinating piece of software.
There are two core components of Hadoop:
HDFS (Hadoop Distributed File System): stores the data across the machines in the cluster.
MapReduce (MR): a framework to define a data processing task.
Alongside them sits YARN (Yet Another Resource Negotiator): a framework to run the data processing task on the cluster. (We will discuss YARN in more detail in a later blog.)
Coordination between Hadoop blocks
Fig. a: The user defines map and reduce tasks using the MapReduce API.
Fig. b: A job is triggered on the cluster.
Fig. c: YARN figures out where and how to run the job, and the result is stored in HDFS.
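To make the flow concrete, here is a minimal sketch of defining map and reduce tasks with the MapReduce API: the classic word count. It assumes the hadoop-client libraries are on the classpath; the input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
    }
}
```

Packaging this class into a jar and submitting it with the hadoop jar command triggers the job on the cluster (Fig. b), and YARN schedules the map and reduce tasks on the worker nodes, with the result written to HDFS (Fig. c).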
Other technologies in the Hadoop ecosystem
An ecosystem of tools has sprung up around this core piece of software.
Hive
Provides a SQL interface to Hadoop.
The bridge to Hadoop for folks who don't have a background in object-oriented programming in Java.
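As a rough sketch of what that SQL interface looks like from code, the snippet below queries Hive over JDBC. It assumes a HiveServer2 instance listening at localhost:10000 and the hive-jdbc driver on the classpath; the table name page_views is purely hypothetical, and the same query could just as well be typed into the Hive/Beeline shell.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needed explicitly on older driver versions)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Plain SQL; Hive compiles it into jobs that run on the cluster
             ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```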
HBase
A database management system that runs on top of Hadoop.
It integrates with your application just like a traditional database.
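Below is a minimal sketch of writing and reading a single row with the HBase Java client, much as you would with a traditional database driver. It assumes an HBase cluster reachable through the default configuration; the table name users and column family info are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: row key "user1", column info:email
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user1@example.com"));
            table.put(put);

            // Read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```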
Pig
A data manipulation language.
It transforms unstructured data into a structured format.
You can then query this structured data using interfaces like Hive.
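As a rough sketch, the snippet below embeds a small Pig Latin script in Java via the PigServer API: it loads raw log lines, gives them a schema, and stores the structured result. The input/output paths and field names are hypothetical, and in practice Pig scripts are more often run directly from the grunt shell or as .pig files.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);  // local mode for illustration

        // Load raw space-delimited log lines and give them a schema
        pig.registerQuery("raw = LOAD 'input/logs.txt' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray, bytes:long);");

        // Keep only the columns we care about
        pig.registerQuery("trimmed = FOREACH raw GENERATE ip, bytes;");

        // Write the structured output, ready to be queried by tools like Hive
        pig.store("trimmed", "output/trimmed");
    }
}
```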
Spark
A distributed computing engine used alongside Hadoop.
Provides an interactive shell to quickly process data sets.
Comes with built-in libraries for machine learning, stream processing, graph processing, and more.
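Here is a minimal word-count sketch using Spark's Java API (Spark 2.x or later). It assumes spark-core is on the classpath, and the HDFS paths are hypothetical; the equivalent few lines can also be typed interactively into the Scala spark-shell or pyspark.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    // split each line into words
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    // pair each word with a count of 1
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    // sum the counts per word across the cluster
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///data/word-counts");
        }
    }
}
```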
Oozie
A tool to schedule workflows on all the Hadoop ecosystem technologies.
Flume/ Sqoop
Tools to transfer data between other systems and Hadoop: Flume for ingesting streaming and log data, Sqoop for moving data to and from relational databases.
This was just a basic introduction to Hadoop. In the next blog, we will discuss Hadoop Distributed File System (HDFS) commands and the HDFS architecture.