Throughput vs. Random Seek
Hadoop and MapReduce are optimized for throughput rather than random seeks, because typical workloads read large datasets sequentially. In other words, if you must move a big house from one city to another, a slow big truck will do a better job than a fancy small fast car.
Data Locality
It is cheaper to move compute logic than to move data. Hadoop therefore ships the computation code to the nodes where the data already resides, instead of pulling the data back and forth to a central compute server, as typically happens in a client/server model.
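To see locality at work, HDFS can report which DataNodes hold each block of a given file; map tasks are then scheduled on, or near, those nodes. The path below reuses the ngram example from the commands later in this article, and the exact report format varies by Hadoop version.
$ hdfs fsck /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-b -files -blocks -locations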
Data Storage in HDFS
Let's say we need to move a 1 GB text file into HDFS.
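With the default 128 MB block size (Hadoop 2.x and later), that 1 GB file is split into 8 blocks, and with the default replication factor of 3, HDFS stores 24 block replicas spread across the DataNodes. Both defaults are configurable, through dfs.blocksize and dfs.replication.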
HDFS in Picture
Examples of HDFS Commands
HDFS List file
$ hdfs dfs -ls /
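Each entry in the listing shows the permissions, replication factor, owner, group, size, modification time, and path, so it doubles as a quick way to check how heavily a file is replicated.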
HDFS move file to cluster
$ hdfs dfs -put googlebooks-eng-all-1gram-20120701-b /apps/hive/warehouse/menish.db/ngram/.
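To confirm the upload, list the target directory (same path as above); the file should now appear there.
$ hdfs dfs -ls /apps/hive/warehouse/menish.db/ngram/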
HDFS view file in the cluster
$ hdfs dfs -cat /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-a
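Since the file is about 1 GB, -cat streams the entire content to the terminal. A common pattern is to pipe it into head to sample the first lines, or to use -tail to view the last kilobyte:
$ hdfs dfs -cat /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-a | head -n 5
$ hdfs dfs -tail /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-a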
HDFS is a distributed file system that exposes an interface similar to a traditional file system. Behind the scenes, however, each file is split into blocks, and the blocks are replicated across the cluster for reliability and scalability.
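You can verify this on your own cluster (exact values depend on configuration): the cluster-wide block size can be read with getconf, and the stat format string below prints the replication factor, block size, and length of the ngram file used above.
$ hdfs getconf -confKey dfs.blocksize
$ hdfs dfs -stat "%r %o %b" /apps/hive/warehouse/menish.db/ngram/googlebooks-eng-all-1gram-20120701-b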
In the next article, we will look at the MapReduce programming model and see how to leverage this storage paradigm.