Introduction
This article gives an overview of modern data analytic tools, covering their processing models, types, goals, and workflows. The basics are easy to understand.
Modern Data Analytic Tools
A large number of tools are available for processing Big Data. Current techniques for analyzing Big Data place emphasis on three important emerging tools:
- MapReduce
- Apache Spark
- Storm
Most of the available tools concentrate on:
- Batch Processing
- Stream Processing
- Interactive Analysis
Batch Processing
Batch processing (the execution of a series of jobs in a program on a computer) tools are mostly based on the Apache Hadoop infrastructure, such as Hadoop MapReduce, discussed below.
Stream Processing
Stream processing is akin to dataflow programming and parallel processing. Streaming data applications are mostly used for real-time analytics.
Storm, discussed below, is an example of a large-scale streaming platform.
Interactive Analysis
Interactive analysis allows users to interact with the data directly, in real time, and carry out their own analysis; Apache Drill and Dremel, discussed below, are examples.
Workflow of a Big Data Project
Apache Hadoop and MapReduce
- Apache Hadoop is the most established software platform for Big Data analysis.
- It consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), Apache Hive, and other components.
- MapReduce is a programming model for processing large datasets based on the divide-and-conquer method, implemented in two steps: a Map step and a Reduce step (see the sketch after this list).
- Hadoop works on two kinds of nodes,
- Master Node
- Worker Node
(The master node divides the input into smaller subproblems and distributes them to the worker nodes, which process them.)
- It is helpful for fault-tolerant storage and high-throughput data processing.
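To make the two-step model concrete, here is a minimal word-count sketch in plain Python. It is illustrative only: a real Hadoop job would be written against the Java MapReduce or Hadoop Streaming APIs, and the framework, not this script, would distribute the map and reduce work across the worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data tools", "big data analytics"]
# The master node would split `docs` into subproblems for the worker nodes;
# here the map step simply runs over each document locally.
mapped = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```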
Apache Mahout
- It aims to provide scalable, commercial machine learning techniques for large-scale and intelligent data analysis applications.
- These include clustering, classification, pattern mining, regression, dimensionality reduction, and evolutionary algorithms (a clustering sketch follows this list).
- Goal
To build a vibrant, responsive, diverse community to facilitate discussions on the project and potential use cases.
- Objective
To provide a tool for alleviating big challenges.
- Companies that have implemented scalable machine learning algorithms include Google, IBM, Amazon, Yahoo, Twitter, and Facebook.
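As a small illustration of one of these algorithm families, the following is a toy k-means clustering sketch in plain Python. It is not Mahout's API (Mahout itself is a Java/Scala library); it only shows the kind of computation that Mahout scales to large datasets.

```python
import random

def kmeans(points, k, iterations=10):
    # Pick k initial centroids at random from the data.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: attach each point to its nearest centroid.
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
print(centroids, clusters)
```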
Apache Spark
- It is an open-source Big Data processing framework built for speed and sophisticated analytics.
- Spark lets us quickly write applications in Java, Scala, or Python.
- It supports SQL queries, streaming data, machine learning and graph data processing.
- It consists of three components,
- Driver Program
- Cluster Manager
- Worker Nodes
- The Driver Program serves as the starting point of execution.
- The Cluster Manager allocates the resources, and the worker nodes perform the data processing in the form of tasks (a PySpark sketch follows this list).
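A minimal PySpark word count shows how these components cooperate. It assumes the `pyspark` package and a local Spark runtime are available; on a real cluster the master URL would point at a cluster manager such as YARN instead of local mode.

```python
from pyspark.sql import SparkSession

# The driver program is the starting point: it creates the session.
spark = (SparkSession.builder
         .appName("WordCount")
         .master("local[*]")   # local mode; an assumption for this sketch
         .getOrCreate())

lines = spark.sparkContext.parallelize(["big data tools", "big data analytics"])

# These transformations are executed as tasks on the worker nodes.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('big', 2), ('data', 2), ('tools', 1), ('analytics', 1)]
spark.stop()
```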
Dryad
- It is another popular programming model for implementing parallel and distributed programs that handle large amounts of data, based on a dataflow graph.
- It consists of a cluster of computing nodes.
- A Dryad user can use thousands of machines, each with multiple processors or cores.
- Its advantage is that users do not need to know anything about concurrent programming.
- It provides a large number of features, including generating the job graph, scheduling the available machines, handling transient failures in the cluster, and collecting performance metrics (a toy dataflow-graph sketch follows this list).
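The dataflow-graph idea can be sketched in a few lines of plain Python. This is not Dryad's actual API (Dryad jobs are written against its C++/.NET libraries); it only illustrates a job expressed as vertices connected by data channels, which Dryad would schedule across the machines of a cluster.

```python
# Vertices are computations; edges are the data channels between them.
vertices = {
    "read":   lambda _: [1, 2, 3, 4],           # source vertex
    "square": lambda xs: [x * x for x in xs],   # transformation vertex
    "sum":    lambda xs: [sum(xs)],             # aggregation vertex
}
edges = [("read", "square"), ("square", "sum")]

def run(vertices, edges):
    # Execute the vertices in the order implied by the edges (a simple chain here);
    # Dryad would instead schedule them on available cluster machines.
    data = None
    for name in ["read"] + [dst for _, dst in edges]:
        data = vertices[name](data)
    return data

print(run(vertices, edges))  # [30]
```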
Storm
- It is a distributed and fault-tolerant real-time computation system for processing large volumes of streaming data.
- It is especially designed for real-time processing, in contrast to Hadoop, which is designed for batch processing.
- It is also easy to set up and operate, and it is fault tolerant while providing competitive performance.
- A Storm cluster is similar to a Hadoop cluster.
- On a Storm cluster, users run different topologies for different Storm tasks (a spout/bolt sketch follows this list).
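The topology idea can be illustrated with a plain-Python sketch of a spout feeding two bolts. This is not the real Storm API (topologies are normally written in Java, or in Python through a binding such as streamparse); it only shows how tuples flow through a topology.

```python
def sentence_spout():
    # Spout: emits a stream of tuples (finite here only for the demo).
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence

def split_bolt(stream):
    # Bolt: splits each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: keeps running word counts as tuples arrive.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire the topology: spout -> split bolt -> count bolt.
for word, count in count_bolt(split_bolt(sentence_spout())):
    print(word, count)
```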
Apache Drill
- It is another distributed system for interactive analysis of big data.
- It has more flexibility to support many types of query languages, data formats and data sources.
- It is especially designed to exploit nested data.
Objective - To scale up to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
- Drill uses HDFS (Hadoop Distributed File System) for storage and MapReduce to perform batch analysis (a nested-data sketch follows this list).
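The kind of nested-data query Drill executes can be sketched in plain Python. In Drill itself this would be a SQL query (for example, one using Drill's FLATTEN function over a JSON file); the records below are made up for illustration.

```python
# Nested JSON-like records: each customer has a repeated "orders" field.
records = [
    {"name": "alice", "orders": [{"amount": 10}, {"amount": 25}]},
    {"name": "bob",   "orders": [{"amount": 5}]},
]

# Flatten the nested orders and aggregate the total amount per customer.
totals = {}
for rec in records:
    for order in rec["orders"]:
        totals[rec["name"]] = totals.get(rec["name"], 0) + order["amount"]

print(totals)  # {'alice': 35, 'bob': 5}
```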
Splunk
- In recent years, a great deal of data has been generated by machines in business and industry.
- It is a real-time and intelligent platform developed for exploiting machine-generated big data.
- It combines up-to-the-moment cloud technologies with Big Data.
- It helps users search, monitor, and analyze their machine-generated data through a web interface.
Web interface
The results are presented in intuitive forms such as graphs, reports, and alerts.
Objective
The objective of Splunk is to provide metrics for many applications, diagnose problems for systems and IT infrastructure, and give intelligent support for business operations.
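A plain-Python sketch of this search-monitor-alert workflow follows. In Splunk the same thing would be done with a search in the web interface; the log lines and the error threshold below are made-up examples.

```python
import re
from collections import Counter

# Made-up machine-generated events.
logs = [
    "2024-01-01T10:00:00 ERROR disk full on node-3",
    "2024-01-01T10:00:05 INFO  job started",
    "2024-01-01T10:00:09 ERROR disk full on node-3",
]

# "Search": extract the severity field from each event and count by value.
severities = Counter(re.search(r"\b(ERROR|WARN|INFO)\b", line).group(1) for line in logs)
print(severities)

# "Alert": report when the error count crosses a threshold, as a dashboard alert would.
if severities["ERROR"] >= 2:
    print("ALERT: error threshold exceeded")
```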
Dremel
- It is a scalable, interactive ad-hoc query system for analysis of read-only nested data.
- By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds.
- The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google.
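The benefit of the columnar layout can be sketched in plain Python: an aggregation query only needs to scan the single column it touches rather than whole records. This is a toy illustration, not Dremel's implementation.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "US", "latency_ms": 120},
    {"user": "b", "country": "DE", "latency_ms": 80},
    {"user": "c", "country": "US", "latency_ms": 95},
]

# Column-oriented layout: each field is stored as its own array.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregation query (average latency) reads just one column.
latencies = columns["latency_ms"]
print(sum(latencies) / len(latencies))  # ~98.3
```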