Introduction
This article gives an overview of modern data analytic tools, covering their processing models, types, goals, and workflows. The basics are easy to understand.
Modern Data Analytic Tools
A large number of tools are available for processing Big Data. Current techniques for analyzing Big Data place emphasis on three important emerging tools:
- MapReduce
- Apache Spark
- Storm
Most of the available tools concentrate on:
- Batch Processing
- Stream Processing
- Interactive Analysis
Batch Processing
Batch processing (the execution of a series of jobs in a program on a computer) tools are mostly based on the Apache Hadoop infrastructure, such as Hadoop MapReduce, discussed below.
Stream Processing
Stream processing is akin to dataflow programming and parallel processing. Streaming data applications are mostly used for real-time analytics.
Storm, discussed below, is an example of a large-scale streaming platform.
Interactive Analysis
Interactive analysis allows users to interact with the data directly, in real time, and carry out their own analysis; Apache Drill and Dremel, discussed below, are examples.
Workflow of a Big Data Project
Apache Hadoop and MapReduce
- Apache Hadoop is the most established software platform for Big Data analysis.
- It consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), Apache Hive, and other components.
- MapReduce is a programming model for processing large datasets based on the divide-and-conquer method, implemented in two steps: a Map step and a Reduce step (see the sketch after this list).
- Hadoop works on two kinds of nodes,
- Master Node
- Worker Node
(The master node divides the input into smaller subproblems and distributes them to the worker nodes, which process them.)
- It is helpful for fault-tolerant storage and high-throughput data processing.
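To make the two-step model concrete, here is a minimal word-count sketch in plain Python. It is illustrative only: a real Hadoop job would be written against the Java MapReduce or Hadoop Streaming APIs, and the framework, not this script, would distribute the map and reduce work across the worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data tools", "big data analytics"]
# The master node would split `docs` into subproblems for the worker nodes;
# here the map step simply runs over each document locally.
mapped = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```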
Apache Mahout
- It aims to provide scalable, commercial machine learning techniques for large-scale and intelligent data analysis applications.
- These include clustering, classification, pattern mining, regression, dimensionality reduction, and evolutionary algorithms (a clustering sketch follows this list).
- Goal
To build a vibrant, responsive, diverse community to facilitate discussions on the project and potential use cases.
- Objective
To provide a tool for alleviating big challenges.
- Companies that have implemented scalable machine learning algorithms include Google, IBM, Amazon, Yahoo, Twitter, and Facebook.
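As a small illustration of one of these algorithm families, the following is a toy k-means clustering sketch in plain Python. It is not Mahout's API (Mahout itself is a Java/Scala library); it only shows the kind of computation that Mahout scales to large datasets.

```python
import random

def kmeans(points, k, iterations=10):
    # Pick k initial centroids at random from the data.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: attach each point to its nearest centroid.
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
print(centroids, clusters)
```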
Apache Spark
- It is an open-source Big Data processing framework built for speed and sophisticated analytics.
- Spark lets us quickly write applications in Java, Scala, or Python.
- It supports SQL queries, streaming data, machine learning and graph data processing.
- It consists of three components,
- Driver Program
- Cluster Manager
- Worker Nodes
- The Driver Program serves as the starting point of execution.
- The Cluster Manager allocates the resources, and the worker nodes perform the data processing in the form of tasks (a PySpark sketch follows this list).
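A minimal PySpark word count shows how these components cooperate. It assumes the `pyspark` package and a local Spark runtime are available; on a real cluster the master URL would point at a cluster manager such as YARN instead of local mode.

```python
from pyspark.sql import SparkSession

# The driver program is the starting point: it creates the session.
spark = (SparkSession.builder
         .appName("WordCount")
         .master("local[*]")   # local mode; an assumption for this sketch
         .getOrCreate())

lines = spark.sparkContext.parallelize(["big data tools", "big data analytics"])

# These transformations are executed as tasks on the worker nodes.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('big', 2), ('data', 2), ('tools', 1), ('analytics', 1)]
spark.stop()
```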
Dryad
- It is another popular programming model for implementing parallel and distributed programs that handle large amounts of data, based on a dataflow graph.
- It consists of a cluster of computing nodes.
- A Dryad user can use thousands of machines, each with multiple processors or cores.
- Its advantage is that users do not need to know anything about concurrent programming.
- It provides a large number of features, including generating the job graph, scheduling the available machines, handling transient failures in the cluster, and collecting performance metrics (a toy dataflow-graph sketch follows this list).
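The dataflow-graph idea can be sketched in a few lines of plain Python. This is not Dryad's actual API (Dryad jobs are written against its C++/.NET libraries); it only illustrates a job expressed as vertices connected by data channels, which Dryad would schedule across the machines of a cluster.

```python
# Vertices are computations; edges are the data channels between them.
vertices = {
    "read":   lambda _: [1, 2, 3, 4],           # source vertex
    "square": lambda xs: [x * x for x in xs],   # transformation vertex
    "sum":    lambda xs: [sum(xs)],             # aggregation vertex
}
edges = [("read", "square"), ("square", "sum")]

def run(vertices, edges):
    # Execute the vertices in the order implied by the edges (a simple chain here);
    # Dryad would instead schedule them on available cluster machines.
    data = None
    for name in ["read"] + [dst for _, dst in edges]:
        data = vertices[name](data)
    return data

print(run(vertices, edges))  # [30]
```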
Storm
- It is a distributed and fault-tolerant real-time computation system for processing large volumes of streaming data.
- It is especially designed for real-time processing, in contrast to Hadoop, which is designed for batch processing.
- It is also easy to set up and operate, and it is fault tolerant while providing competitive performance.
- A Storm cluster is similar to a Hadoop cluster.
- On a Storm cluster, users run different topologies for different Storm tasks (a spout/bolt sketch follows this list).
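The topology idea can be illustrated with a plain-Python sketch of a spout feeding two bolts. This is not the real Storm API (topologies are normally written in Java, or in Python through a binding such as streamparse); it only shows how tuples flow through a topology.

```python
def sentence_spout():
    # Spout: emits a stream of tuples (finite here only for the demo).
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence

def split_bolt(stream):
    # Bolt: splits each sentence tuple into word tuples.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: keeps running word counts as tuples arrive.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire the topology: spout -> split bolt -> count bolt.
for word, count in count_bolt(split_bolt(sentence_spout())):
    print(word, count)
```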
Apache Drill
- It is another distributed system for interactive analysis of big data.
- It has more flexibility to support many types of query languages, data formats and data sources.
- It is especially designed to exploit nested data.
Objective - To scale up to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.
- Drill uses HDFS (Hadoop Distributed File System) for storage and MapReduce to perform batch analysis (a nested-data sketch follows this list).
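The kind of nested-data query Drill executes can be sketched in plain Python. In Drill itself this would be a SQL query (for example, one using Drill's FLATTEN function over a JSON file); the records below are made up for illustration.

```python
# Nested JSON-like records: each customer has a repeated "orders" field.
records = [
    {"name": "alice", "orders": [{"amount": 10}, {"amount": 25}]},
    {"name": "bob",   "orders": [{"amount": 5}]},
]

# Flatten the nested orders and aggregate the total amount per customer.
totals = {}
for rec in records:
    for order in rec["orders"]:
        totals[rec["name"]] = totals.get(rec["name"], 0) + order["amount"]

print(totals)  # {'alice': 35, 'bob': 5}
```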
Splunk
- In recent years, a great deal of data has been generated by machines in business and industry.
- It is a real-time and intelligent platform developed for exploiting machine-generated big data.
- It combines up-to-the-moment cloud technologies with Big Data.
- It helps users search, monitor, and analyze their machine-generated data through a web interface.
Web interface
The results are presented in intuitive forms such as graphs, reports, and alerts.
Objective
The objective of Splunk is to provide metrics for many applications, diagnose problems for systems and IT infrastructure, and give intelligent support for business operations.
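A plain-Python sketch of this search-monitor-alert workflow follows. In Splunk the same thing would be done with a search in the web interface; the log lines and the error threshold below are made-up examples.

```python
import re
from collections import Counter

# Made-up machine-generated events.
logs = [
    "2024-01-01T10:00:00 ERROR disk full on node-3",
    "2024-01-01T10:00:05 INFO  job started",
    "2024-01-01T10:00:09 ERROR disk full on node-3",
]

# "Search": extract the severity field from each event and count by value.
severities = Counter(re.search(r"\b(ERROR|WARN|INFO)\b", line).group(1) for line in logs)
print(severities)

# "Alert": report when the error count crosses a threshold, as a dashboard alert would.
if severities["ERROR"] >= 2:
    print("ALERT: error threshold exceeded")
```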
Dremel
- It is a scalable, interactive ad-hoc query system for analysis of read-only nested data.
- By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds.
- The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google.
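The benefit of the columnar layout can be sketched in plain Python: an aggregation query only needs to scan the single column it touches rather than whole records. This is a toy illustration, not Dremel's implementation.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "US", "latency_ms": 120},
    {"user": "b", "country": "DE", "latency_ms": 80},
    {"user": "c", "country": "US", "latency_ms": 95},
]

# Column-oriented layout: each field is stored as its own array.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregation query (average latency) reads just one column.
latencies = columns["latency_ms"]
print(sum(latencies) / len(latencies))  # ~98.3
```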