In Big Data, Hadoop components such as Hive (a SQL construct), Pig (a scripting construct), and MapReduce (Java programming) are used to perform all the data transformations and aggregations. With Apache Spark, the same can now be achieved with additional advantages: a unified API, support for multiple languages, and performance 10x-100x faster than MapReduce. Spark provides a single platform with SQL, scripting, and programming constructs.
Big Data (Setting up the context)
The amount of data has grown considerably in recent years due to the growth of social networking, education, surveillance cameras, healthcare, business, satellite images, manufacturing, online purchasing, research analysis, banking, bioinformatics, the Internet of Things, criminal investigation, media, information technology, etc. This huge volume of data has created a new field of data processing called Big Data.
Data can be private or public.
- Private data includes:
  - Surveys/questionnaires
  - Clicks
  - Messages
  - Transactions
  - Page views
  - Purchases
- Public data includes:
  - Tweets
  - Blogs
  - Reports
  - Comments
  - Reviews
So, to do something meaningful with the data, we have to convert unstructured data, which is messy and semantically complex, into structured data, which is clean and easy to consume. This is called data processing.
Data Processing Tasks
- Parsing fields from text (see the short sketch after this list).
- Accounting for missing values.
- Identifying and investigating anomalies.
- Summarizing using tables and charts.
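As an illustration of the first two tasks, here is a minimal Python sketch that parses comma-separated records and accounts for missing values. The record layout (user_id, age, country) is a made-up example, not from any particular dataset.

```python
# Hypothetical record layout: user_id,age,country
def parse_record(line):
    """Parse one comma-separated line into a clean, structured dict."""
    user_id, age, country = line.strip().split(",")
    return {
        "user_id": user_id,
        "age": int(age) if age else None,  # account for a missing age
        "country": country or "unknown",   # account for a missing country
    }

raw_lines = ["u1,34,IN", "u2,,US", "u3,27,"]
records = [parse_record(line) for line in raw_lines]
print(records)
# [{'user_id': 'u1', 'age': 34, 'country': 'IN'},
#  {'user_id': 'u2', 'age': None, 'country': 'US'},
#  {'user_id': 'u3', 'age': 27, 'country': 'unknown'}]
```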
The complexity of data can be measured by its messiness and by how quickly it scales, as the tiers below illustrate.
- Spreadsheets
  - Low data collection frequency.
  - 10-100s of rows per day.
  - Sometimes, it involves manual data collection.
  - Many files.
- Database
  - High frequency of collection.
  - 100k rows per day.
  - Programmatically collected.
  - ACID guarantees (Atomicity, Consistency, Isolation, Durability).
- Distributed Computing
  - A very high frequency of data collection.
  - Millions/billions of rows per day.
  - Files are stored across a cluster of machines.
  - Many, many files (web pages, log files).
Tools for Data Processing
Apache Spark
Apache Spark is an open-source, lightning-fast cluster-computing framework. It is a unified engine for data processing and analytics.
Features
- Speed.
- Support for multiple languages.
- Advanced analytics.
Characteristics
- General Purpose
  - Exploring data.
  - Cleaning and preparing data.
  - Applying machine learning.
  - Building data applications.
- Interactive Environment
  - Provides a REPL (Read-Evaluate-Print Loop); see the shell sketch after this list.
  - Fast feedback.
- Distributed Computing
  - Processes data across a cluster of machines.
  - Integrates with Hadoop.
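As a quick illustration of the interactive environment, here is what a short PySpark shell session might look like. This assumes Spark is installed and the `pyspark` command is available; the shell creates the SparkContext `sc` for you.

```python
# Started from a terminal with: pyspark
# The shell is a REPL: each expression is read, evaluated, and printed.
>>> numbers = sc.parallelize(range(1, 1001))       # distribute 1..1000 across the cluster
>>> numbers.filter(lambda n: n % 2 == 0).count()   # fast feedback on a query
500
```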
In order to work with Spark, we have to use Spark's APIs, along with a storage system and a cluster manager:
- Spark's APIs. Almost all data is processed using a specific data structure called the RDD (Resilient Distributed Dataset); a word-count sketch using RDDs follows this list.
  - RDDs are the main programming abstraction in Spark.
  - With RDDs, you can interact with billions of rows of data.
  - RDDs are in-memory collections of objects.
- A storage system that stores the data to be processed.
- A cluster manager that helps Spark run tasks across a cluster of machines.
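To make the RDD abstraction concrete, here is a minimal word-count sketch using the RDD API. It is a standalone script rather than a shell session, and the input path `data.txt` is a placeholder.

```python
from pyspark import SparkContext

# Run locally with the built-in cluster manager; "local[*]" uses all cores.
sc = SparkContext("local[*]", "WordCount")

# Load a text file from the local file system into an RDD of lines.
lines = sc.textFile("data.txt")  # placeholder path

# Split lines into words, pair each word with 1, then sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # a small sample of (word, count) pairs
sc.stop()
```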
Components of Spark
- Spark Core: the computing engine.
- Storage system: stores the data to be processed. You can use a local file system or HDFS.
- Cluster manager: helps Spark run tasks across a cluster of machines. You can use Spark's built-in cluster manager or YARN (Yet Another Resource Negotiator).
The storage system and the cluster manager are both plug-and-play components, as the configuration sketch below illustrates.
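To illustrate the plug-and-play idea, the same processing code can point at a different storage system or cluster manager purely through configuration. In this sketch, the file paths and the HDFS host are placeholders.

```python
from pyspark import SparkConf, SparkContext

# Built-in local cluster manager + local file system.
conf = SparkConf().setAppName("PlugAndPlayDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
lines = sc.textFile("file:///tmp/data.txt")  # placeholder local path
print(lines.count())
sc.stop()

# Swapping components changes only configuration, not the processing code:
# run the same script via `spark-submit --master yarn ...` to use YARN, and
# read from HDFS instead of the local disk, e.g.:
#   lines = sc.textFile("hdfs://namenode:8020/data/data.txt")  # placeholder host/path
```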
Apache Spark Ecosystem
In the next article, we will discuss RDDs in more detail and learn how to load a dataset.