Apache Spark is a widely used framework that has reshaped big data analysis and processing.
Data is everywhere: posts and messages from Facebook, Twitter and other sources, all of which need to be processed as quickly as possible.
Apache Spark provides the programmer with a programming interface built around the RDD (Resilient Distributed Dataset). Spark was designed after Hadoop's success in the market, and every new technology comes with some limitations. Hadoop's Map/Reduce model has several, which Spark resolves: a Map/Reduce program reads data from disk, maps a function across it and reduces the results of the map, whereas Spark's RDDs act as a working set for distributed programs, offering a restricted form of distributed shared memory.
A Map/Reduce program can certainly process the data, but the time it takes is unacceptable in many situations. Map/Reduce also fits only certain use cases, so something custom has to be written for each particular case. Apache Spark was therefore designed to work better: fast, easy to use and suitable for general-purpose workloads.
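To make the comparison concrete, here is a minimal word count in Scala (Spark's native API) showing the map and reduce phases expressed as RDD operations. The input path `input.txt` is a placeholder, and running in `local[*]` mode is purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")        // local mode, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; replace with a real file.
    val lines = sc.textFile("input.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))  // "map" phase: split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)        // "reduce" phase: sum the counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

Unlike a chain of Map/Reduce jobs, the intermediate RDDs here can be kept in memory across steps, which is where much of Spark's speedup comes from.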
Spark can run on a Hadoop cluster just as Hadoop does, or on other cluster managers such as Mesos. On top of the core engine sit several Spark libraries, such as Spark Streaming and GraphX.
So you must be wondering: why would we use it, and for what?
Spark's features are similar to those of Map/Reduce: it works on parallel distributed processing, it is scalable, and so on.
Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports its own standalone manager, Hadoop YARN and Mesos; for distributed storage, it can interface with a wide variety of systems, including HDFS.
It also supports a pseudo-distributed local mode.
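As a sketch of how the deployment choice shows up in code, the master URL passed to the session builder selects the cluster manager. The host names and HDFS path below are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// "local[*]"              = pseudo-distributed local mode
// "yarn"                  = Hadoop YARN
// "spark://master:7077"   = Spark standalone cluster (hypothetical host)
val spark = SparkSession.builder()
  .appName("DeployModes")
  .master("local[*]")      // swap for "yarn" or "spark://master:7077"
  .getOrCreate()

// Reading from distributed storage works the same regardless of the
// cluster manager; this HDFS URL is a placeholder.
val lines = spark.read.textFile("hdfs://namenode:8020/data/sample.txt")
println(lines.count())
```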
One of the reasons Apache Spark has become popular is that it gives data engineers and data scientists a powerful engine that is both fast and easy to use. For in-memory workloads it can run up to 100x faster than Hadoop Map/Reduce.
This lets data practitioners solve their machine learning, graph computation and streaming query processing problems interactively and at a much greater scale.
Like an RDD, a DataFrame is a distributed collection of data, but it is organized into named columns like a table in a relational database. It is designed to make processing large-scale data easier.
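A minimal sketch of the table-like DataFrame API, using a small in-memory dataset with illustrative column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._   // enables toDF and the $"col" syntax

// A tiny DataFrame with made-up rows; columns behave like a relational table.
val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")

// Relational-style operations: filter rows, project columns.
people.filter($"age" > 30).select("name").show()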
Spark consists of the following components:
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib (Machine Learning Library)
- GraphX
Spark Streaming is a stream-analysis component. It ingests data in mini batches and performs RDD transformations on those mini batches. It can consume data from sources such as Flume and Twitter and apply the full set of Spark operations to it.
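A minimal Spark Streaming sketch using the classic DStream API, which processes 5-second mini batches as RDDs. It assumes a text socket source on localhost:9999 (e.g. started with `nc -lk 9999`); the source and port are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one for the receiver, one for processing.
val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))  // 5-second mini batches

// Hypothetical socket source; each mini batch becomes an RDD of lines.
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // running word count per mini batch
counts.print()

ssc.start()
ssc.awaitTermination()
```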
Spark SQL is designed to let you work with Spark via SQL or HiveQL. SQL queries are embedded directly in the Spark program to process the data.
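A short sketch of embedding SQL in a Spark program; the `people` table and its columns are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
people.createOrReplaceTempView("people")  // register so SQL can see it

// SQL ingested directly into the Spark program:
spark.sql("SELECT name FROM people WHERE age > 30").show()
```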
MLlib is Spark's Machine Learning Library; it provides common machine learning algorithms such as classification, regression and clustering.
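As a small example of what MLlib offers, here is a linear regression using the spark.ml API; the training data is synthetic and purely illustrative.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MLlibDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny synthetic dataset (label, features); values are illustrative only.
val training = Seq(
  (1.0, Vectors.dense(0.0)),
  (2.0, Vectors.dense(1.0)),
  (3.0, Vectors.dense(2.0))
).toDF("label", "features")

// Fit a linear model distributed across the cluster.
val model = new LinearRegression().fit(training)
println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```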
GraphX is a graph-processing component designed for building and manipulating graphs and performing graph computation.
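A minimal GraphX sketch that builds a small property graph and runs one of its built-in algorithms (PageRank); the vertices and edges are made up for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("GraphXDemo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Vertices (id, name) and directed edges; data is illustrative.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// A built-in graph algorithm: PageRank, run to a convergence tolerance of 0.001.
graph.pageRank(0.001).vertices.collect().foreach(println)
```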
This has been an overview of Spark, which arrived and addressed many of the limitations of Hadoop.