- Introduction
- Install .NET Core
- Install Prerequisites
- Install Java SDK
- Install 7-Zip
- Install Apache Spark
- Install .NET for Apache Spark
- Create your first application using Apache Spark
Introduction
Nowadays we deal with enormous amounts of data: IoT devices, mobile phones, home appliances, wearable devices, and more are connected through the internet, and the volume, velocity, and variety of data increases day by day. At some point we need to analyze this data, present it in a readable format, or use it to make important business decisions. There are many tools and frameworks on the market for analyzing terabytes of data, and one of the most popular data analysis frameworks is Apache Spark.
What is Apache Spark?
Apache Spark is an open-source, general-purpose, distributed data analytics engine for large datasets. It can be used for big data processing and machine learning.
Why do we use it?
Apache Spark is a fast, robust, and scalable data processing engine for big data. In many cases it is faster than Hadoop. You can use it with Java, R, Python, SQL, and now with .NET.
Components of Apache Spark
Figure 1
Install .NET Core
- Open the link here.
- Download the SDK and install it
- Open the command prompt and type ‘dotnet’ to verify that .NET Core installed successfully
Figure 2
Install Prerequisites
A. Install Java SDK
- Open the link here
- Download and Install the Java SDK
- Set the environment variable
- Open the command prompt and type ‘java -version’ to verify that Java installed successfully
Figure 3
B. Install 7-Zip
- Open the link here.
- Download and install 7-Zip
Install Apache Spark
- Open the link here.
- Download the latest stable version of Apache Spark and extract the downloaded archive using 7-Zip
- Place the extracted folder in C:\bin
- Set the environment variables:
setx HADOOP_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\
setx SPARK_HOME C:\bin\spark-2.4.1-bin-hadoop2.7\
Figure 4
To verify the successful installation, open the command prompt and run the following command:
%SPARK_HOME%\bin\spark-submit --version
Figure 5
Install .NET for Apache Spark
- Open the link here.
- Download the latest stable version of .NET for Apache Spark and extract the downloaded archive using 7-Zip
- Place the extracted folder in C:\bin
- Set the environment variable:
setx DOTNET_WORKER_DIR "C:\bin\Microsoft.Spark.Worker-0.6.0"
Also download winutils.exe:
https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
Once it finishes downloading, copy it to C:\bin\spark-2.4.1-bin-hadoop2.7\bin
Create your first application using Apache Spark
Open the command prompt and type the following command to create a console application:
‘dotnet new console -o MyFirstApacheSparkApp’
Figure 6
Once the application is created successfully, type:
‘cd MyFirstApacheSparkApp’
and hit enter. To use Apache Spark in a .NET application, we need to install the Microsoft.Spark package:
‘dotnet add package Microsoft.Spark’
Figure 7
Once the package installs successfully, open the project in Visual Studio Code.
Figure 8
Figure 9
Add a data file ‘data.txt’ to the application, containing the following text to process: “Betty Botter bought some butter, but the butter, it was bitter. If she put it in her batter, it would make her batter bitter, but a bit of better butter, that would make her batter better.”
Figure 10
Now open Program.cs and add the Microsoft.Spark.Sql namespace, which contains all the necessary classes.
using Microsoft.Spark.Sql;
Create a const variable to hold the path of the data.txt file.
private const string Paths = "data.txt";
The SparkSession class allows you to create a Spark session; you need to pass an app name when creating the session so it can be identified and reused later.
// Creating a Spark session here
Builder builder = SparkSession.Builder();
var spark = builder.AppName("spark_word_count").GetOrCreate();
In the next step we create a DataFrame that reads the data from the file at the given path; the DataFrame holds the data to be processed.
// Creating initial DataFrame here
DataFrame dataFrame = spark.Read().Text(Paths);
Now let’s write the code for counting the words of text.
// Count words
var words = dataFrame.Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
.Select(Functions.Explode(Functions.Col("words"))
.Alias("word"))
.GroupBy("word")
.Count()
.OrderBy(Functions.Col("count").Desc());
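To see what this query computes, the same word count can be sketched with plain LINQ over the sample text. This is an illustrative stand-alone check, not part of the tutorial's Spark pipeline; it mirrors the Split/Explode/GroupBy/Count/OrderBy steps above:

```csharp
using System;
using System.Linq;

class WordCountSketch
{
    static void Main()
    {
        // Same sample text as data.txt
        string text = "Betty Botter bought some butter, but the butter, it was bitter. " +
                      "If she put it in her batter, it would make her batter bitter, " +
                      "but a bit of better butter, that would make her batter better.";

        // Split on spaces, group identical words, and sort by count descending.
        // Like the Spark query, punctuation stays attached, so "butter," and
        // "butter." are counted as different words.
        var counts = text.Split(' ')
            .GroupBy(word => word)
            .OrderByDescending(group => group.Count());

        foreach (var group in counts)
            Console.WriteLine($"{group.Key}: {group.Count()}");
    }
}
```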
To display the result, use the Show() method.
// results
words.Show();
After completing the operation we need to stop the Spark session; for this we use the Stop() method.
// Stop Spark session here
spark.Stop();
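Putting the fragments together, the full Program.cs looks roughly like the following. This is a sketch assuming the Microsoft.Spark package version used in this article; it runs only under spark-submit (see the next step), not as a plain console app:

```csharp
using Microsoft.Spark.Sql;

namespace MyFirstApacheSparkApp
{
    class Program
    {
        // Path of the input text file
        private const string Paths = "data.txt";

        static void Main(string[] args)
        {
            // Creating a Spark session
            Builder builder = SparkSession.Builder();
            var spark = builder.AppName("spark_word_count").GetOrCreate();

            // Creating the initial DataFrame from the text file
            DataFrame dataFrame = spark.Read().Text(Paths);

            // Split each line into words, count occurrences of each word,
            // and sort by count in descending order
            var words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Display the results
            words.Show();

            // Stop the Spark session
            spark.Stop();
        }
    }
}
```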
Now build the project with the following command:
“dotnet build”
Figure 11
Run the command below to show the result; it takes several parameters, including some of the environment variables we set earlier.
%SPARK_HOME%\bin\spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local bin\Debug\netcoreapp3.1\microsoft-spark-2.4.x-0.8.0.jar dotnet bin\Debug\netcoreapp3.1\MyFirstApacheSparkApp.dll
Figure 12
Output
Figure 13
Conclusion
Apache Spark can now be used with .NET. Spark for .NET gives more flexibility to developers who are more comfortable with C# and F#.
The aim of .NET for Apache Spark is to make the Spark APIs accessible from your applications, i.e., ones written in .NET.