Data Visualization
Data visualization is the process of transforming large data sets into statistical and graphical representations. It is an essential task in data science and knowledge discovery because it makes data less confusing and more accessible.
Visualization condenses a large, complex body of data into charts or graphs that can be absorbed quickly and understood easily. It spares the audience from reading through large data tables and holds their interest longer.
Univariate plots: Uni means “one”, so these plots examine the properties of a single feature.
Bivariate plots: In bivariate analysis, we compare exactly two features to analyze their relationship.
Multivariate plots: When we compare data with more than two features, it is called Multivariate.
Today, Python offers many libraries and packages for various analytic techniques. Here, we will look at some of the most frequently used libraries for effective visualization.
Requirements
- Spyder IDE 3.7
- Iris data sets
Statistical overview
Importing packages and libraries
Here, I will run all demonstrations in the Spyder IDE from the Anaconda distribution, which provides advanced editing, interactive testing, debugging, and flexible analysis with less code. For more details, see the Spyder documentation.
For convenience, I will import each library under a short alias: matplotlib.pyplot as plt, pandas as pd, and seaborn as sns.
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
Here, we will use the Iris flower data set, a well-known multivariate data set available at the UCI Machine Learning Repository. Our copy has no missing or misspelled values, so we can move directly to importing it.
Let’s read the Iris data set with the help of the pandas package and store it in the “iris” variable. The syntax is “variable = package.read_function(path to the data set)”.
- #Read Data
- iris=pd.read_csv('iris.csv')
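The statement above that the data has no missing or misspelled values can be checked right after loading. The following is only a minimal sketch, not the author's code: it reads a three-row inline sample (a hypothetical stand-in for iris.csv, with column names assumed from the commands used elsewhere in this article) and counts the missing values per column.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for 'iris.csv'; the column
# names are assumed from the commands used elsewhere in this article.
sample_csv = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,iris-Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "3,6.3,3.3,6.0,2.5,Iris-virginica\n"
)
iris_sample = pd.read_csv(sample_csv)

# Count missing values in each column; all zeros means nothing is missing.
missing_per_column = iris_sample.isnull().sum()
print(missing_per_column)

# Listing the unique labels helps to spot misspelled class names.
print(iris_sample["iris-Species"].unique())
```

With the real file, the same two checks can be run on the “iris” variable directly.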
Let’s get a quick overview of the data frame in four easy steps, given below.
Step 1 - Data set shape
The “shape” attribute tells us how many observations the data frame holds. Note that it is an attribute, not a method, so it takes no parentheses.
Executing “print(iris.shape)” returns the number of rows and columns.
The Iris data set contains 150 observations in six columns, with the measurements given in centimeters.
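As a small sketch of this step, the block below builds a synthetic stand-in for the Iris data frame and reads its shape. The values are random and the name “iris_demo” is hypothetical; only the layout of 150 rows and six columns follows the description above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Iris data frame (random values, NOT the real
# measurements); only the 150-row, six-column layout is mirrored.
rng = np.random.default_rng(0)
iris_demo = pd.DataFrame({
    "Id": np.arange(1, 151),
    "SepalLengthCm": rng.normal(5.8, 0.8, 150),
    "SepalWidthCm": rng.normal(3.0, 0.4, 150),
    "PetalLengthCm": rng.normal(3.8, 1.7, 150),
    "PetalWidthCm": rng.normal(1.2, 0.7, 150),
    "iris-Species": np.repeat(
        ["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50),
})

# shape is an attribute holding (number of rows, number of columns).
n_rows, n_columns = iris_demo.shape
print(iris_demo.shape)  # (150, 6)
```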
Step 2 - Peek at the data sets
The “head()” method returns the first rows of the data frame; the number of rows is specified by the user.
- #Head of Iris, up to 15 rows
- print(iris.head(15))
Step 3 - Distribution of class
The “groupby()” method in pandas groups the data frame by class label, and “size()” then reports how many rows each class contains.
- #Size
- print(iris.groupby('iris-Species').size())
The Iris data set contains 50 instances of each of the 3 classes.
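The class counts can be obtained in two equivalent ways. The sketch below uses a hypothetical one-column frame with the same class balance (the label strings are assumed) and compares “groupby().size()” with “value_counts()”.

```python
import pandas as pd

# Hypothetical label column with the same class balance as the Iris data.
labels = pd.DataFrame({
    "iris-Species": ["Iris-setosa"] * 50
                    + ["Iris-versicolor"] * 50
                    + ["Iris-virginica"] * 50
})

# groupby().size() counts the rows that fall under each class label.
class_sizes = labels.groupby("iris-Species").size()
print(class_sizes)

# value_counts() gives the same counts directly from the column.
class_counts = labels["iris-Species"].value_counts()
print(class_counts)
```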
Step 4 - Data set summary
The “describe()” method gives a quick statistical summary of the numerical columns of a large data set, such as the min, max, and mean values.
The command goes like this:
- #Describe
- print(iris.describe())
Visualization
Let’s look at the data frame more closely with the help of pandas, Matplotlib, and seaborn.
Before moving on, I will show the parts that make up a plot by executing these commands.
- #Seaborn plot example
- sns.set_style("darkgrid")
- sns.FacetGrid(iris, hue="iris-Species", height=4) \
- .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
- .add_legend()
- plt.title('Iris Flowers')
- plt.xlabel('X-Label')
- plt.ylabel('Y-Label')
- plt.show()
- plt.title() - Sets the title of the plot.
- plt.grid() - Enables horizontal and vertical grid lines in the background of the plot.
- sns.set_style() - Sets the seaborn aesthetic style of the plot, including whether the grid is shown.
- sns.FacetGrid() - Takes the data frame as input and uses the row, column, and hue arguments to structure the grid.
- plt.xlabel() - Sets the label of the X axis.
- plt.ylabel() - Sets the label of the Y axis.
- add_legend() - A FacetGrid method that adds a legend identifying which color belongs to which class.
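The same plot parts can also be set through Matplotlib's object-oriented interface. The following is a sketch using random points (not Iris data) so that it runs on its own; the variable names are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random points used only to have something to draw (not Iris data).
rng = np.random.default_rng(1)
x = rng.normal(5.8, 0.8, 50)
y = rng.normal(3.0, 0.4, 50)

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(x, y, label="sample points")  # the data layer
ax.set_title("Iris Flowers")             # same role as plt.title()
ax.set_xlabel("X-Label")                 # same role as plt.xlabel()
ax.set_ylabel("Y-Label")                 # same role as plt.ylabel()
ax.grid(True)                            # same role as plt.grid()
ax.legend()                              # maps each color to its label
plt.show()
```

Working with the “ax” object makes it explicit which plot each call modifies, which helps once a figure holds more than one plot.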
Following the types of analysis defined earlier, we will now demonstrate various visualization techniques.
Univariate analysis
Boxplot
The seaborn library provides the boxplot, which summarizes the range of the data and gives more statistical detail from a large volume of data. It splits the records of each class at three quartiles, denoted Q1, Q2, and Q3.
- #Seaborn Boxplot
- sns.boxplot(x='iris-Species',y='SepalLengthCm',data=iris)
- plt.show()
The above commands show the Iris data set as a univariate plot. The X-axis holds the class labels and the Y-axis holds the distribution of a measurement, here sepal length. Each flower appears in a different color, with its whiskers, quartiles, and outliers.
In the boxplot, we can see how the data is spread and which outlier points belong to each flower. Only Iris virginica contains an outlier point, and Iris setosa holds the lowest values.
Each flower shows its values as quartiles, with whiskers extending toward the minimum and maximum.
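The numbers behind a box plot can be computed by hand. The sketch below uses a small made-up sample (hypothetical values, not taken from the Iris file) to compute the three quartiles and to flag outliers with the 1.5 × IQR rule, which is seaborn's default for the whiskers.

```python
import pandas as pd

# Small made-up sample (hypothetical values, not from the Iris file).
values = pd.Series([4.9, 5.6, 5.8, 6.1, 6.3, 6.4, 6.5, 6.7, 6.9, 7.9])

# Quartiles: Q1 = 25th percentile, Q2 = median, Q3 = 75th percentile.
q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# By default, points farther than 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q2, q3)
print(outliers.tolist())
```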
Distribution plot
The distribution plot generally combines a probability density function and a histogram in a single figure.
We perform this univariate analysis by executing these commands.
- #Seaborn Distribution plot
- sns.distplot( iris["SepalLengthCm"], bins=20 )
- plt.show()
The “distplot()” method takes the Iris measurements and the number of bins and shows the distribution plot with the help of the seaborn library.
In the above figure, the histogram shows the data distribution grouped into bins, and the height of each bar reflects how many sepal length observations fall in that bin.
Bar chart with count plot
The bar chart is one of the favorite and most widely used plots for understanding a data frame and seeing how much data each category holds. It is also a simple and powerful analysis method in visualization.
- #Seaborn Countplot
- sns.countplot(x='iris-Species', data=iris)
- plt.show()
The “countplot()” method counts the observations in each category of a categorical variable and shows the counts as bars.
In the above figure, we can see how many observations each Iris species contains.
Each flower has the same number of observations (50 each), as we saw with the “groupby()” method.
Violin plot
The violin plot generally performs like a combination of Boxplot and Kernel Density Estimation (KDE).
It shows the distribution of numerical (quantitative) data for each level of a categorical variable. It can hold more information than the boxplot. It can take multiple categories and show their distributions in an effective and attractive way.
- #Seaborn Violin plot
- sns.violinplot(x='iris-Species',y='SepalWidthCm',data=iris)
- plt.show()
The above code puts the class label on the X-axis and the sepal width on the Y-axis.
In the above figure, we can see where the sepal width values are most dense for each of the three Iris species. Iris setosa has its densest region at higher sepal width values than the other two species.
Bivariate Analysis
Here, we move on to comparing pairs of features from the Iris data set.
Scatter plot
The scatter plot is a two-dimensional graph mostly used to compare two variables. It shows the data as a collection of points, each positioned by its values on the horizontal and vertical axes.
- #Pandas Scatter plot
- iris.plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm',label='iris',color='red')
- plt.show()
The pandas “plot()” method with kind='scatter' takes two numerical variables from a large data set and displays a simple visualization.
Seaborn helps us to display more attractive graphical representations of a large amount of data. The entire data set is presented as a scatter plot that shows the relationship between the two variables, colored by class.
- #Seaborn Scatter plot
- sns.FacetGrid(iris, hue="iris-Species", height=5) \
- .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
- .add_legend()
- plt.show()
The “hue” argument colors the points according to the Iris species (class label).
The colored points above are drawn according to their class labels, which are listed in the legend on the right side of the figure.
In this seaborn scatter figure, we can get a clear understanding of the data distributions. Iris versicolor and Iris virginica overlap at some points in their sepal length and sepal width.
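To see what coloring by class involves, the sketch below reproduces the idea with plain Matplotlib: one scatter call per class, each with its own legend entry. This is an illustration of the concept rather than seaborn's actual implementation, and the data is synthetic (random values, not the real measurements).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the Iris frame (random values, not real data).
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "SepalLengthCm": rng.normal(5.8, 0.8, 150),
    "SepalWidthCm": rng.normal(3.0, 0.4, 150),
    "iris-Species": np.repeat(
        ["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50),
})

fig, ax = plt.subplots(figsize=(5, 5))
# One scatter call per class: each call takes the next color in the
# default color cycle and contributes its own legend entry.
for species, group in demo.groupby("iris-Species"):
    ax.scatter(group["SepalLengthCm"], group["SepalWidthCm"], label=species)
ax.set_xlabel("SepalLengthCm")
ax.set_ylabel("SepalWidthCm")
ax.legend(title="iris-Species")
plt.show()
```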
Multivariate plot
Pair plot
A pair plot allows us to visualize the distributions of all the numerical variables in a data set. It provides a better view when the data has three, four, or more dimensions. The diagonal plots show the distribution of each feature, as histograms or density curves depending on the seaborn version and arguments. Each feature in the data frame appears row-wise and column-wise, on the corresponding Y-axis and X-axis.
- #Seaborn Pair plot
- sns.pairplot(iris,hue='iris-Species',kind='reg')
- plt.show()
In the above figure, each class label gets its own color for the regression lines and the diagonal distribution plots. Here we can see that Iris setosa has quite different petal length and petal width values, which is why it is separated from the others. We get the pairwise relationships between all variables, with the univariate distributions on the diagonal axes.
Heat map
A heat map is a 2D graph that takes an entire data frame and highlights the features with high positive or negative values. It creates a grid-like plot where each tile is colored based on its value. It helps us to find the correlation coefficients between different features. It is useful for cluster analysis or when dealing with large data sets. An example is given below.
The commands go like this:
- #Seaborn Heatmap
- sns.heatmap(iris.select_dtypes('number').corr(),linewidths=0.3,vmax=1.0,square=True, linecolor='black',annot=True)
- plt.show()
The “heatmap()” method accepts arguments that control the display: here, annot=True writes each coefficient inside its tile and linecolor sets the color of the lines between tiles.
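The matrix that the heat map colors is a table of pairwise Pearson correlation coefficients. The sketch below computes it on a tiny made-up frame (hypothetical values) and cross-checks one coefficient with NumPy. Selecting the numeric columns first is a precaution, since recent pandas versions raise an error when “corr()” meets a text column such as the class label.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (hypothetical values) with one non-numeric column.
demo = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.5, 4.5, 4.7, 6.0, 5.9],
    "PetalWidthCm": [0.2, 0.2, 1.5, 1.4, 2.5, 2.1],
    "SepalWidthCm": [3.5, 3.7, 3.0, 3.2, 3.3, 3.0],
    "iris-Species": ["a", "a", "b", "b", "c", "c"],
})

# Keep only the numeric columns before computing correlations.
numeric = demo.select_dtypes(include="number")
corr = numeric.corr()  # Pearson correlation is the default method
print(corr.round(2))

# Cross-check one coefficient against NumPy's corrcoef().
r = np.corrcoef(demo["PetalLengthCm"], demo["PetalWidthCm"])[0, 1]
```

Each value lies between -1 and 1, and the diagonal is always 1 because every feature correlates perfectly with itself.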
Joint plot
The joint plot combines univariate and bivariate analysis.
It combines histograms with a scatter plot to help us find the correlation between two variables.
- #Seaborn Joint plot
- sns.jointplot(x='SepalLengthCm',y='SepalWidthCm', data=iris, kind='resid')
- plt.show()
Here, the X-axis holds the sepal length and the Y-axis holds the sepal width of the Iris species, and the joint plot is drawn with the help of the seaborn library.
In the above figure, the univariate plots at the top and right show the distributions of sepal length and sepal width, respectively. The central graph shows the relationship between the Iris sepal length and sepal width; with kind='resid', it plots the residuals of a linear regression rather than the raw points.
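The command above passes kind='resid', for which seaborn draws the residuals of a linear regression of the Y variable on the X variable in the central panel. The sketch below shows what a residual is, using NumPy and a few made-up paired measurements (hypothetical values, not from the Iris file).

```python
import numpy as np

# Made-up paired measurements (hypothetical, not from the Iris file).
sepal_length = np.array([4.9, 5.1, 5.8, 6.3, 6.7, 7.0])
sepal_width = np.array([3.0, 3.5, 2.7, 3.3, 3.1, 3.2])

# Fit sepal_width = slope * sepal_length + intercept by least squares.
slope, intercept = np.polyfit(sepal_length, sepal_width, deg=1)

# A residual is the observed value minus the value predicted by the line.
residuals = sepal_width - (slope * sepal_length + intercept)
print(residuals.round(3))
```

Residuals scattered evenly around zero suggest that a straight line describes the relationship reasonably well; to see the raw points instead, kind='scatter' can be passed to “jointplot()”.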
RadViz
In multivariate analysis, we will also do a demo with the RadViz algorithm. It places each feature of the data set uniformly around the circumference of a circle.
It is based on a spring tension minimization algorithm: each feature is an anchor on the circle, and each observation is pulled toward every anchor in proportion to its normalized value for that feature.
If the data frame contains missing or misspelled values, RadViz may issue a data warning, such as the percentage of missing values.
- #Pandas Radviz
- from pandas.plotting import radviz
- radviz(iris.drop("Id",axis=1), "iris-Species")
- plt.show()
We import “radviz()” from pandas and call it to visualize the Iris species according to their features.
The observations of the three Iris species are plotted inside the circle, and the features are placed on the circumference of the circle.
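The position of a single point inside the circle can be worked out by hand. The following is a sketch of the projection as commonly described for RadViz, not pandas' own code: features are scaled to the range 0 to 1, anchors are spaced evenly on the unit circle, and each point rests at the weighted average of the anchors. The function name “radviz_position” and the sample values are hypothetical.

```python
import numpy as np

def radviz_position(row, col_min, col_max):
    """Sketch of the RadViz projection of one sample (assumed formula)."""
    # Scale every feature to [0, 1] so that all springs are comparable.
    weights = (row - col_min) / (col_max - col_min)
    # Anchors are spaced evenly around the unit circle, one per feature.
    angles = 2 * np.pi * np.arange(len(row)) / len(row)
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # The resting point is the weighted average of the anchor positions.
    # Assumes at least one weight is non-zero.
    return (weights[:, None] * anchors).sum(axis=0) / weights.sum()

# Hypothetical 3-sample, 4-feature data set (values made up).
data = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])
col_min, col_max = data.min(axis=0), data.max(axis=0)
points = np.array([radviz_position(r, col_min, col_max) for r in data])
print(points.round(3))
```

A sample that is high on only one feature lands close to that feature's anchor, while a sample with balanced features lands near the center.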
Andrews curves
Andrews curves were introduced by David Andrews in a 1972 paper. The method, supported by pandas, preserves means, distances, and variances while representing each observation of a large data set as a curve.
- #Pandas Andrews curves
- from pandas.plotting import andrews_curves
- andrews_curves(iris.drop("Id",axis=1),'iris-Species')
- plt.show()
Pandas provides the “andrews_curves()” function, which gives a smoothed version of a parallel coordinates plot.
Each class label is drawn in a different color, which makes the visualization easier to understand.
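Each curve comes from a finite Fourier series whose coefficients are the feature values of one observation: f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + ..., for t between -π and π. The sketch below evaluates this formula with NumPy; the function name “andrews_curve” and the sample values are hypothetical, and pandas' own implementation may differ in detail.

```python
import numpy as np

def andrews_curve(sample, t):
    """Evaluate the Andrews curve of one sample at the angles t."""
    # The first feature gives the constant term x1 / sqrt(2).
    curve = np.full_like(t, sample[0] / np.sqrt(2.0), dtype=float)
    # The remaining features alternate between sin(k*t) and cos(k*t).
    for i, value in enumerate(sample[1:]):
        harmonic = i // 2 + 1
        if i % 2 == 0:
            curve += value * np.sin(harmonic * t)
        else:
            curve += value * np.cos(harmonic * t)
    return curve

t = np.linspace(-np.pi, np.pi, 200)
sample = np.array([5.1, 3.5, 1.4, 0.2])  # hypothetical flower
curve = andrews_curve(sample, t)
print(curve[:3].round(3))
```

Observations with similar feature values produce curves that stay close together, which is why the classes form visible bundles in the plot.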
Conclusion
In this article, we had a quick overview of visualization and why we use it for machine learning tasks. I hope you now understand how to do this.
References
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
https://www.kaggle.com/benhamner/python-data-visualizations
https://www.geeksforgeeks.org/plotting-graph-using-seaborn-python