Data Wrangling And Visualization In R

Ojash Shrestha
May 18, 2021


We learned the basics of the programming language R in our previous article, Basic Intro to R. Today, we’ll build on that foundation and learn about Data Visualization and Data Wrangling in R. R is used extensively for statistics and visualization, and it is a near and dear language for most statisticians. Data Wrangling helps us get appropriate data for our visualization, and visualization in turn brings meaning to our data. R supports aesthetically pleasing graphs to showcase and represent our hard work through numerous libraries, which we discuss in this article.

Data Visualization

A picture is worth a thousand words. Data Visualization represents information in graphical form so that one can easily understand the gist of the data. Humans are innately visual creatures, and visualizing data with the proper tools expresses far more meaning than the same data in tables and arrays.

Data frames and Wrangling

Data frames

A data frame is a two-dimensional, heterogeneous tabular data structure that is mutable in size, i.e., rows and columns can be added or removed.

You can learn more about data frames in R, with examples, in my previous article Basic Intro to R.
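As a minimal sketch of these properties in base R, the example below builds a small data frame with columns of different types and then grows it by one column and one row. The names and values are made up purely for illustration.

```r
# Build a small data frame whose columns have different types.
students <- data.frame(
  name  = c("Asha", "Bikram", "Chandra"),
  score = c(82, 91, 77)
)

# Add a column: the data frame grows from 2 to 3 columns.
students$passed <- students$score >= 80

# Add a row: the data frame grows from 3 to 4 rows.
students <- rbind(
  students,
  data.frame(name = "Deepa", score = 88, passed = TRUE)
)

str(students)
```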

Data Wrangling

Data Wrangling can be understood as the process of mapping or transforming raw data into a format that is appropriate and valuable for our analytics.

Some useful functions in R for Data Wrangling (all from the dplyr package):

left_join()

It adds information from another table: the result keeps all rows from the left table and adds the matching values from the right table. It is similar to VLOOKUP in Excel.
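A small sketch of a left join, assuming the dplyr package is installed. The two tables, orders and customers, are hypothetical and exist only to show what happens to a row without a match.

```r
library(dplyr)

# Left table: one row per order.
orders <- data.frame(
  order_id    = 1:4,
  customer_id = c(10, 11, 10, 12)
)

# Right table: lookup information per customer (customer 12 is missing).
customers <- data.frame(
  customer_id = c(10, 11),
  city        = c("Kathmandu", "Pokhara")
)

# Keep every row of `orders` and attach `city` where customer_id matches.
# The order from customer 12 has no match, so its city is NA.
joined <- left_join(orders, customers, by = "customer_id")
print(joined)
```

All four orders survive the join, which is the behavior that makes it comparable to a lookup.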

count()

This function counts the rows in a data frame for each unique value of one variable, or for each unique combination of multiple variables.
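For instance, using the mtcars dataset that ships with R (chosen here only as a convenient example) and assuming dplyr is installed, counting by one or by two variables might look like this:

```r
library(dplyr)

# One row per distinct number of cylinders; the count is in column `n`.
by_cyl <- count(mtcars, cyl)

# One row per observed combination of cylinders and gears.
by_cyl_gear <- count(mtcars, cyl, gear)

print(by_cyl)
print(by_cyl_gear)
```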

mutate()

This function creates new columns: it adds new variables while preserving the existing ones. The dplyr package needs to be installed and loaded in order to access this function in R.
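A short sketch with the built-in mtcars dataset, assuming dplyr is installed. The derived column names wt_kg and hp_per_kg are invented for this example.

```r
library(dplyr)

# wt is the car weight in units of 1000 lb.
cars2 <- mutate(
  mtcars,
  wt_kg     = wt * 1000 * 0.45359237,  # convert to kilograms
  hp_per_kg = hp / wt_kg               # a new column can use one defined just above
)

head(cars2[, c("wt", "hp", "wt_kg", "hp_per_kg")])
```

The original columns of mtcars are all still present in cars2; only the two new ones are added.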

group_by()

It takes an existing tbl (a generic class for tabular data that dplyr functions take as an argument) and converts it into a grouped tbl, so that subsequent operations such as summaries are performed per group.
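Grouping has no visible effect on its own; it matters once a summary is computed. The sketch below, again on mtcars and assuming dplyr is installed, pairs group_by() with summarise(), which is one common way to use it.

```r
library(dplyr)

# Group the cars by number of cylinders, then summarise each group.
grouped <- group_by(mtcars, cyl)

by_cyl <- summarise(
  grouped,
  n_cars   = n(),        # rows in the group
  mean_mpg = mean(mpg)   # average fuel economy in the group
)

print(by_cyl)
```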

Packages

library(tidyverse)

It is an R package designed for Data Science to facilitate the conversation about data between user and computer. It consists of a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that understanding one of the packages makes it convenient to learn the others.

library(ggplot2)

ggplot2 is used to achieve Data Visualization goals in the R programming language. It is a dedicated visualization package, loaded as part of the tidyverse, that helps improve the aesthetics of graphs in R.

Basic ggplot

ggplot2 works with data frames. Let’s begin with our first ggplot:

ggplot(dataframe, aes(var1,var2))

Here, aes() maps variables in the data frame to “aesthetics”, the visual properties of the plot such as x and y position, color, and size.
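To make the call above concrete, here is a sketch that substitutes the built-in mtcars dataset for dataframe, and wt and mpg for var1 and var2 (these substitutions are assumptions made for the example). Note that ggplot() on its own only sets up the axes; nothing is drawn until a layer is added.

```r
library(ggplot2)

# Map weight to the x axis and fuel economy to the y axis.
# With no layer yet, printing `p` shows empty axes only.
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Adding a layer makes the data visible.
p_points <- p + geom_point()
print(p_points)
```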

If you want to learn more about Data Visualization in R, watch this video by AI 42.

Geoms

A geom defines how a ggplot2 layer is drawn. A layer is what we add to the plot. Different geoms create different charts, such as bar charts, scatterplots, and many more.

geom_point()

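The point geom creates scatterplots, which are useful for displaying the relationship between two continuous variables.

e.g., x + geom_point(aes(size = qsec)), where x is an existing ggplot object and qsec is a column of its data frame (for example, the quarter-mile time in the built-in mtcars dataset).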
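The snippet above depends on an object x that the article does not define. A runnable version, assuming x is a base plot built from mtcars, could be:

```r
library(ggplot2)

# Assumed definition of `x`: a base plot of fuel economy against weight.
x <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each car is a point; the point size encodes the quarter-mile time (qsec).
scatter <- x + geom_point(aes(size = qsec))
print(scatter)
```

Mapping size inside aes() ties it to the data, so ggplot2 also adds a legend for qsec.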

geom_col()

It creates charts with columns, such as stacked bars, equal-size columns, dodged columns, and more.

e.g., ggplot(data = as.data.frame(Titanic), aes(x = Class, y = Freq, fill = Survived)) + geom_col() (the built-in Titanic dataset is a table, so it is converted to a data frame first)
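The difference between stacked and dodged columns can be sketched as follows. The built-in Titanic dataset is a contingency table, so this example converts it to a data frame and totals the counts per class and survival status first; that aggregation step is a choice made here so that each bar corresponds to exactly one row.

```r
library(ggplot2)

# Convert the table to a data frame with columns Class, Sex, Age, Survived, Freq.
titanic_df <- as.data.frame(Titanic)

# Total passengers for each Class / Survived combination.
totals <- aggregate(Freq ~ Class + Survived, data = titanic_df, FUN = sum)

# Default position: bars for "No" and "Yes" are stacked on top of each other.
stacked <- ggplot(totals, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col()

# position = "dodge": bars are placed side by side instead.
dodged <- ggplot(totals, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge")

print(stacked)
print(dodged)
```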

Modeling

R provides numerous modeling techniques for linear and nonlinear models, classification, clustering, time series analysis, and many other statistical tools.

Linear Model

A linear model is an equation that describes the relationship between variables with a constant slope, i.e., a constant rate of change. In statistics, the term is often used synonymously with linear regression.

The linear model can be fitted using the lm() function in R.

lm(prize ~ origin + foodLabel, data=BOO)
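The dataset BOO used above is not included with the article, so the following sketch shows the same formula pattern (response ~ predictor + predictor) on the built-in mtcars dataset instead. Treating cyl as a categorical predictor through factor() is an assumption made to mirror a model with label-like predictors.

```r
# Model fuel economy from weight (numeric) and cylinder count (categorical).
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Coefficients: an intercept, a slope for wt, and one offset
# for each cylinder level other than the baseline.
print(coef(fit))

# Standard errors, t-values, p-values, and R-squared.
print(summary(fit))
```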

Linear Regression

Linear Regression is a linear approach that models the relationship between a dependent variable and one or more explanatory variables by fitting a linear equation to the observed data.

Simple Linear Regression

Simple Linear Regression models the relationship between two variables, one dependent and one explanatory, using a straight line.

Multiple Linear Regression

Multiple linear regression is used to model the relationship between a dependent variable and two or more explanatory variables.
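The contrast between the two can be sketched with the built-in mtcars dataset; the choice of mpg, wt, and hp is only for illustration.

```r
# Simple linear regression: one explanatory variable.
simple <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: two explanatory variables.
multiple <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared is the share of variance in mpg explained by each model.
r2_simple   <- summary(simple)$r.squared
r2_multiple <- summary(multiple)$r.squared

print(c(simple = r2_simple, multiple = r2_multiple))
```

Adding a predictor never lowers R-squared on the training data, so a higher value alone does not show that the larger model is better; adjusted R-squared or a held-out test set is a fairer comparison.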

Logistic Regression

Logistic Regression estimates the parameters of the logistic model, which gives the probability of a binary outcome of the dependent variable, such as victory/loss or healthy/sick. It can be extended to more than two classes, for example to choose between different types of animals in an image. In short, logistic regression predicts the values of categorical variables.
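In base R, a binary logistic regression can be fitted with glm() and family = binomial. The sketch below uses mtcars, where am is 1 for a manual and 0 for an automatic transmission; predicting it from weight is just an illustrative choice.

```r
# Fit P(am = 1) as a logistic function of car weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a light and a heavy car.
new_cars <- data.frame(wt = c(2, 4))
p <- predict(fit, newdata = new_cars, type = "response")

print(p)
```

With type = "response" the predictions are probabilities between 0 and 1; a class label is obtained by applying a threshold such as 0.5.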

Random Forest

Random Forest is a widely used supervised learning algorithm that builds multiple decision trees and merges their predictions to produce a more accurate and stable result. The figure below shows the process of a random forest with two different trees.
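The article does not name a package for this, so the sketch below assumes the randomForest package from CRAN is installed and uses the built-in iris dataset. The number of trees and the seed are arbitrary choices.

```r
library(randomForest)

set.seed(42)  # tree building is random; fix the seed for repeatable results

# Grow 200 decision trees that predict the species from the four measurements.
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# Each tree votes; the forest returns the majority class.
pred <- predict(rf, newdata = iris)

print(rf)                         # includes the out-of-bag error estimate
print(table(pred, iris$Species))  # predictions against the true labels
```

Accuracy measured on the training data is optimistic; the out-of-bag estimate printed with the model is the more honest figure.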

T-Test

The t-test is a statistical hypothesis test in which the test statistic follows Student’s t-distribution under the null hypothesis. It is an inferential statistic used to determine whether there is a significant difference between the means of two groups, which may be related in some features.
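A two-sample t-test is available in base R as t.test(). As an illustration, the sketch below compares the mean fuel economy of automatic and manual cars in mtcars; by default R runs the Welch version, which does not assume equal variances.

```r
# Compare mean mpg between automatic (am = 0) and manual (am = 1) cars.
tt <- t.test(mpg ~ am, data = mtcars)

print(tt$estimate)  # the two group means
print(tt$p.value)   # a small value suggests the means differ
```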

Fisher’s Test

Fisher’s exact test is a statistical significance test used to analyze contingency tables. It is particularly useful for small samples, but it is valid for all sample sizes.
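Base R provides fisher.test() for this. The 2x2 table below follows the classic tea-tasting setup (eight cups, four of each kind); the counts are illustrative.

```r
# Rows: what the taster guessed; columns: how the cup was actually prepared.
tea <- matrix(
  c(3, 1, 1, 3),
  nrow = 2,
  dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea"))
)

ft <- fisher.test(tea)

print(tea)
print(ft$p.value)
```

With only eight observations a chi-squared approximation would be unreliable, which is the situation the exact test is suited to.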

Visualization in R

Histogram

A histogram visualizes the distribution of a continuous variable using bars of various heights, which show the spread and shape of the data.
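With ggplot2, a histogram might be drawn as below. The dataset (mtcars) and the number of bins are choices made for this example.

```r
library(ggplot2)

# Split the range of mpg into 10 bins and count the cars in each.
h <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

print(h)
```

The bin count changes the apparent shape, so it is worth trying a few values.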

Bar Plot

A Bar Plot visualizes data as rectangular bars whose length or height is proportional to the value it represents. It is used to compare discrete categories of data.
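As a sketch, geom_bar() in ggplot2 counts the rows in each category and draws one bar per category. Here the categories are the cylinder counts in mtcars, converted to a factor so they are treated as discrete.

```r
library(ggplot2)

# One bar per cylinder count; bar height is the number of cars.
b <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

print(b)
```

When the heights are already computed and stored in a column, geom_col() is the matching geom.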

Scatter Plot

As the name suggests, a scatter plot displays data as scattered dots, each representing the values of two numeric variables, one on the horizontal axis and one on the vertical axis.

Dot-and-Whisker Plot

A Dot-and-Whisker Plot can be used to compare the estimates of a single predictor across multiple datasets and models. The dot marks the point estimate and the whiskers mark its confidence interval.
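The dotwhisker package offers dwplot() for this kind of chart. The sketch below avoids that dependency and builds a similar plot by hand with ggplot2, comparing the coefficient of wt in two models of mtcars; the helper wt_row() is a hypothetical function written only for this example.

```r
library(ggplot2)

m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Collect the estimate and 95% confidence interval of `wt` from one model.
wt_row <- function(model, label) {
  ci <- confint(model)["wt", ]
  data.frame(
    model    = label,
    estimate = unname(coef(model)["wt"]),
    lower    = ci[[1]],
    upper    = ci[[2]]
  )
}

est <- rbind(wt_row(m1, "mpg ~ wt"), wt_row(m2, "mpg ~ wt + hp"))

# Dot = estimate, whisker = confidence interval, dashed line = no effect.
dw <- ggplot(est, aes(x = estimate, y = model, xmin = lower, xmax = upper)) +
  geom_pointrange() +
  geom_vline(xintercept = 0, linetype = "dashed")

print(dw)
```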

Conclusion

In this article, we learned about Data Visualization and Data Wrangling in R. We learned about different functions in R, various packages such as tidyverse and ggplot2, and their purposes. We also learned about Linear Regression, Logistic Regression, Random Forest, T-test, and Fisher’s Test. Finally, we looked at various chart types: Histogram, Bar Plot, Scatter Plot, and Dot-and-Whisker Plot.
