Introduction
In this article, I am going to describe what machine learning and data science are, and then we will see how machine learning techniques can be applied in data science. We will also work through a practical implementation of data pre-processing in R.
This article is for anyone who wants to learn about data science, especially newcomers who wish to start a career in the field.
So, let’s start with machine learning.
Machine learning is a branch of Artificial Intelligence (AI) that gives a system the ability to learn from previous experience and to solve problems automatically, using certain sets of rules (algorithms).
Machine learning has many sub-branches and concepts, along with a wide variety of algorithms, which we will discuss and apply in future articles.
Now, let’s jump into Data Science.
In simple words, data science is the science of processing data and transforming it into knowledge. It is a process of handling the data, deriving insight from it, and presenting the results in different types of reporting formats. Statistical modeling and data visualization are among the most important parts of Data Science.
So how does machine learning aid in Data Science?
Machine Learning (ML) gives a huge boost to Data Science: ML algorithms let us learn patterns from the processed data and produce predictions, which can help shape future plans and strategies for a business or any other domain.
Now, let’s get started. In this first part, I will explain some data preprocessing steps and show their implementation in R.
Data Pre-processing
Whenever you work with data, you have to preprocess it first. In simple terms, you have to clean the data and make it ready for analysis. The preprocessing steps we follow here are as follows:
- Handling Missing Data
- Categorize the data
- Split the data
- Feature scaling
In this part, I am going to show you how to handle missing values in our data, which is the first of these pre-processing steps.
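For orientation, here is a minimal sketch of what the other three steps in the list can look like in base R. The `employees` data frame, its column names, and the 75/25 split ratio are all assumptions made up for illustration; they are not the article's dataset.

```r
# Hypothetical toy data; column names and values are assumptions.
employees <- data.frame(
  Department = c("HR", "IT", "Sales", "IT", "HR", "Sales", "IT", "HR"),
  Age        = c(25, 32, 41, 29, 38, 45, 27, 35),
  Salary     = c(30000, 52000, 61000, 48000, 45000, 70000, 50000, 43000),
  stringsAsFactors = FALSE
)

# Categorize the data: turn a text column into a factor.
employees$Department <- factor(employees$Department)

# Split the data: 75% of the rows for training, the rest for testing.
set.seed(123)  # fixed seed so the split is reproducible
train_rows   <- sample(nrow(employees), size = floor(0.75 * nrow(employees)))
training_set <- employees[train_rows, ]
test_set     <- employees[-train_rows, ]

# Feature scaling: standardize the numeric columns, reusing the
# training-set mean and standard deviation for the test set.
num_cols     <- c("Age", "Salary")
train_scaled <- scale(training_set[, num_cols])
training_set[, num_cols] <- train_scaled
test_set[, num_cols] <- scale(test_set[, num_cols],
                              center = attr(train_scaled, "scaled:center"),
                              scale  = attr(train_scaled, "scaled:scale"))
```

Scaling the test set with the training-set statistics is one common design choice; it keeps information about the test rows out of the preparation of the training data.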
We have a data set of 10 employees.
Here, we can easily see that the age of employee 1007 and the salary of employee 1005 are missing. To fill in missing data, we can use the mean, median, or mode strategy; here, we use the mean. We simply take the mean of the non-missing values in the column where data is missing and put that value in the empty cell.
To do this, I wrote some R code that computes the mean values and fills them in. The code snippet is given below.
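The three strategies can be compared on a small vector. This is only a sketch: the `age` values are made up, and `most_frequent` is a hypothetical helper I define here, because base R's `mode()` returns the storage type of an object rather than the statistical mode.

```r
# Hypothetical ages with one missing value (not the article's dataset).
age <- c(25, 30, 30, NA, 41, 52)

# mean() and median() need na.rm = TRUE to skip the missing entry.
mean_age   <- mean(age, na.rm = TRUE)
median_age <- median(age, na.rm = TRUE)

# Hypothetical helper: the most frequent non-missing value.
most_frequent <- function(x) {
  x <- x[!is.na(x)]
  values <- unique(x)
  values[which.max(tabulate(match(x, values)))]
}
mode_age <- most_frequent(age)

# Fill the gap with the chosen statistic (the mean, as in this article).
age_filled <- ifelse(is.na(age), mean_age, age)
```

The median is less sensitive to extreme values than the mean, and the mode is usually the choice for categorical columns, so which strategy fits depends on the column.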
First, import the dataset.
dataset <- read.csv('dataset.csv')
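The code below replaces each missing value in the Age and Salary columns with the mean of that column.
# ave() without a grouping variable returns the overall column mean for every row;
# na.rm = TRUE makes mean() ignore the missing values.
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)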
After running the above lines of code, the outcome will be the following.
The highlighted values are the column means that replaced the missing entries.
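If you do not have dataset.csv at hand, the same idea can be tried on a stand-in data frame. Everything in this sketch is an assumption: the column names follow the ones used in the article, but the values are invented, with the gaps placed at employees 1007 and 1005 to mirror the example. It calls `mean()` directly, which gives the same result as `ave()` when there is no grouping variable.

```r
# Hypothetical stand-in for dataset.csv; all values are made up.
dataset <- data.frame(
  EmployeeID = 1001:1010,
  Age    = c(28, 34, 45, 23, 39, 51, NA, 30, 42, 36),
  Salary = c(40000, 52000, 61000, 35000, NA, 75000, 48000, 44000, 58000, 50000)
)

# Count the missing values per column before imputing.
missing_before <- colSums(is.na(dataset))
print(missing_before)

# Replace each missing value with the mean of its column.
dataset$Age <- ifelse(is.na(dataset$Age),
                      mean(dataset$Age, na.rm = TRUE),
                      dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         mean(dataset$Salary, na.rm = TRUE),
                         dataset$Salary)

# Show the two rows that were filled in.
print(dataset[dataset$EmployeeID %in% c(1005, 1007), ])
```

Printing the two affected rows is a quick way to confirm that only the missing cells changed and the rest of the data is untouched.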
Conclusion
Machine Learning and Data Science are powerful fields that help us in decision making and in predicting future trends. Here, we have just started our journey. I will explain more options and algorithms in my future articles.