Data Science  

Encoding of Variables in Machine Learning

Introduction

Encoding categorical variables is essential for machine learning, as many algorithms require numerical input. When we train on a dataset, the features should be in numerical form; any data that comes in categorical form must first be converted to numbers, and the encoded values can be mapped back to the original categories afterwards if needed. There are many techniques for encoding.

Step 1. First of all, we will do the basic work: importing the required libraries and reading the .csv files. After this, we will start the encoding process.

Importing required libraries

The sizes of our train and test datasets are:

  • the train dataset has 300,000 rows and 25 columns
  • the test dataset has 200,000 rows and 24 columns

I have downloaded this dataset from Kaggle, so the data is already clean. First of all, we will read the data files and check their sizes.
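
A minimal sketch of this step is shown below; the file names train.csv and test.csv are assumptions based on the usual Kaggle layout, not something specified above.

    import pandas as pd

    # Read the training and test files (assumed file names)
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Verify the sizes: train should be (300000, 25), test (200000, 24)
    print(train.shape)
    print(test.shape)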

Step 2. We will now show the info of the dataset and display the top 5 rows of data using head().

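As a sketch, this inspection can be done directly with pandas:

    train.info()          # column names, dtypes, and non-null counts
    print(train.head())   # first five rows of the data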

Step 3. Defining the train features and the target.

Train and target
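
A minimal sketch of this step; the column names "id" and "target" are assumptions about this Kaggle dataset, not something stated above.

    # Drop the identifier and the label to keep only the feature columns,
    # and keep the label column as the target ("id"/"target" are assumed names)
    X = train.drop(["id", "target"], axis=1)
    y = train["target"]

    print(X.shape, y.shape)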

Before getting into encoding, I will just briefly introduce you to the types of data variables present in this data:

  • Binary data: A binary variable takes only two values, 0 or 1.
  • Categorical data: A categorical variable can take only a limited number of values. For example, the day of the week can be one of 1, 2, 3, 4, 5, 6, 7 only.
  • Ordinal data: An ordinal variable is a categorical variable that has some order associated with it, for example, the ratings that users give to a movie.
  • Nominal data: A nominal variable has no numerical importance or ordering, such as occupation or a person's name.
  • Timeseries data: Time series data has a temporal value attached to it, such as a date or a timestamp, which lets you look for trends over time.

There are many techniques for encoding, some of which are:

Method 1. Label encoding

In this method, we map every categorical value to a number; that is, each category is substituted by an integer. For example, we might substitute 1 for Grandmaster, 2 for Master, 3 for Expert, and so on. To implement this, we first import LabelEncoder from the sklearn module.

Label encoding
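
A minimal sketch of label encoding, reusing the feature frame X defined above; encoding column by column with astype(str) is an assumption of this sketch, not the only way to do it.

    from sklearn.preprocessing import LabelEncoder

    X_label = X.copy()
    for col in X_label.columns:
        le = LabelEncoder()
        # astype(str) guards against mixed types or missing values
        X_label[col] = le.fit_transform(X_label[col].astype(str))

    print(X_label.shape)   # the number of columns does not change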

Here you can see the label-encoded output of the train data. We will now check the shape of the train data and verify that the number of columns has not changed.

Method 2. One-hot encoding

Our second method is to encode each category as a one-hot vector. One-hot encoding (OHE) is a representation that turns each category value into a binary vector whose length equals the number of distinct categories, with every position set to zero except the one corresponding to that category.

The pandas get_dummies() function produces the output as a pandas DataFrame. Alternatively, we can use the OneHotEncoder class available in sklearn to convert our data to one-hot encoded data. This method produces a sparse matrix instead, which has the advantage of using very little memory and CPU. To do that, we need to:

  • Import OneHotEncoder from sklean.preprocessing
  • Initialize the OneHotEncoder
  • Fit and then transform our data

Encoder
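
Below is a minimal sketch of both routes, again reusing X; the handle_unknown="ignore" setting is an assumption added for safety, not something required by the method.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Route 1: pandas get_dummies returns a dense DataFrame
    X_ohe_df = pd.get_dummies(X, columns=X.columns)

    # Route 2: sklearn's OneHotEncoder returns a sparse matrix,
    # which is far lighter on memory for wide categorical data
    ohe = OneHotEncoder(handle_unknown="ignore")   # initialize the encoder
    X_ohe_sparse = ohe.fit_transform(X)            # fit and then transform our data

    print(X_ohe_df.shape)
    print(X_ohe_sparse.shape)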

These two are the most commonly used encoding methods, with one-hot encoding being the most popular. Apart from these, there are some other encoding methods, such as:

  • Feature Hashing
  • Encoding Categories with Dataset Statistics
  • Target Encoding
  • K-Fold Target Encoding

Summary

Here you can see a summary of our model's performance with each of the encoding techniques we used. It is clear that OneHotEncoder yielded the highest accuracy.

Encoding          Score    Wall time
Label Encoding    0.692    973 ms
OneHotEncoder     0.759    1.84 s
