Ordinal & Label Encoding in Machine Learning

Kautilya Utkarsh

Introduction

In machine learning, dealing with categorical variables is a common task. Categorical variables represent data that can take a limited, fixed number of possible values, such as colors, types of animals, or levels of education. However, most machine learning algorithms require numerical input, which means that categorical variables must be converted into numerical form before they can be used for training a model.

Understanding Ordinal Encoding

Categorical data generally comes in two types: nominal and ordinal. Nominal data can be grouped into categories but has no inherent order or rank, whereas ordinal data has a predefined, natural order. For nominal data we typically use One-Hot Encoding, and for ordinal data we prefer Ordinal Encoding.

Example. Let's look at ordinal and nominal data through an example.

Ordinal and nominal data

In the above table, both columns hold categorical values. The Review column can be mapped to a numeric ranking, such as 0 for bad, 1 for good, and 2 for excellent, so it is ordinal data. The Car_Brand column can only be split into distinct categories with no meaningful ranking, so it is nominal data.

In this technique, each ordinal categorical feature is encoded as an array of integers that follows the ranking of its categories. We apply ordinal encoding only to the ordinal categorical columns among the input features (X).
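As a minimal sketch of the idea, the snippet below encodes a Review-style column with scikit-learn's OrdinalEncoder. The sample values are hypothetical and only mirror the bad/good/excellent ranking described above; the order of the list passed to categories is what defines the ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical review values, mirroring the Review column described above.
reviews = pd.DataFrame({"Review": ["good", "bad", "excellent", "good"]})

# The category list fixes the ranking: position 0 is the lowest rank.
encoder = OrdinalEncoder(categories=[["bad", "good", "excellent"]])
encoded = encoder.fit_transform(reviews)

# OrdinalEncoder returns a 2-D float array, one column per input feature.
print(encoded.ravel().tolist())  # [1.0, 0.0, 2.0, 1.0]
```

If the categories argument is left out, the encoder falls back to sorted order of the values it sees, which would not match the intended ranking here.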

Understanding Label Encoding

We use this technique only for the target label, i.e. 'Y'. Label encoding maps the target classes to integer values from 0 to (n_classes - 1).

The resulting integers carry no meaningful order or rank; each class simply receives a unique numeric value. In scikit-learn's LabelEncoder, the values are assigned in the sorted order of the class labels.
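A short sketch with scikit-learn's LabelEncoder illustrates this. The three-class target below is hypothetical and chosen only to show that the assigned integers follow the sorted label order rather than any real ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-class target, used only for illustration.
y = ["medium", "low", "high", "low", "medium"]

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# classes_ holds the labels in sorted order; each label's position is its code.
print(label_encoder.classes_.tolist())  # ['high', 'low', 'medium']
print(y_encoded.tolist())               # [2, 1, 0, 1, 2]

# inverse_transform maps the integer codes back to the original labels.
print(label_encoder.inverse_transform([0, 2]).tolist())  # ['high', 'medium']
```

Note that 'high' gets 0 and 'low' gets 1, which is why these codes should not be read as a ranking.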


Implementation

import numpy as np
import pandas as pd
# Create a small dataset to walk through the implementation
data = {
    'Student id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
    'Grade': ['D', 'A', 'C', 'F', 'C', 'A', 'E', 'B', 'D', 'F', 'B'],
    'Remarks': ['Improve', 'Excellent', 'Good', 'WorkHard', 'Good', 'Excellent', 'Try Again', 'Great', 'Improve', 'WorkHard', 'Great'],
    'Result': ['P', 'P', 'P', 'F', 'P', 'P', 'F', 'P', 'P', 'F', 'P'],
}
# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)
# View the table to analyze the data
df

The 'Student id' column already holds numeric values, so we drop it and encode only the remaining columns.

# Keep every column except the first one ('Student id')
df = df.iloc[:, 1:]
df

Table with categorical columns only

# Split the data frame into train and test sets:
# the first two columns (Grade, Remarks) are the inputs, the last column (Result) is the target
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2)
# To perform ordinal encoding, import OrdinalEncoder from sklearn
from sklearn.preprocessing import OrdinalEncoder
# Let's look at the split data frame
X_train

X_train original

Y_train.head(8)

Y_train original

# Create an OrdinalEncoder object and pass the categories parameter as a list of lists,
# one list per column, each ordered from the lowest rank to the highest
oe = OrdinalEncoder(categories=[['F', 'E', 'D', 'C', 'B', 'A'], ['WorkHard', 'Try Again', 'Improve', 'Good', 'Great', 'Excellent']])
oe.fit(X_train)
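The walkthrough stops at fitting the encoder. One possible way to finish it is sketched below: transform the input columns with the fitted OrdinalEncoder and encode the Result target with LabelEncoder. The block rebuilds the same categorical columns so that it runs on its own. The random_state and stratify arguments are assumptions added here for reproducibility and to make sure both result classes appear in the training target; they are not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Same categorical columns as the dataset above.
df = pd.DataFrame({
    "Grade": ["D", "A", "C", "F", "C", "A", "E", "B", "D", "F", "B"],
    "Remarks": ["Improve", "Excellent", "Good", "WorkHard", "Good", "Excellent",
                "Try Again", "Great", "Improve", "WorkHard", "Great"],
    "Result": ["P", "P", "P", "F", "P", "P", "F", "P", "P", "F", "P"],
})

# Assumption: random_state and stratify are added for this sketch only.
X_train, X_test, Y_train, Y_test = train_test_split(
    df[["Grade", "Remarks"]], df["Result"],
    test_size=0.2, random_state=0, stratify=df["Result"],
)

# Ordinal encoding for the input columns, ranked from lowest to highest.
oe = OrdinalEncoder(categories=[
    ["F", "E", "D", "C", "B", "A"],
    ["WorkHard", "Try Again", "Improve", "Good", "Great", "Excellent"],
])
oe.fit(X_train)
X_train_enc = oe.transform(X_train)
X_test_enc = oe.transform(X_test)

# Label encoding for the target column only.
le = LabelEncoder()
le.fit(Y_train)
Y_train_enc = le.transform(Y_train)
Y_test_enc = le.transform(Y_test)

print(X_train_enc)
print(le.classes_, Y_train_enc)
```

Both encoders are fitted on the training split and then reused on the test split, so the same mapping applies to both.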

Conclusion

Ordinal encoding is used for categorical input variables with a natural ranking, while label encoding is applied to the target label, assigning each class a unique numeric value without implying any order. Matching the encoding method to the type of data keeps the numeric representation faithful to what the categories actually mean.