Ordinal & Label Encoding in Machine Learning

Introduction 

In machine learning, dealing with categorical variables is a common task. Categorical variables represent data that can take a limited, fixed number of possible values, such as colors, types of animals, or levels of education. However, most machine learning algorithms require numerical input, which means that categorical variables must be converted into numerical form before they can be used for training a model.

Understanding Ordinal Encoding

In general, categorical data comes in two types: nominal and ordinal. Nominal data can be sorted into categories but has no inherent order or ranking, whereas ordinal data has a predefined natural order. For nominal data we typically use One-Hot Encoding, and for ordinal data we prefer Ordinal Encoding.

Example: let's look at ordinal and nominal data side by side.

Ordinal and nominal data

In the table above, both columns hold categorical values. The Review column can be converted to a numeric ranking, such as 0 for bad, 1 for good, and 2 for excellent, so it is ordinal data. The Car_Brand column can only be split into distinct categories with no meaningful numeric order, so it is nominal data.
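To make the distinction concrete, here is a minimal sketch on a small table. The brand names and review values are invented for illustration and are not the article's original table; the dictionary mapping and `pd.get_dummies` are just one simple way to show the idea.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "Car_Brand": ["Toyota", "Ford", "BMW", "Ford"],
    "Review": ["good", "bad", "excellent", "good"],
})

# Ordinal column: the categories have a natural rank, so map them to integers.
review_rank = {"bad": 0, "good": 1, "excellent": 2}
df["Review_encoded"] = df["Review"].map(review_rank)

# Nominal column: no rank exists, so create one indicator column per brand.
brand_dummies = pd.get_dummies(df["Car_Brand"], prefix="Car_Brand")

print(df)
print(brand_dummies)
```

Each row of `brand_dummies` has exactly one active column, so no brand is treated as "greater" than another.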

Ordinal encoding converts an ordinal categorical feature into an array of integers. We use it only for the ordinal categorical columns among the input features.
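As a quick sketch of the idea, the snippet below encodes a single review column with scikit-learn's `OrdinalEncoder`. The review values are hypothetical; the point is that the `categories` argument lists the values from lowest to highest rank.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical review values, invented for illustration.
reviews = [["good"], ["bad"], ["excellent"], ["good"]]

# List the categories in rank order; without this argument the encoder
# would order them alphabetically instead.
encoder = OrdinalEncoder(categories=[["bad", "good", "excellent"]])
encoded = encoder.fit_transform(reviews)

print(encoded.ravel())
```

The result is a float array in which bad maps to 0, good to 1, and excellent to 2.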

Understanding Label Encoding 

We use this technique only for the target label, i.e. 'Y'. Label encoding maps each class of the target to an integer between 0 and n_classes - 1.

It does not impose any meaningful order or ranking; it only assigns a unique numeric value to each class.
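A minimal sketch with scikit-learn's `LabelEncoder` shows what this looks like in practice. The pass/fail labels here are hypothetical. Note that the encoder stores the classes in sorted order, so the integers reflect alphabetical position, not rank.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical pass/fail targets, invented for illustration.
labels = ["P", "F", "P", "P", "F"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_)                    # classes in sorted order
print(encoded)                        # integers in the range 0..n_classes-1
print(le.inverse_transform(encoded))  # back to the original labels
```

`inverse_transform` is handy later for turning model predictions back into readable labels.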


Implementation

import numpy as np
import pandas as pd

# Create a small dataset to demonstrate the implementation
data = {
    'Student id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
    'Grade': ['D', 'A', 'C', 'F', 'C', 'A', 'E', 'B', 'D', 'F', 'B'],
    'Remarks': ['Improve', 'Excellent', 'Good', 'WorkHard', 'Good', 'Excellent', 'Try Again', 'Great', 'Improve', 'WorkHard', 'Great'],
    'Result': ['P', 'P', 'P', 'F', 'P', 'P', 'F', 'P', 'P', 'F', 'P'],
}
# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)
# View the table to analyze the data
df

Initial data table

The Student id column is already numeric and is only an identifier, so we drop it and encode the remaining columns.

df=df.iloc[:,1:]
df

Table with only the categorical columns

# Split the data frame into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2)
# To perform ordinal encoding, import OrdinalEncoder from sklearn
from sklearn.preprocessing import OrdinalEncoder
# Let's look at the split data frame
X_train

X_train original

Y_train.head(8)

Y_train original

# Create an OrdinalEncoder object and pass the categories in rank order, one list per column
oe = OrdinalEncoder(categories=[['F', 'E', 'D', 'C', 'B', 'A'], ['WorkHard', 'Try Again', 'Improve', 'Good', 'Great', 'Excellent']])
oe.fit(X_train)

categories

Now transform X_train and X_test.

X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
# Check the encoded form of X_train
X_train

X_train encoded

X_test

X_test encoded

We have successfully completed the ordinal encoding process. The input data, i.e. X_train and X_test, is now ready to be fed into an ML model.
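One way to sanity-check the encoding is to decode it again with `inverse_transform` and compare against the original values. The sketch below reuses the article's category lists on a small hand-picked sample (the three rows are chosen here for illustration, not taken from a particular split).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Category lists in rank order, as used in the article.
grades = ['F', 'E', 'D', 'C', 'B', 'A']
remarks = ['WorkHard', 'Try Again', 'Improve', 'Good', 'Great', 'Excellent']

# A small illustrative sample of rows.
sample = pd.DataFrame({
    'Grade': ['D', 'A', 'F'],
    'Remarks': ['Improve', 'Excellent', 'WorkHard'],
})

oe = OrdinalEncoder(categories=[grades, remarks])
encoded = oe.fit_transform(sample)

# Decode the integers back into the original category strings.
decoded = oe.inverse_transform(encoded)

print(encoded)
print(decoded)
```

Each encoded value is simply the position of the category in its list, so 'F' and 'WorkHard' become 0 while 'A' and 'Excellent' become 5.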

# Import LabelEncoder from sklearn to perform label encoding
from sklearn.preprocessing import LabelEncoder
# Create a LabelEncoder object and fit it on the training labels
le = LabelEncoder()
le.fit(Y_train)

Label encoder ready

le.classes_

check encoded

# Transform the training and test labels
Y_train = le.transform(Y_train)
Y_test = le.transform(Y_test)
# View the encoded Y_train
Y_train

Y_train final

# View Y_test in encoded form
Y_test

Y_test final

We have now successfully encoded X_train, X_test, Y_train, and Y_test.
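To show how the encoded arrays might be used, here is a rough sketch that fits a classifier on them. The choice of `DecisionTreeClassifier` is an assumption made purely for illustration (the article does not name a model), and the sketch trains on the whole dataset without a split to keep the output deterministic.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Feature and target columns from the article's dataset.
X = pd.DataFrame({
    'Grade': ['D', 'A', 'C', 'F', 'C', 'A', 'E', 'B', 'D', 'F', 'B'],
    'Remarks': ['Improve', 'Excellent', 'Good', 'WorkHard', 'Good', 'Excellent',
                'Try Again', 'Great', 'Improve', 'WorkHard', 'Great'],
})
y = ['P', 'P', 'P', 'F', 'P', 'P', 'F', 'P', 'P', 'F', 'P']

# Encode the input features with ordinal encoding (categories in rank order).
oe = OrdinalEncoder(categories=[
    ['F', 'E', 'D', 'C', 'B', 'A'],
    ['WorkHard', 'Try Again', 'Improve', 'Good', 'Great', 'Excellent'],
])
X_enc = oe.fit_transform(X)

# Encode the target with label encoding.
le = LabelEncoder()
y_enc = le.fit_transform(y)

# Illustrative model choice; any estimator that accepts numeric input would do.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_enc, y_enc)

# Convert the numeric predictions back into the original 'P'/'F' labels.
predicted = le.inverse_transform(clf.predict(X_enc))
print(predicted)
```

In a real workflow the encoders would be fitted on the training split only and then applied to the test split, as in the steps above.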

Conclusion 

Ordinal encoding is used for categorical variables with a natural ranking, while label encoding is applied to the target label, assigning unique numeric values without establishing order. Choosing the right encoding method ensures accurate representation for machine learning models.
