Introduction
In machine learning, dealing with categorical variables is a common task. Categorical variables represent data that can take a limited, fixed number of possible values, such as colors, types of animals, or levels of education. However, most machine learning algorithms require numerical input, which means that categorical variables must be converted into numerical form before they can be used for training a model.
Understanding Ordinal Encoding
Categorical data generally falls into two types: nominal and ordinal. Nominal data can be sorted into categories that have no inherent order or ranking, whereas ordinal data has a predefined, natural order. For nominal data we typically use One-Hot Encoding; for ordinal data we prefer Ordinal Encoding.
Example. Let's understand ordinal and nominal data with an example.
Consider a table with two categorical columns, Review and Car_Brand. The Review column can be converted into a numeric ranking, such as 0 for bad, 1 for good and 2 for excellent, so it is ordinal data. The Car_Brand column can only be split into distinct categories that have no numeric ranking, so it is nominal data.
In ordinal encoding, we encode an ordinal categorical feature as an integer array. We apply this technique only to the ordinal categorical columns among the input features.
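As a minimal sketch of this distinction, the snippet below builds a small hypothetical table (the rows and brand names are made up for illustration and are not taken from the original table). It encodes Review with scikit-learn's OrdinalEncoder and Car_Brand with pandas' get_dummies, which is one common way to one-hot encode a column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data; the values are assumptions for illustration only.
cars = pd.DataFrame({
    "Review": ["good", "bad", "excellent", "good"],
    "Car_Brand": ["Toyota", "Ford", "Honda", "Ford"],
})

# Ordinal column: list the categories from lowest to highest rank.
review_encoder = OrdinalEncoder(categories=[["bad", "good", "excellent"]])
review_codes = review_encoder.fit_transform(cars[["Review"]])
print(review_codes.ravel())  # [1. 0. 2. 1.]

# Nominal column: one indicator column per brand, with no implied order.
brand_dummies = pd.get_dummies(cars["Car_Brand"])
print(sorted(brand_dummies.columns))  # ['Ford', 'Honda', 'Toyota']
```

The explicit category list is what gives the integers their meaning; without it, the encoder would fall back to sorted order, which rarely matches the real ranking.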
Understanding Label Encoding
We use this technique only for the target label, i.e. 'Y'. Label encoding maps the target label to integer values between 0 and n_classes - 1.
It does not impose any meaningful ordering or ranking; it only assigns a unique integer to each class in the data.
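A short sketch of this behaviour, using a hypothetical list of pass/fail labels chosen only for illustration: LabelEncoder stores the classes in sorted order and maps each one to its index in that list.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels, for illustration only.
y = ["P", "F", "P", "P", "F"]

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y)

# classes_ holds the distinct labels in sorted order; the code is the index.
print(le_demo.classes_)  # ['F' 'P']
print(y_encoded)         # [1 0 1 1 0]

# inverse_transform maps the integers back to the original labels.
y_restored = le_demo.inverse_transform(y_encoded)
```

Here 'F' becomes 0 and 'P' becomes 1 purely because of alphabetical sorting, not because one class ranks above the other.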
Implementation
import numpy as np
import pandas as pd
# Create a dataset to illustrate the implementation
data = {'Student id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
        'Grade': ['D', 'A', 'C', 'F', 'C', 'A', 'E', 'B', 'D', 'F', 'B'],
        'Remarks': ['Improve', 'Excellent', 'Good', 'WorkHard', 'Good', 'Excellent', 'Try Again', 'Great', 'Improve', 'WorkHard', 'Great'],
        'Result': ['P', 'P', 'P', 'F', 'P', 'P', 'F', 'P', 'P', 'F', 'P'],
        }
# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)
# View the table to analyze the data
df
The 'Student id' column already contains only numeric values, so we drop it and encode the remaining columns.
df = df.iloc[:, 1:]
df
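# Split the data frame into train and test sets
from sklearn.model_selection import train_test_split
# stratify keeps both classes in the training set, so the label encoder fitted on Y_train
# never meets an unseen class in Y_test; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2, random_state=42, stratify=df.iloc[:, -1])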
# To perform ordinal encoding, import OrdinalEncoder from sklearn
from sklearn.preprocessing import OrdinalEncoder
# Let's look at our split data frame
X_train
Y_train.head(8)
# Create an OrdinalEncoder object and pass the categories parameter as a list of lists:
# one list per column, each ordered from lowest to highest rank
oe = OrdinalEncoder(categories=[['F', 'E', 'D', 'C', 'B', 'A'], ['WorkHard', 'Try Again', 'Improve', 'Good', 'Great', 'Excellent']])
oe.fit(X_train)
Now transform X_train and X_test.
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
# Now check the encoded form of X_train
X_train
X_test
We have successfully completed the ordinal encoding process. The input data, i.e. X_train and X_test, is now ready to be used with any ML model.
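To check what the encoder actually learned, it can help to round-trip a few values. The sketch below is self-contained and uses a small hypothetical Grade column rather than the split above. The handle_unknown and unknown_value settings are an optional addition, not part of the walkthrough, that maps categories missing from the list to -1 instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column example, for illustration only.
grades = pd.DataFrame({"Grade": ["D", "A", "C", "F"]})
grade_order = ["F", "E", "D", "C", "B", "A"]  # lowest to highest

enc = OrdinalEncoder(
    categories=[grade_order],
    handle_unknown="use_encoded_value",  # optional: do not fail on unseen values
    unknown_value=-1,                    # code assigned to unseen values
)
encoded = enc.fit_transform(grades)
print(encoded.ravel())  # [2. 5. 3. 0.]

# inverse_transform recovers the original categories from the integer codes.
decoded = enc.inverse_transform(encoded)
print(decoded.ravel())  # ['D' 'A' 'C' 'F']

# A grade that is not in grade_order is encoded as -1.
unseen = pd.DataFrame({"Grade": ["A", "Z"]})
unseen_codes = enc.transform(unseen).ravel().tolist()
print(unseen_codes)  # [5.0, -1.0]
```

Each code is simply the position of the category in the list passed to categories, which is why the order of that list matters.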
# Now import LabelEncoder from sklearn to perform label encoding
from sklearn.preprocessing import LabelEncoder
# Create the object of the LabelEncoder Class
le = LabelEncoder()
le.fit(Y_train)
le.classes_
# Transform Y_train and Y_test with the fitted encoder
Y_train = le.transform(Y_train)
Y_test = le.transform(Y_test)
# View Y_train in encoded form
Y_train
# View Y_test in encoded form
Y_test
We have now successfully encoded X_train, X_test, Y_train and Y_test.
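As a rough end-to-end sketch, the encoded arrays can be passed straight to an estimator. The example below rebuilds the same dataset so that it runs on its own. The choice of DecisionTreeClassifier, the lowercase variable names, and the random_state and stratify settings are assumptions made for illustration; any scikit-learn classifier could take its place.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Grade": ["D", "A", "C", "F", "C", "A", "E", "B", "D", "F", "B"],
    "Remarks": ["Improve", "Excellent", "Good", "WorkHard", "Good", "Excellent",
                "Try Again", "Great", "Improve", "WorkHard", "Great"],
    "Result": ["P", "P", "P", "F", "P", "P", "F", "P", "P", "F", "P"],
})

X = df[["Grade", "Remarks"]]
y = df["Result"]

# Stratify so that both classes appear in the training labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Encode the features with an explicit low-to-high order per column.
oe = OrdinalEncoder(categories=[
    ["F", "E", "D", "C", "B", "A"],
    ["WorkHard", "Try Again", "Improve", "Good", "Great", "Excellent"],
])
X_train_enc = oe.fit_transform(X_train)
X_test_enc = oe.transform(X_test)

# Encode the target; fit on the training labels only.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

# Fit an example classifier on the encoded data and predict on the test set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train_enc, y_train_enc)
predictions = model.predict(X_test_enc)

# Map the integer predictions back to the original 'P' / 'F' labels.
predicted_labels = le.inverse_transform(predictions)
print(predicted_labels)
```

Both encoders are fitted on the training portion and only applied to the test portion, which mirrors the steps in the walkthrough.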
Conclusion
Ordinal encoding is used for categorical variables with a natural ranking, while label encoding is applied to the target label, assigning unique numeric values without establishing order. Choosing the right encoding method ensures accurate representation for machine learning models.