Introduction
Many machine learning models accept only numeric input, so categorical data cannot be fed to them directly. To use categorical features, we first need to convert them into numbers. One-hot encoding is a common technique for this conversion, and this article walks through how to apply it.
One-Hot Encoding
One-hot encoding is a technique used to transform categorical variables into a binary representation suitable for machine learning models. It essentially creates a new binary column for each unique category within the original column. Each new column represents the presence (1) or absence (0) of that particular category for each data point.
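To make the idea concrete, here is a minimal from-scratch sketch in plain Python. The `one_hot` helper and the `colors` list are hypothetical names used only for illustration, and sorting the categories to fix the column order is an assumption of this sketch, not a requirement of the technique.

```python
def one_hot(values):
    """Hypothetical helper: return (categories, rows), one 0/1 slot per category."""
    categories = sorted(set(values))  # assumed: sort to get a stable column order
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # start with every category absent
        row[position[value]] = 1      # mark the one category that is present
        rows.append(row)
    return categories, rows


colors = ["red", "blue", "green", "blue"]
categories, rows = one_hot(colors)

print(categories)
for color, row in zip(colors, rows):
    print(color, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.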
Because each category gets its own column, the model does not read a false ordering into the categories, as it might if they were simply mapped to integers such as 1, 2, 3. The trade-off is higher dimensionality: a column with k unique categories becomes k columns.
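The growth in dimensionality can be checked directly. The sketch below uses a made-up `city` column (an assumption for illustration) and compares the shape before and after encoding. It keeps the encoder's default sparse output, since only the shape is needed here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column with four distinct values
cities = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Mumbai", "Pune", "Chennai", "Delhi"]}
)

# The default output is a sparse matrix, which still exposes .shape
encoded = OneHotEncoder().fit_transform(cities[["city"]])
n_unique = cities["city"].nunique()

print("before:", cities.shape)
print("after: ", encoded.shape)
print("unique categories:", n_unique)
```

The number of encoded columns matches the number of unique categories, so a column with thousands of distinct values would produce thousands of new columns.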
Example: A lot of valuable data comes in the form of categories, like "color" (red, blue, green) or "size" (small, medium, large). Machine learning algorithms that expect numbers cannot use these values until they are encoded.
Implementation of One-Hot Encoding
Here, we will implement One-Hot Encoding using Python.
Let's import the libraries.
import pandas as pd
# for one-hot encoding, import OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder
Now, build a small dataset for the implementation.
data = {'Student id': [10, 20, 15, 25, 30],
'Result': ['P', 'F', 'F', 'P', 'F'],
'Remarks': ['Good', 'Improve', 'WorkHard', 'Great', 'Improve'],
}
Convert this dataset into a pandas DataFrame.
df = pd.DataFrame(data)
View the DataFrame for a better understanding.
df
Now, we will extract the categorical columns from the DataFrame, i.e. the columns with the object datatype.
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
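Let’s initialize the OneHotEncoder. Setting sparse_output=False makes it return a dense NumPy array instead of a sparse matrix. (This parameter was called sparse before scikit-learn 1.2, and the old name has since been removed.)
encoder = OneHotEncoder(sparse_output=False)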
Apply the OneHotEncoder to the categorical columns of the DataFrame.
one_hot_encoded = encoder.fit_transform(df[categorical_columns])
Now create a new dataframe with the One Hot Encoded columns and use get_feature_names_out() to get the column names for the encoded data.
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))
Let's concatenate the One Hot Encoded dataframe and the original dataframe and give it a new name.
df_final = pd.concat([df, one_hot_df], axis=1)
See the result.
df_final
We have successfully encoded the categorical values, so now we can drop the original categorical columns.
df_final = df_final.drop(categorical_columns, axis=1)
Let's see the final result of the One-Hot Encoding technique.
df_final
Conclusion
One-hot encoding is a widely used technique for handling categorical data in machine learning. By converting categorical variables into binary vectors, it lets algorithms that expect numeric input process and learn from such data, at the cost of adding one column per category. Knowing when and how to apply it is an important part of preparing data for machine learning models.