Introduction
In the realm of machine learning, data preprocessing plays a pivotal role in shaping the performance of predictive models. One essential technique that often flies under the radar is the Yeo-Johnson Transform. This versatile method for data transformation offers advantages over traditional techniques like the Box-Cox Transform by accommodating both positive and negative values, making it an invaluable tool in the data scientist's arsenal. In this article, we'll delve into the intricacies of the Yeo-Johnson Transform, its benefits, and how to implement it with practical coding examples.
Yeo-Johnson Transformation
Developed by In-Kwon Yeo and Richard A. Johnson in 2000, the Yeo-Johnson Transform is a modification of the Box-Cox Transform, which was designed to handle only positive data. The Yeo-Johnson Transform extends this functionality to data with both positive and negative values, offering greater flexibility in data preprocessing tasks.
The Yeo-Johnson Transform is defined as:

$$
y_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\; x_i \geq 0 \\[4pt]
\ln(x_i + 1), & \lambda = 0,\; x_i \geq 0 \\[4pt]
-\dfrac{(1 - x_i)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\; x_i < 0 \\[4pt]
-\ln(1 - x_i), & \lambda = 2,\; x_i < 0
\end{cases}
$$

Where
- $x_i$ is the original data point.
- $y_i^{(\lambda)}$ is the transformed data point.
- $\lambda$ is the transformation parameter, which can be optimized (typically by maximum likelihood) to make the transformed data as close to normal as possible.
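To make the piecewise definition concrete, here is a minimal NumPy sketch of the formula. The function name yeo_johnson and the sample values are illustrative, not from any library:

import numpy as np

def yeo_johnson(x, lam):
    # Apply the piecewise Yeo-Johnson formula element-wise
    x = np.asarray(x, dtype=float)
    pos = x >= 0
    out = np.empty_like(x)
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])        # log(x + 1) branch for lambda = 0
    if lam != 2:
        out[~pos] = -(((1 - x[~pos]) ** (2 - lam) - 1) / (2 - lam))
    else:
        out[~pos] = -np.log1p(-x[~pos])    # -log(1 - x) branch for lambda = 2
    return out

print(yeo_johnson([-3.0, -1.0, 0.0, 2.0, 10.0], lam=0.5))

Note that every branch maps zero to zero, which is exactly the behavior highlighted in the next section.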
Key Benefits of Yeo-Johnson Transform
- Handles Both Positive and Negative Values: Unlike the Box-Cox Transform, which is limited to strictly positive data, the Yeo-Johnson Transform accommodates data with a broader range of values, including negative ones (see the comparison sketch after this list).
- Preserves Zero Values: The Yeo-Johnson Transform maps zero to zero (before any standardization), making it suitable for datasets with a mixture of zero and non-zero values.
- Flexibility in Transformation: The transformation parameter λ allows flexible adjustment of the transformation, enabling customization based on the distribution of the data.
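A quick way to see the first benefit is to hand the same mixed-sign sample to both transforms via SciPy: Box-Cox rejects it, while Yeo-Johnson estimates λ and proceeds. A small sketch, assuming SciPy is installed:

import numpy as np
from scipy import stats

sample = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# Box-Cox requires strictly positive input and raises an error here
try:
    stats.boxcox(sample)
except ValueError as err:
    print("Box-Cox failed:", err)

# Yeo-Johnson handles the same sample, estimating lambda by maximum likelihood
transformed, lam = stats.yeojohnson(sample)
print("Yeo-Johnson lambda:", lam)
print("Transformed:", transformed)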
Implementation
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split

# Load the California housing dataset
# (load_boston was removed from scikit-learn in version 1.2)
housing = fetch_california_housing()
data = pd.DataFrame(housing.data, columns=housing.feature_names)
target = pd.DataFrame(housing.target, columns=['MedHouseVal'])

# Concatenate features and target variable
df = pd.concat([data, target], axis=1)

# View the dataframe
df.head()
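Before transforming, it is worth quantifying how skewed the target actually is; a pandas one-liner is enough (the exact value depends on the dataset version):

# Check the skewness of the target; values far from 0 suggest a transform may help
print("Target skewness:", df['MedHouseVal'].skew())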
Now split the features and target into training and testing sets.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
# Apply Yeo-Johnson Transform to the target variable
transformer = PowerTransformer(method='yeo-johnson', standardize=True)
y_train_transformed = transformer.fit_transform(y_train)
y_test_transformed = transformer.transform(y_test)
# Convert transformed data back to DataFrames
y_train_transformed = pd.DataFrame(y_train_transformed, columns=['MedHouseVal'])
y_test_transformed = pd.DataFrame(y_test_transformed, columns=['MedHouseVal'])
# Display transformed target variable
print("Transformed Train Target:")
print(y_train_transformed.head())
print("\nTransformed Test Target:")
print(y_test_transformed.head())
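Two follow-ups are often useful in practice: inspecting the fitted λ, and mapping values (for example, model predictions) back to the original scale with inverse_transform. A short sketch using the transformer fitted above:

# The lambda estimated by maximum likelihood for the target column
print("Fitted lambda:", transformer.lambdas_)

# Map transformed values back to the original scale (e.g., for predictions)
y_back = transformer.inverse_transform(y_train_transformed)
print("Round-trip matches original:", np.allclose(y_back, y_train.to_numpy()))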
Conclusion
The Yeo-Johnson transform is a valuable technique for normalizing skewed numerical features and targets in machine learning. By bringing data closer to a normal distribution, it can improve the performance and interpretability of models that are sensitive to skew. Remember that data normalization is not a one-size-fits-all solution, but the Yeo-Johnson transform offers a powerful and versatile approach for many machine learning tasks.