Introduction
In the realm of machine learning, data preprocessing plays a pivotal role in shaping the performance of predictive models. One essential technique that often flies under the radar is the Yeo-Johnson Transform. This versatile method for data transformation offers advantages over traditional techniques like the Box-Cox Transform by accommodating both positive and negative values, making it an invaluable tool in the data scientist's arsenal. In this article, we'll delve into the intricacies of the Yeo-Johnson Transform, its benefits, and how to implement it with practical coding examples.
Yeo-Johnson Transformation
Developed by In-Kwon Yeo and Richard A. Johnson in 2000, the Yeo-Johnson Transform is a modification of the Box-Cox Transform, which was designed to handle only positive data. The Yeo-Johnson Transform extends this functionality to data with both positive and negative values, offering greater flexibility in data preprocessing tasks.
The Yeo-Johnson Transform is defined as:

$$
y_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\; x_i \geq 0 \\[4pt]
\ln(x_i + 1), & \lambda = 0,\; x_i \geq 0 \\[4pt]
-\dfrac{(1 - x_i)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\; x_i < 0 \\[4pt]
-\ln(1 - x_i), & \lambda = 2,\; x_i < 0
\end{cases}
$$

Where
- $x_i$ is the original data point.
- $y_i^{(\lambda)}$ is the transformed data point.
- $\lambda$ is the transformation parameter, which can be optimized (typically by maximum likelihood) to make the transformed data as close to normal as possible.
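To make the piecewise definition concrete, here is a minimal NumPy sketch of the formula. The function name yeo_johnson and the sample values are illustrative, not from any library:

import numpy as np

def yeo_johnson(x, lam):
    # Apply the piecewise Yeo-Johnson formula element-wise
    x = np.asarray(x, dtype=float)
    pos = x >= 0
    out = np.empty_like(x)
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])        # log(x + 1) branch for lambda = 0
    if lam != 2:
        out[~pos] = -(((1 - x[~pos]) ** (2 - lam) - 1) / (2 - lam))
    else:
        out[~pos] = -np.log1p(-x[~pos])    # -log(1 - x) branch for lambda = 2
    return out

print(yeo_johnson([-3.0, -1.0, 0.0, 2.0, 10.0], lam=0.5))

Note that every branch maps zero to zero, which is exactly the behavior highlighted in the next section.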
Key Benefits of Yeo-Johnson Transform
- Handles Both Positive and Negative Values: Unlike the Box-Cox Transform, which is limited to strictly positive data, the Yeo-Johnson Transform accommodates data with a broader range of values, including negative ones (see the comparison sketch after this list).
- Preserves Zero Values: The Yeo-Johnson Transform maps zero to zero (before any standardization), making it suitable for datasets with a mixture of zero and non-zero values.
- Flexibility in Transformation: The transformation parameter λ allows flexible adjustment of the transformation, enabling customization based on the distribution of the data.
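A quick way to see the first benefit is to hand the same mixed-sign sample to both transforms via SciPy: Box-Cox rejects it, while Yeo-Johnson estimates λ and proceeds. A small sketch, assuming SciPy is installed:

import numpy as np
from scipy import stats

sample = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])

# Box-Cox requires strictly positive input and raises an error here
try:
    stats.boxcox(sample)
except ValueError as err:
    print("Box-Cox failed:", err)

# Yeo-Johnson handles the same sample, estimating lambda by maximum likelihood
transformed, lam = stats.yeojohnson(sample)
print("Yeo-Johnson lambda:", lam)
print("Transformed:", transformed)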
Implementation
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split

# Load the California housing dataset
# (load_boston was removed from scikit-learn in version 1.2)
housing = fetch_california_housing()
data = pd.DataFrame(housing.data, columns=housing.feature_names)
target = pd.DataFrame(housing.target, columns=['MedHouseVal'])

# Concatenate features and target variable
df = pd.concat([data, target], axis=1)

# View the dataframe
df.head()
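Before transforming, it is worth quantifying how skewed the target actually is; a pandas one-liner is enough (the exact value depends on the dataset version):

# Check the skewness of the target; values far from 0 suggest a transform may help
print("Target skewness:", df['MedHouseVal'].skew())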
Now split the features and target into training and testing sets.
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
# Apply Yeo-Johnson Transform to the target variable
transformer = PowerTransformer(method='yeo-johnson', standardize=True)
y_train_transformed = transformer.fit_transform(y_train)
y_test_transformed = transformer.transform(y_test)
# Convert transformed data back to DataFrames
y_train_transformed = pd.DataFrame(y_train_transformed, columns=['MedHouseVal'])
y_test_transformed = pd.DataFrame(y_test_transformed, columns=['MedHouseVal'])
# Display transformed target variable
print("Transformed Train Target:")
print(y_train_transformed.head())
print("\nTransformed Test Target:")
print(y_test_transformed.head())
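Two follow-ups are often useful in practice: inspecting the fitted λ, and mapping values (for example, model predictions) back to the original scale with inverse_transform. A short sketch using the transformer fitted above:

# The lambda estimated by maximum likelihood for the target column
print("Fitted lambda:", transformer.lambdas_)

# Map transformed values back to the original scale (e.g., for predictions)
y_back = transformer.inverse_transform(y_train_transformed)
print("Round-trip matches original:", np.allclose(y_back, y_train.to_numpy()))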
Conclusion
The Yeo-Johnson transform is a valuable technique for normalizing skewed numerical features and targets in machine learning. By bringing data closer to a normal distribution, it can improve the performance and interpretability of models that are sensitive to skew. Remember that data normalization is not a one-size-fits-all solution, but the Yeo-Johnson transform offers a powerful and versatile approach for many machine learning tasks.