## What is Data Normalization?
Data normalization is the process of adjusting values in a dataset to a common scale, without distorting differences in the ranges of values.
Raw datasets often contain features with different scales (e.g., age in years vs. salary in dollars).
Without normalization, models may give more importance to large-scale features.
Example: If one feature ranges from 1–1000 and another from 0–1, the model might prioritize the larger-range feature.
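This effect is easy to see with a distance calculation. The sketch below uses two hypothetical points with one large-scale feature and one small-scale feature; the Euclidean distance is driven almost entirely by the large-scale feature:

```python
import numpy as np

# Hypothetical points: feature 1 on a ~1-1000 scale, feature 2 on a 0-1 scale
a = np.array([100.0, 0.1])
b = np.array([900.0, 0.9])

dist = np.linalg.norm(a - b)          # dominated by feature 1
contribution_f1 = (a[0] - b[0]) ** 2  # 640000.0
contribution_f2 = (a[1] - b[1]) ** 2  # 0.64

print(dist, contribution_f1, contribution_f2)
```

Even though both features changed by the same fraction of their range, feature 2 contributes almost nothing to the distance, which is exactly why distance-based models need normalized inputs.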
## Why is Normalization Important in Machine Learning?
- **Equal importance:** Prevents features with larger values from dominating.
- **Faster convergence:** Speeds up gradient descent when training neural networks.
- **Better accuracy:** Improves performance for algorithms sensitive to scale (e.g., KNN, SVM, Logistic Regression).
## Common Normalization Techniques in Python
### 1. Min-Max Normalization (Rescaling)

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Rescales values to the [0, 1] range. Best for algorithms requiring bounded values.
```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
```
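As a sanity check, the scaler's output can be reproduced by applying the formula directly; a minimal sketch reusing the same `data` array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10], [20], [30], [40], [50]])

# Apply the min-max formula by hand and compare with the scaler
manual = (data - data.min()) / (data.max() - data.min())
scaled = MinMaxScaler().fit_transform(data)

assert np.allclose(manual, scaled)
print(manual.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```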
### 2. Z-Score Normalization (Standardization)

$$z = \frac{x - \mu}{\sigma}$$

Centers each feature at mean 0 with standard deviation 1. Useful when the data follows a Gaussian distribution.
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])

scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
```
### 3. Robust Normalization

Centers on the median and scales by the interquartile range (IQR), making it resilient to outliers.
```python
from sklearn.preprocessing import RobustScaler
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])

scaler = RobustScaler()
robust_data = scaler.fit_transform(data)
print(robust_data)
```
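Because `RobustScaler` uses the median and IQR rather than mean and standard deviation, one extreme value barely shifts the other points. A sketch using the same values with a hypothetical outlier appended:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same data with one extreme outlier appended (hypothetical)
data = np.array([[10], [20], [30], [40], [1000]])

robust = RobustScaler().fit_transform(data)

# Default behavior: (x - median) / IQR, with IQR = Q3 - Q1
median = np.median(data)                # 30.0
q1, q3 = np.percentile(data, [25, 75])  # 20.0, 40.0
manual = (data - median) / (q3 - q1)

assert np.allclose(robust, manual)
print(robust.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

Note how the inliers keep small, evenly spaced values while the outlier is pushed far away, instead of the outlier compressing everything else as it would under min-max scaling.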
### 4. L2 Normalization

Scales each data point (row vector) so that the sum of its squared values equals 1, i.e., unit Euclidean norm. Useful in text classification, NLP, and clustering.
```python
from sklearn.preprocessing import Normalizer
import numpy as np

data = np.array([[10], [20], [30], [40], [50]])

scaler = Normalizer(norm='l2')
l2_normalized = scaler.fit_transform(data)
# Note: with a single feature per row, every row normalizes to 1.0
print(l2_normalized)
```
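`Normalizer` works per row (per sample), not per column, so a multi-feature example shows the effect more clearly than the single-column `data` above. A minimal sketch with hypothetical two-feature rows:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical 2-D rows; each row is scaled to unit Euclidean length
X = np.array([[3.0, 4.0], [1.0, 1.0]])
l2 = Normalizer(norm='l2').fit_transform(X)

print(l2)                          # [[0.6 0.8], [~0.7071 ~0.7071]]
print(np.linalg.norm(l2, axis=1))  # every row now has norm 1.0
```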
## When to Use Which Normalization?
| Technique | When to Use | Example Use Case |
|---|---|---|
| Min-Max | When you need bounded values (0–1) | Neural networks |
| Z-Score | When data is normally distributed | Logistic regression |
| Robust | When the dataset has many outliers | Financial data |
| L2 Normalization | When working with vectors | NLP, text mining |
## Key Takeaways
- Data normalization is essential for fair feature comparison.
- Python's scikit-learn provides easy-to-use tools (`MinMaxScaler`, `StandardScaler`, `RobustScaler`, `Normalizer`).
- Choose the right normalization method based on the data distribution and the algorithm's requirements.
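One usage note worth keeping in mind with any of these scalers: fit on the training split only, then reuse the learned statistics on the test split, so no information leaks from test data into preprocessing. A minimal sketch with hypothetical train/test arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10], [20], [30], [40], [50]])
X_test = np.array([[25], [60]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from train only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics

print(X_test_scaled.ravel())  # [0.375 1.25] -- test values can fall outside [0, 1]
```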