
Feature Scaling: Normalization vs. Standardization Explained

In the last lesson, we tackled handling missing data and outliers, which are key steps in cleaning raw data. Now, we move to feature scaling, a process that ensures all features contribute equally to model performance. Without scaling, features with larger values can dominate those with smaller ones, leading to biased results. This lesson will help you understand normalization and standardization, two common scaling methods, and when to use each.

Why Feature Scaling Matters

I once worked on a project where I built a model to predict house prices. The dataset had features like the number of rooms (ranging from 1 to 5) and the house area (ranging from 500 to 5000 square feet). When I trained the model without scaling, the house area dominated the number of rooms, causing poor predictions. This happened because distance-based algorithms, like K-Nearest Neighbors (KNN) or Support Vector Machines (SVM), rely on the magnitude of features. Scaling puts all features on a comparable scale, so no single feature dominates the distance calculations and the model treats them fairly.
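
To see the imbalance concretely, here is a tiny sketch with made-up house features in the form [number of rooms, area in square feet]; the numbers are purely illustrative.

import numpy as np

house_a = np.array([3, 2000])
house_b = np.array([4, 2010])   # one more room, only 10 sq ft larger
house_c = np.array([3, 2100])   # same rooms, 100 sq ft larger

# Euclidean distances: the room count barely registers next to the area
print(np.linalg.norm(house_a - house_b))  # about 10.05
print(np.linalg.norm(house_a - house_c))  # 100.0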

Normalization vs. Standardization

Normalization, also called Min-Max scaling, transforms data to a fixed range, usually [0, 1], using the formula (x - min) / (max - min). It's useful when the data doesn't follow a normal distribution. For example, if you have pixel values in an image dataset, normalization ensures all values fall between 0 and 1. Standardization, on the other hand, rescales data to have a mean of 0 and a standard deviation of 1 by computing (x - mean) / std. It works well for data that is roughly normally distributed, like height or weight.

Here’s how you can implement both in Python:

from sklearn.preprocessing import MinMaxScaler, StandardScaler  
import numpy as np  

# Example data  
data = np.array([[1, 2], [2, 3], [3, 4]])  

# Normalization  
scaler = MinMaxScaler()  
normalized_data = scaler.fit_transform(data)  
print("Normalized Data:\n", normalized_data)  

# Standardization  
scaler = StandardScaler()  
standardized_data = scaler.fit_transform(data)  
print("Standardized Data:\n", standardized_data)  

When to Use Normalization

Normalization works best when the data has varying scales and doesn't follow a normal distribution. For instance, in image processing, pixel values range from 0 to 255. Normalizing them to [0, 1] makes it easier for models to learn patterns. However, normalization is sensitive to outliers: a single extreme value sets the minimum or maximum and squeezes every other value into a narrow band, as the short example below shows.
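
A quick, made-up example shows the effect: with a single outlier of 100 in an otherwise small feature, Min-Max scaling crowds the ordinary values into a tiny slice of the [0, 1] range.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
print(MinMaxScaler().fit_transform(values).ravel())
# [0.         0.01010101 0.02020202 0.03030303 1.        ]
# The values 1-4 now sit between 0 and 0.03, pushed together by the outlier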

When to Use Standardization

Standardization is better suited for data that roughly follows a normal distribution. Because the result isn't forced into a fixed range, it's less distorted by outliers than Min-Max scaling, making it a safer default for many real-world datasets. For example, in a dataset with features like age and income, standardization ensures both features contribute equally to the model.
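
As a quick check with made-up income values, StandardScaler centers the feature at 0 with a spread of 1 rather than forcing it into a fixed range.

import numpy as np
from sklearn.preprocessing import StandardScaler

incomes = np.array([[32000.0], [45000.0], [58000.0], [61000.0], [120000.0]])
scaled = StandardScaler().fit_transform(incomes)
print(scaled.ravel())
# Mean is 0 (up to floating-point error) and standard deviation is 1
print("mean:", scaled.mean().round(6), "std:", scaled.std().round(6))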

Effects of Scaling on Model Performance

Scaling has a huge impact on distance-based algorithms. For example, in KNN, the algorithm calculates the distance between data points. If one feature has a larger scale, it will dominate the distance calculation, leading to poor results. Scaling ensures all features contribute equally, improving model accuracy.
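
Here is a minimal sketch of that effect, using Scikit-learn's built-in wine dataset (whose features sit on very different scales) as a stand-in; the exact accuracy numbers will vary by dataset, but scaling typically gives KNN a clear boost here.

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# KNN on raw features: the large-valued features dominate the distances
knn_raw = KNeighborsClassifier(n_neighbors=5)
print("Accuracy without scaling:", cross_val_score(knn_raw, X, y, cv=5).mean())

# KNN with standardization applied inside a pipeline
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print("Accuracy with scaling:   ", cross_val_score(knn_scaled, X, y, cv=5).mean())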

Steps to Implement Feature Scaling

  1. Understand Your Data: Check the distribution of your data. Use histograms or density plots to see if it follows a normal distribution.

  2. Choose the Right Method: Use normalization for non-normal data and standardization for normal data.

  3. Apply Scaling: Use a library like Scikit-learn to scale your data, as shown in the sketch after this list.

  4. Train Your Model: Compare model performance with and without scaling to see the difference.

  5. Evaluate Results: Check metrics like accuracy or mean squared error to measure the impact of scaling.
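
For step 3, one detail is easy to miss: fit the scaler on the training data only, then reuse those same statistics to transform the test data, so information from the test set never leaks into training. Below is a minimal sketch, again using the wine dataset purely as a stand-in.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training split only
X_test_scaled = scaler.transform(X_test)        # apply the same transformation to the test split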

Here’s a simple example of feature scaling using Scikit-learn, covering both normalization and standardization.

Let’s assume we have a small dataset with two features.

Feature 1: [1, 2, 3, 4, 5]
Feature 2: [100, 200, 300, 400, 500]

Notice how Feature 2 has much larger values compared to Feature 1. Scaling will bring both features to a similar range.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np
# Create the Dataset
data = np.array([
    [1, 100],
    [2, 200],
    [3, 300],
    [4, 400],
    [5, 500]
])

Apply Normalization (Min-Max Scaling)

Normalization scales the data to a fixed range, typically [0, 1].

# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the data
normalized_data = scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)

Output

Normalized Data:
 [[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [0.75 0.75]
 [1.   1.  ]]

Feature 1 and Feature 2 are now scaled between 0 and 1. For example, the value 100 in Feature 2 becomes 0, and 500 becomes 1.
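
You can verify any entry with the Min-Max formula (x - min) / (max - min): for the middle row of Feature 2, (300 - 100) / (500 - 100) = 0.5, which matches the output above.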

Apply Standardization

Standardization transforms the data to have a mean of 0 and a standard deviation of 1.

# Initialize the StandardScaler
scaler = StandardScaler()
# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print("Standardized Data:\n", standardized_data)

Output

Standardized Data:
 [[-1.41421356 -1.41421356]
 [-0.70710678 -0.70710678]
 [ 0.          0.        ]
 [ 0.70710678  0.70710678]
 [ 1.41421356  1.41421356]]

Both features now have a mean of 0 and a standard deviation of 1. For example, the value 100 in Feature 2 is transformed to about -1.41, and 500 becomes about 1.41.
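
You can verify this by hand as well: StandardScaler uses the population standard deviation, so for Feature 1 the mean is 3 and the standard deviation is about 1.414 (the square root of 2), and the first value becomes (1 - 3) / 1.414, roughly -1.41, matching the output above.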

This simple example shows how scaling can transform your data to make it more suitable for machine learning models. Try it out with your own datasets!

Conclusion

Feature scaling is a crucial step in data preprocessing, especially for distance-based algorithms. Normalization and standardization are two powerful techniques that ensure all features contribute equally to model performance. By understanding when and how to use each method, you can build more accurate and reliable models.

In the next lesson, we’ll dive into feature selection and dimensionality reduction techniques like PCA and LDA, which help reduce the number of features while retaining important information. Don’t miss it!
