
Master Decision Trees and Random Forests with Scikit-Learn

In the previous lesson, we explored Logistic Regression, a powerful tool for binary classification tasks like spam detection. We learned how to train a model to predict whether an email is spam or not, using features like word frequency and email structure. Logistic Regression is simple yet effective, but it has its limits, especially when dealing with complex, non-linear data. This brings us to Decision Trees and Random Forests, which are more flexible and robust for such scenarios.

Use-Case: Predicting House Prices

I recently worked on a project where I had to predict house prices based on features like location, size, and number of rooms. At first, I tried Linear Regression, but the model struggled to capture the non-linear relationships in the data. That’s when I turned to Decision Trees, which split the data into smaller subsets based on feature values, making them well suited to complex patterns. Later, I improved the model’s accuracy with Random Forests, which combine multiple Decision Trees to reduce overfitting and enhance performance.

Step 1: Building a Decision Tree for Classification

A Decision Tree is a flowchart-like structure where each internal node represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label. To build a Decision Tree for classification, we use the DecisionTreeClassifier class from Scikit-Learn. Here’s an example using the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree (random_state makes the result reproducible)
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Evaluate the model
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

This code trains a Decision Tree to classify iris flowers into one of three species. The model achieves high accuracy, but it’s prone to overfitting, especially with complex datasets.
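To see exactly what the tree learned, and why it can overfit, you can print its decision rules with Scikit-Learn’s export_text helper and compare the full tree against a depth-capped one. This is a minimal sketch continuing from the code above; max_depth=3 is an illustrative choice, not a tuned value:

from sklearn.tree import export_text

# Print the learned decision rules, labeled with the Iris feature names
print(export_text(tree, feature_names=iris.feature_names))

# Capping the depth is the simplest pruning knob: the tree trades a
# little training accuracy for better generalization
shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow.fit(X_train, y_train)
print(f"Shallow tree accuracy: {shallow.score(X_test, y_test):.2f}")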

Step 2: Building a Decision Tree for Regression

Decision Trees can also be used for regression tasks. For example, let’s predict house prices using the DecisionTreeRegressor class:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree (random_state makes the result reproducible)
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Evaluate the model
score = tree.score(X_test, y_test)
print(f"R^2 Score: {score:.2f}")

While the model performs well on the training data, it often overfits, meaning it captures noise instead of the underlying pattern. This is where Random Forests come in.
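You can see the overfitting directly by comparing the model’s score on the training set with its score on the test set; a large gap between the two is the telltale sign. A quick check continuing from the code above (max_depth=6 is an illustrative value, not a tuned one):

# An unpruned regression tree essentially memorizes the training data,
# so its training score is near perfect while the test score lags
print(f"Train R^2: {tree.score(X_train, y_train):.2f}")
print(f"Test R^2: {tree.score(X_test, y_test):.2f}")

# Limiting depth narrows the gap at the cost of some training fit
pruned = DecisionTreeRegressor(max_depth=6, random_state=42)
pruned.fit(X_train, y_train)
print(f"Pruned test R^2: {pruned.score(X_test, y_test):.2f}")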

Step 3: Introducing Random Forests

A Random Forest is an ensemble of Decision Trees that work together to improve performance. Each tree is trained on a random subset of the data, and the final prediction is made by averaging the predictions of all trees (for regression) or taking a majority vote (for classification). This approach reduces overfitting and increases accuracy.

Here’s how to build a Random Forest for classification. Note that the housing example above overwrote our train/test variables, so we first re-split the Iris data from Step 1:

from sklearn.ensemble import RandomForestClassifier

# Re-split the Iris data from Step 1 (the housing example in Step 2
# overwrote X_train and y_train)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create and train the Random Forest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Evaluate the model
accuracy = forest.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

For regression, use the RandomForestRegressor class, re-splitting the California housing data for the same reason:

from sklearn.ensemble import RandomForestRegressor

# Re-split the California housing data from Step 2
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)

# Create and train the Random Forest
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Evaluate the model
score = forest.score(X_test, y_test)
print(f"R^2 Score: {score:.2f}")

Random Forests are more robust and less prone to overfitting than single Decision Trees.
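A useful by-product of tree ensembles is a built-in ranking of which features drive the predictions, exposed through the feature_importances_ attribute. A short sketch using the regression forest from the block above; these are impurity-based importances, so treat them as a rough guide rather than ground truth:

# One importance value per feature; the values sum to 1.0
for name, importance in zip(housing.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")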

Step 4: Mitigating Overfitting with Random Forests

Overfitting occurs when a model learns the training data too well, including its noise and outliers, which harms its performance on new data. Random Forests mitigate overfitting by introducing randomness in two ways:

  1. Each tree is trained on a random subset of the data (bootstrap sampling).

  2. Each split in a tree is made using a random subset of features.

This randomness ensures that no single tree dominates the model, leading to better generalization.
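Both sources of randomness map directly onto the RandomForestClassifier constructor. The sketch below is self-contained on the Iris data; turning on oob_score is an optional extra that evaluates each tree on the bootstrap rows it never saw, giving a free estimate of generalization without a separate validation set:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# bootstrap=True (the default) draws a random sample of rows, with
# replacement, for each tree; max_features="sqrt" (the classification
# default) makes each split consider only a random subset of features
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    max_features="sqrt",
    oob_score=True,
    random_state=42,
)
forest.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {forest.oob_score_:.2f}")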

Conclusion

In this tutorial, we explored Decision Trees and Random Forests, two powerful tools for classification and regression tasks. Decision Trees are simple and interpretable but prone to overfitting. Random Forests address this issue by combining multiple trees, resulting in more accurate and robust models. If you’re working on a project with complex data, Random Forests are often a great choice.

In the next lesson, we’ll dive into Support Vector Machines (SVM), another versatile algorithm for both classification and regression. Stay tuned to learn how SVMs can help you tackle even more challenging problems!
