
Master Decision Trees and Random Forests with Scikit-Learn

In the previous lesson, we explored Logistic Regression, a powerful tool for binary classification tasks like spam detection. We learned how to train a model to predict whether an email is spam or not, using features like word frequency and email structure. Logistic Regression is simple yet effective, but it has its limits, especially when dealing with complex, non-linear data. This brings us to Decision Trees and Random Forests, which are more flexible and robust for such scenarios.

Use-Case: Predicting House Prices

I recently worked on a project where I had to predict house prices based on features like location, size, and number of rooms. At first, I tried Linear Regression, but the model struggled to capture the non-linear relationships in the data. That’s when I turned to Decision Trees, which split the data into smaller subsets based on feature values, making them well suited to complex patterns. Later, I improved the model’s accuracy with Random Forests, which combine multiple Decision Trees to reduce overfitting and enhance performance.

Step 1: Building a Decision Tree for Classification

A Decision Tree is a flowchart-like structure where each internal node represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label. To build a Decision Tree for classification, we use the DecisionTreeClassifier class from Scikit-Learn. Here’s an example using the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree (random_state makes the result reproducible)
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Evaluate the model
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

This code trains a Decision Tree to classify iris flowers into one of three species. The model achieves high accuracy, but it’s prone to overfitting, especially with complex datasets.
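To see exactly what the tree learned, and why it can overfit, you can print its decision rules with Scikit-Learn’s export_text helper and compare the full tree against a depth-capped one. This is a minimal sketch continuing from the code above; max_depth=3 is an illustrative choice, not a tuned value:

from sklearn.tree import export_text

# Print the learned decision rules, labeled with the Iris feature names
print(export_text(tree, feature_names=iris.feature_names))

# Capping the depth is the simplest pruning knob: the tree trades a
# little training accuracy for better generalization
shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow.fit(X_train, y_train)
print(f"Shallow tree accuracy: {shallow.score(X_test, y_test):.2f}")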

Step 2: Building a Decision Tree for Regression

Decision Trees can also be used for regression tasks. For example, let’s predict house prices using the DecisionTreeRegressor class:

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Decision Tree (random_state makes the result reproducible)
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Evaluate the model
score = tree.score(X_test, y_test)
print(f"R^2 Score: {score:.2f}")

While the model performs well on the training data, it often overfits, meaning it captures noise instead of the underlying pattern. This is where Random Forests come in.
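You can see the overfitting directly by comparing the model’s score on the training set with its score on the test set; a large gap between the two is the telltale sign. A quick check continuing from the code above (max_depth=6 is an illustrative value, not a tuned one):

# An unpruned regression tree essentially memorizes the training data,
# so its training score is near perfect while the test score lags
print(f"Train R^2: {tree.score(X_train, y_train):.2f}")
print(f"Test R^2: {tree.score(X_test, y_test):.2f}")

# Limiting depth narrows the gap at the cost of some training fit
pruned = DecisionTreeRegressor(max_depth=6, random_state=42)
pruned.fit(X_train, y_train)
print(f"Pruned test R^2: {pruned.score(X_test, y_test):.2f}")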

Step 3: Introducing Random Forests

A Random Forest is an ensemble of Decision Trees that work together to improve performance. Each tree is trained on a random subset of the data, and the final prediction is made by averaging the predictions of all trees (for regression) or taking a majority vote (for classification). This approach reduces overfitting and increases accuracy.

Here’s how to build a Random Forest for classification. Note that the housing example above overwrote our train/test variables, so we first re-split the Iris data from Step 1:

from sklearn.ensemble import RandomForestClassifier

# Re-split the Iris data from Step 1 (the housing example in Step 2
# overwrote X_train and y_train)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create and train the Random Forest
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Evaluate the model
accuracy = forest.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

For regression, use the RandomForestRegressor class, re-splitting the California housing data for the same reason:

from sklearn.ensemble import RandomForestRegressor

# Re-split the California housing data from Step 2
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)

# Create and train the Random Forest
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Evaluate the model
score = forest.score(X_test, y_test)
print(f"R^2 Score: {score:.2f}")

Random Forests are more robust and less prone to overfitting than single Decision Trees.
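A useful by-product of tree ensembles is a built-in ranking of which features drive the predictions, exposed through the feature_importances_ attribute. A short sketch using the regression forest from the block above; these are impurity-based importances, so treat them as a rough guide rather than ground truth:

# One importance value per feature; the values sum to 1.0
for name, importance in zip(housing.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")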

Step 4: Mitigating Overfitting with Random Forests

Overfitting occurs when a model learns the training data too well, including its noise and outliers, which harms its performance on new data. Random Forests mitigate overfitting by introducing randomness in two ways:

  1. Each tree is trained on a random subset of the data (bootstrap sampling).

  2. Each split in a tree is made using a random subset of features.

This randomness ensures that no single tree dominates the model, leading to better generalization.
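Both sources of randomness map directly onto the RandomForestClassifier constructor. The sketch below is self-contained on the Iris data; turning on oob_score is an optional extra that evaluates each tree on the bootstrap rows it never saw, giving a free estimate of generalization without a separate validation set:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# bootstrap=True (the default) draws a random sample of rows, with
# replacement, for each tree; max_features="sqrt" (the classification
# default) makes each split consider only a random subset of features
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    max_features="sqrt",
    oob_score=True,
    random_state=42,
)
forest.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {forest.oob_score_:.2f}")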

Conclusion

In this tutorial, we explored Decision Trees and Random Forests, two powerful tools for classification and regression tasks. Decision Trees are simple and interpretable but prone to overfitting. Random Forests address this issue by combining multiple trees, resulting in more accurate and robust models. If you’re working on a project with complex data, Random Forests are often a great choice.

In the next lesson, we’ll dive into Support Vector Machines (SVM), another versatile algorithm for both classification and regression. Stay tuned to learn how SVMs can help you tackle even more challenging problems!
