Logistic Regression for Spam Detection with Scikit-Learn
In the last lesson, we explored linear regression, which helps predict continuous values like house prices. Now, we'll shift gears to logistic regression, a method used for binary classification tasks. Unlike linear regression, which predicts numbers, logistic regression predicts probabilities that fall between 0 and 1. This makes it perfect for problems like spam detection, where we need to classify emails as either "spam" or "not spam."
I recently worked on a project where I had to build a spam filter for an email service. The goal was to classify thousands of emails quickly and accurately. Using logistic regression, I was able to create a model that could predict whether an email was spam with high confidence. This practical experience taught me how powerful logistic regression can be for binary classification tasks.
In this tutorial, we’ll walk through the steps of building a spam detection model using Scikit-Learn. We’ll also learn how to evaluate its performance using metrics like accuracy, precision, recall, and F1 Score. By the end, you’ll have a solid understanding of logistic regression and how to apply it to real-world problems.
Difference Between Linear and Logistic Regression
Linear regression is used to predict continuous outcomes, like house prices or temperature. It works by fitting a straight line to the data, which minimizes the difference between predicted and actual values. However, when dealing with binary outcomes—like spam or not spam—a straight line isn’t the best fit.
Logistic regression, on the other hand, uses a sigmoid function to map predictions to probabilities. The sigmoid function ensures that the output is always between 0 and 1, which makes it ideal for classification tasks. For example, if the model predicts a probability of 0.8 for an email, it means there’s an 80% chance the email is spam.
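To make the sigmoid concrete, here's a minimal sketch of the function itself; it's only an illustration of how raw scores get mapped to probabilities, not part of the spam filter we build below:
import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

print(sigmoid(-3), sigmoid(0), sigmoid(3))  # roughly 0.05, 0.5, 0.95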
In my spam detection project, I initially tried using linear regression, but the results were poor. The model couldn’t handle the binary nature of the problem. Switching to logistic regression made a huge difference, as it was designed specifically for such tasks.
Applying Logistic Regression to Spam Detection
To build a spam detection model, we first need a dataset of labeled emails. Each email is represented by features like word frequency, presence of specific keywords, or email length. These features help the model learn patterns that distinguish spam from non-spam emails.
Here’s how I approached the problem:
- Data Preparation: I cleaned the data by removing stop words and punctuation and converting all text to lowercase (see the preprocessing sketch after this list).
- Feature Extraction: I used TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text into numerical features.
- Model Training: I split the data into training and testing sets, then trained a logistic regression model with Scikit-Learn.
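Before vectorizing, I ran a simple cleaning pass over the raw text. Here's a rough sketch of what that step can look like, using plain Python string handling and Scikit-Learn's built-in English stop-word list; your own cleaning rules may differ:
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_email(text):
    # Lowercase, strip punctuation, and drop common English stop words
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in ENGLISH_STOP_WORDS)

print(clean_email("WIN a FREE prize!!!"))  # prints: win free prize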
Here’s a code snippet for training the model:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Sample data (a real filter would train on thousands of labeled emails)
emails = [
    "win a free prize", "meeting at 3 pm", "claim your reward",
    "project update attached", "urgent claim your bonus now", "lunch tomorrow?",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 for spam, 0 for not spam

# Feature extraction: TF-IDF turns each email into a sparse numeric vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

# Train-test split; stratify keeps both classes in the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
Evaluating Model Performance
Once the model is trained, we need to evaluate its performance. For binary classification, metrics like accuracy, precision, recall, and F1 Score are commonly used.
- Accuracy: Measures the percentage of correct predictions.
- Precision: Tells us how many of the emails predicted as spam were actually spam.
- Recall: Indicates how many of the actual spam emails were correctly identified.
- F1 Score: Combines precision and recall into a single metric (their harmonic mean); the sketch after this list shows how all four metrics follow from the raw prediction counts.
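All four metrics come from the same four counts: true positives, false positives, true negatives, and false negatives. Here's a small sketch, using made-up labels and predictions, that computes them by hand so the definitions above are concrete:
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, purely to show the arithmetic
y_true = [1, 0, 1, 1, 0, 1]
y_hat = [1, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)  # of the emails flagged as spam, how many really were
recall = tp / (tp + fn)     # of the real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 1.0, 0.75, about 0.86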
In my project, the model achieved an accuracy of 95%, but the recall was lower at 85%. This meant that some spam emails were slipping through the filter. To improve this, I fine-tuned the model by adjusting the threshold for classification.
Here’s how to calculate these metrics in Scikit-Learn:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Predictions
y_pred = model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
Conclusion
In this tutorial, we learned how logistic regression works and how to apply it to a spam detection problem. We covered the differences between linear and logistic regression, the steps to build and train a model, and how to evaluate its performance using key metrics.
Logistic regression is a powerful tool for binary classification, but it’s just the beginning. In the next lesson, we’ll dive into decision trees and random forests, which offer even more flexibility and accuracy for complex datasets. If you’re ready to take your machine learning skills to the next level, don’t miss the next tutorial!