Building an Email Spam Detection Model – Supervised Learning – Session 7

Introduction

In this guide, we’ll walk you through the process of building a supervised learning project to detect spam emails using the Naive Bayes algorithm. We’ll cover setting up the project, loading and preprocessing data, training the model, and evaluating it. By the end of this tutorial, you’ll have a fully functioning spam detection model.

Step 1: Setting Up the Environment

1.1 Create a Virtual Environment

To keep your project dependencies organized, it’s a good idea to set up a virtual environment. This ensures that your project’s libraries don’t conflict with others on your system.

Run the following commands in your terminal or command prompt:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (for Windows)
./venv/Scripts/activate

# For Linux/Mac, use:
# source ./venv/bin/activate

# Upgrade pip
python.exe -m pip install --upgrade pip

1.2 Install Required Libraries

Once your virtual environment is activated, install the necessary libraries (the --cache-dir flag below is optional; it simply points pip's download cache at a local folder, so adjust the path or drop the flag on your machine):

# Install required libraries
pip install pandas scikit-learn numpy nltk --cache-dir "D:/internship/supervised_learning/email_spam_detection/.cache"

Step 2: Creating the Dataset from Email Files

If your dataset is made up of separate files for spam and ham emails, you’ll need to consolidate them into a CSV file.

SpamAssassin Public Corpus Dataset download

  • Description: A well-known dataset containing spam and ham emails, categorized into different folders.

  • Link: SpamAssassin Public Corpus

  • How to use: You can download individual archives of spam and ham emails, then extract and use them for your project.
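
For example, after downloading a couple of archives you can extract them with Python's standard tarfile module. This is only a minimal sketch: the archive names below are placeholders for whichever spam and ham archives you actually downloaded, and the target folder matches the datasets/ layout used later in this guide.

import tarfile

# Hypothetical archive names -- substitute the files you actually downloaded
archives = ['20030228_spam.tar.bz2', '20030228_easy_ham.tar.bz2']

for archive in archives:
    with tarfile.open(archive, 'r:bz2') as tar:
        tar.extractall('datasets')  # typically creates datasets/spam and datasets/easy_ham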

A dataset is a collection of data that is typically organized in a structured format and used for analysis, training machine learning models, or solving specific problems. Datasets can come in various formats such as CSV (comma-separated values), Excel files, databases, text files, or even collections of images or sounds.

Key Features of a Dataset:

  1. Data Instances (Rows):

    • These are individual entries or records in the dataset. For example, in a dataset of emails, each email is a single instance.
  2. Attributes or Features (Columns):

    • These represent the characteristics of the data. For instance, in an email spam detection dataset, you might have columns like email (the content of the email) and label (whether it’s spam or not).
  3. Labels or Target Variable:

    • This is the outcome you are trying to predict. In supervised machine learning, the label is the actual result or output, such as spam or not spam in a spam detection task.
  4. Types of Data in a Dataset:

    • Numerical Data: Data that consists of numbers (e.g., age, price).

    • Categorical Data: Data that falls into distinct categories (e.g., spam/not spam, color).

    • Text Data: Free-form text, such as email bodies or document contents.

    • Images/Sounds: Sometimes datasets consist of non-textual data like images, sounds, or videos.

Example of a Dataset (Email Spam Detection):

Email Content                                        | Label (Spam or Not Spam)
“Congratulations! You’ve won a free iPhone.”         | Spam
“Meeting at 3 PM. Please review the report.”         | Not Spam
“Get cheap loans now with low interest rates!”       | Spam
“Your Amazon order has been shipped.”                | Not Spam

  • Email Content is the feature (input).

  • Label is the target variable or output that you want the model to predict.

Types of Datasets in Machine Learning:

  1. Training Dataset:

    • The dataset used to train a machine learning model. It contains both features (input) and labels (output).
  2. Test Dataset:

    • A separate dataset used to evaluate the performance of the trained model. It helps determine how well the model generalizes to unseen data.
  3. Validation Dataset:

    • Sometimes used to fine-tune models during training, ensuring that the model doesn’t overfit to the training data.
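
If you also want a validation set alongside the usual train/test split, one common pattern is to call train_test_split twice. This is only a hedged sketch (it is not used in the rest of this guide), assuming X and y are the TF-IDF feature matrix and labels built in Steps 4 and 5:

from sklearn.model_selection import train_test_split

# Hypothetical three-way split: roughly 60% train, 20% validation, 20% test
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)  # 0.25 of the remaining 80% = 20% overall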

In the Context of Your Project:

For your email spam detection project, the dataset would typically consist of a collection of emails (the feature) along with labels (spam or not spam), which the model will use to learn patterns associated with spam emails.

In this case, a dataset might look like:

Email Content                                        | Label
“Win $1000 now by clicking this link!”               | Spam
“Reminder for the meeting tomorrow at 10 AM.”        | Not Spam
“Hurry! Last chance to get 50% off on all items.”    | Spam

You will use this data to train a machine learning model to classify new emails as spam or not spam based on their content.

Difference between spam and ham

The difference between spam and ham lies in their classification as types of email:

Spam:

  • Definition: Spam refers to unwanted, unsolicited emails sent in bulk, often for advertising, phishing, or malicious purposes.

  • Content: Spam emails typically include promotions for products or services, deceptive offers, requests for personal information (phishing), or malicious attachments or links.

  • Purpose: The goal of spam is usually to persuade recipients to take an action, such as clicking a link, downloading malware, or buying a product. Many spam emails are sent out in bulk to a large number of recipients, often without their consent.

  • Examples:

    • “Congratulations! You’ve won a prize! Click here to claim it.”

    • “Get a 90% discount on all products now!”

    • “Urgent: Verify your account information to avoid closure.”

Ham:

  • Definition: Ham refers to legitimate, wanted emails that are not spam. These are emails that you expect or have requested, and they are important for personal or professional communication.

  • Content: Ham emails are typically from people or organizations with whom you have a relationship, and the content is relevant to you. They can be personal messages, business emails, newsletters you’ve subscribed to, or any emails that aren’t spam.

  • Purpose: Ham emails serve genuine communication purposes such as business correspondence, notifications, transactional messages, or personal conversations.

  • Examples:

    • “Reminder: Meeting at 3 PM tomorrow.”

    • “Your Amazon order has been shipped.”

    • “Family reunion this Saturday. Please RSVP.”

Key Differences Between Spam and Ham:

Feature      | Spam                                           | Ham
Solicitation | Unsolicited, sent without recipient’s consent  | Expected or requested by the recipient
Content Type | Advertisements, phishing, malware, scams       | Personal or professional communication, newsletters
Frequency    | Often sent in bulk to many recipients          | Typically sent to specific individuals or groups
Purpose      | To promote, deceive, or spread malware         | To communicate genuinely or provide relevant information
Legitimacy   | Usually illegal or against service terms       | Legitimate emails from known or trusted sources

In the Context of Spam Detection:

  • Spam: Your model will learn to identify patterns and keywords often associated with spam, like “Congratulations,” “Win,” “Click here,” etc.

  • Ham: The model will recognize normal, useful emails that are important and legitimate, like personal or work-related communication.

2.1 Load Email Files and Create a Dataset

Use the following script to read all email files from their respective directories and store them in a DataFrame:

A DataFrame is a two-dimensional, tabular data structure used primarily in the pandas library in Python. It is similar to a table in a database, an Excel spreadsheet, or a CSV file, with rows and columns. Each column in a DataFrame can have a different data type (e.g., integers, floats, strings, etc.), and the rows represent individual records or observations.
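
For instance, a toy DataFrame with the same two columns we build below might look like this (the values are made up purely for illustration):

import pandas as pd

# A tiny, made-up DataFrame just to show the row/column structure
toy = pd.DataFrame({
    'email': ['Win a prize now!', 'Lunch at noon?'],
    'label': [1, 0],
})
print(toy)  # prints two rows (records) and two columns ('email' and 'label')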

NLTK (Natural Language Toolkit) is a Python library used for working with human language, like analyzing and processing text. It helps you break down text into smaller parts (words or sentences), clean it up, and make sense of it for tasks like identifying the meaning of words, classifying text as positive or negative, or figuring out if an email is spam.

Stopwords are common words in a language that carry little meaningful information on their own, such as “the,” “is,” “in,” “on,” etc. These words are often removed during text preprocessing in natural language processing (NLP) tasks to focus on the more relevant words.

import os
import pandas as pd

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Save the DataFrame to a CSV file
df.to_csv('spam_ham_dataset.csv', index=False)

print(f'Dataset saved with {len(df)} emails.')

This script reads all spam and ham email files from their respective directories, creates a DataFrame, and saves it to a CSV file named spam_ham_dataset.csv.

Step 3: Loading and Preprocessing the Data

3.1 Load the CSV Dataset

Once you have your dataset saved as a CSV file, load it into your Python environment using pandas:

import pandas as pd

# Load dataset
df = pd.read_csv('spam_ham_dataset.csv')

# Display the first few rows of the dataset
print(df.head())

3.2 Preprocessing the Emails

Before feeding the email text into a machine learning model, we need to clean and preprocess it. The following functions will help:

  • Convert to lowercase: To make the text case-insensitive.

  • Remove punctuation: Stripping punctuation keeps the vocabulary simple, so tokens like “free!” and “free” are treated as the same word.

  • Remove stopwords: Common words like “the”, “is”, “and” that do not add value.

import string
import nltk
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing steps
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

Now, the cleaned_email column contains preprocessed emails that are ready to be vectorized.


Step 4: Vectorizing the Email Data

To convert text into a numerical format, we’ll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, which helps the model understand the importance of each word.

Vectorizing the text data using TF-IDF refers to the process of converting raw text (like emails, reviews, or any unstructured text) into numerical features that a machine learning model can understand and work with. Since machine learning algorithms cannot directly interpret text data, we need to transform it into a format that they can process, and TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most commonly used techniques for this purpose.
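
To build some intuition before applying it to the emails, here is a minimal, self-contained sketch on a three-document toy corpus (the documents are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'win a free prize',
    'free prize inside',
    'meeting notes attached',
]

toy_tfidf = TfidfVectorizer()
matrix = toy_tfidf.fit_transform(docs)

print(toy_tfidf.get_feature_names_out())  # the learned vocabulary
print(matrix.toarray().round(2))          # one row per document, one column per word

Notice that words shared across documents (like “free” and “prize”) receive lower TF-IDF weights than words that are distinctive to a single document (like “win”). With that intuition in place, the vectorization step for our email dataset is: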

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=3000)

# Fit and transform the cleaned email text
X = tfidf.fit_transform(df['cleaned_email']).toarray()

# Target variable (spam or ham labels)
y = df['label']

Step 5: Splitting the Data into Training and Test Sets

To evaluate the model’s performance, we split the dataset into training and testing sets.

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation of the Code:

train_test_split(X, y, test_size=0.2, random_state=42)

  • X: The feature matrix. In your case, it contains the numerical TF-IDF vectors representing the emails (input data).

  • y: The target variable. In your case, y contains the labels for each email (whether it’s spam or not spam).

    • 1 represents spam.

    • 0 represents not spam (ham).

  • test_size=0.2: This specifies the proportion of the dataset that should be set aside for testing. Here, 20% of the data will be used for testing, and the remaining 80% will be used for training the model.

  • random_state=42: This is a seed value that ensures the random splitting of the data is reproducible. By setting random_state, you ensure that every time you run the code, the same split between training and testing data will occur. You can set this to any integer value, but using the same value ensures consistent results when testing the model.

Output Variables:

  • X_train: The training portion of the feature matrix (80% of the data). This is used to train the machine learning model.

  • X_test: The testing portion of the feature matrix (20% of the data). This is used to evaluate how well the model performs on unseen data.

  • y_train: The training portion of the target variable (y). These are the corresponding labels (spam or not spam) for the training emails.

  • y_test: The testing portion of the target variable (y). These are the corresponding labels for the test emails, which are used to evaluate the model’s predictions.

Purpose of Splitting the Data:

  • Training Set (X_train, y_train): The model is trained using this data, which consists of known inputs (emails) and outputs (whether they are spam or not).

  • Test Set (X_test, y_test): After the model has been trained, it is tested on this set of data that the model hasn’t seen before. The test set helps evaluate how well the model can generalize to new, unseen emails.
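
A quick, optional sanity check is to print the shapes after splitting; with the 80/20 split above you should see roughly four times as many training rows as test rows:

# Optional sanity check on the split sizes
print(X_train.shape, X_test.shape)   # each row has up to 3,000 TF-IDF features
print(y_train.shape, y_test.shape)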

Step 6: Training the Naive Bayes Model

Now, let’s train the Naive Bayes model, which is often used for text classification due to its simplicity and effectiveness.

The Naive Bayes model is a family of probabilistic machine learning algorithms based on Bayes’ Theorem. It is particularly effective for classification tasks like spam detection, sentiment analysis, and text classification. Naive Bayes is called “naive” because it assumes that the features (e.g., words in an email) are independent of each other, which is often not the case in real life but still works surprisingly well in practice.

Types of Naive Bayes Models:

  1. Multinomial Naive Bayes: Used for discrete data like word counts. This is commonly used for text classification problems, such as spam detection, where the features are word frequencies or TF-IDF scores.

  2. Bernoulli Naive Bayes: Used when features are binary (e.g., whether a particular word appears or not). This is also used for text data, but instead of counting word frequencies, it checks whether a word is present or absent.

  3. Gaussian Naive Bayes: Used for continuous data, where features follow a normal (Gaussian) distribution.
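
To see the idea behind Multinomial Naive Bayes, here is a small hand-rolled sketch on made-up word counts. Every number below is invented purely for illustration; it simply shows how a class prior and per-word likelihoods (with Laplace smoothing) combine into a score for each class.

import math

# Toy training statistics (hypothetical): how often each word appears in each class
word_counts = {
    'spam': {'win': 4, 'free': 3, 'meeting': 0},
    'ham':  {'win': 0, 'free': 1, 'meeting': 5},
}
class_priors = {'spam': 0.4, 'ham': 0.6}  # fraction of training emails in each class
vocab = ['win', 'free', 'meeting']

def log_posterior(words, cls, alpha=1.0):
    # Unnormalized log P(class | words) with add-alpha (Laplace) smoothing
    total = sum(word_counts[cls].values())
    score = math.log(class_priors[cls])
    for w in words:
        p = (word_counts[cls].get(w, 0) + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

email = ['win', 'free']
scores = {c: log_posterior(email, c) for c in ('spam', 'ham')}
print(max(scores, key=scores.get))  # 'spam' for this toy example

In practice, scikit-learn's MultinomialNB handles the counting, smoothing, and log-probability arithmetic for us: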

from sklearn.naive_bayes import MultinomialNB

# Initialize and train the model
model = MultinomialNB()
model.fit(X_train, y_train)

Step 7: Evaluating the Model

Once the model is trained, you can evaluate its performance using various metrics, such as accuracy, confusion matrix, and classification report.

accuracy_score(y_test, y_pred) is used to calculate the accuracy of your machine learning model. The accuracy metric measures how well the model’s predictions match the actual labels for the test data.

What is Accuracy?

Accuracy is defined as the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in the dataset. In simpler terms, it tells you the percentage of predictions the model got correct.

confusion_matrix(y_test, y_pred) is used to compute a confusion matrix, which is a summary of the prediction results for a classification problem. It shows how well your machine learning model is performing by comparing the predicted labels (y_pred) with the actual labels (y_test).

What is a Confusion Matrix?

A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of:

  • True Positives (TP): Correctly predicted positives (e.g., correctly predicted spam emails).

  • True Negatives (TN): Correctly predicted negatives (e.g., correctly predicted non-spam emails).

  • False Positives (FP): Incorrectly predicted positives (e.g., predicting an email as spam when it is not).

  • False Negatives (FN): Incorrectly predicted negatives (e.g., predicting an email as not spam when it is actually spam).
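
To make these definitions concrete, here is a small, made-up example (the same counts that produce the example classification report shown later in this step), assuming 0 = ham and 1 = spam:

from sklearn.metrics import confusion_matrix, accuracy_score

# Tiny made-up example: 0 = ham, 1 = spam
y_true = [0, 0, 0, 1, 1]   # actual labels
y_pred = [0, 0, 0, 0, 1]   # model predictions

cm = confusion_matrix(y_true, y_pred)
# With labels ordered [0, 1], scikit-learn lays the matrix out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)                                   # [[3 0]
                                            #  [1 1]]
print((tp + tn) / (tp + tn + fp + fn))      # 0.8
print(accuracy_score(y_true, y_pred))       # 0.8, the same number

The same functions are applied below to our real test set: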

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on the test set
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

Example of a Classification Report

              precision    recall  f1-score   support

           0       0.75      1.00      0.86         3
           1       1.00      0.50      0.67         2

    accuracy                           0.80         5
   macro avg       0.88      0.75      0.76         5
weighted avg       0.85      0.80      0.78         5

  • Class 0 (Not Spam):

    • Precision = 0.75: Out of all emails predicted as “not spam,” 75% were actually not spam.

    • Recall = 1.00: The model correctly identified 100% of the “not spam” emails.

    • F1-Score = 0.86: This is the harmonic mean of precision and recall for “not spam” emails.

    • Support = 3: There are 3 “not spam” emails in the test set.

  • Class 1 (Spam):

    • Precision = 1.00: Out of all emails predicted as “spam,” 100% were actually spam.

    • Recall = 0.50: The model correctly identified 50% of the actual spam emails.

    • F1-Score = 0.67: This is the harmonic mean of precision and recall for “spam” emails.

    • Support = 2: There are 2 spam emails in the test set.

  • Overall Metrics:

    • Accuracy = 0.80: The model’s overall accuracy is 80% (i.e., it correctly classified 80% of the emails).

    • Macro Avg: This is the unweighted average of precision, recall, and F1-score across all classes.

    • Weighted Avg: This is the weighted average of precision, recall, and F1-score, where each class’s contribution is weighted by its support (i.e., the number of occurrences in the test set).
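
As a quick check of how these figures relate, the spam-class numbers above can be recomputed by hand from the counts in that toy test set (TP = 1, FP = 0, FN = 1):

# Recomputing the spam-class (class 1) scores from the example report above
tp, fp, fn = 1, 0, 1

precision = tp / (tp + fp)                                  # 1.00
recall    = tp / (tp + fn)                                  # 0.50
f1        = 2 * precision * recall / (precision + recall)   # 0.666..., reported as 0.67

print(precision, recall, round(f1, 2))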

Step 8: Testing the Model with a New Email

You can now test the model with a new email to see if it correctly classifies it as spam or not spam.

def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same preprocessing order used during training: lowercase, strip punctuation, drop stopwords
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, model, tfidf)
print(result)

Complete Code

import os
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Set up the environment (only run these commands in the terminal)
# python -m venv venv
# ./venv/Scripts/activate
# python.exe -m pip install --upgrade pip
# pip install pandas scikit-learn numpy nltk --cache-dir "D:/internship/supervised_learning/email_spam_detection/.cache"

# Step 2: Load and process email files, create the dataset

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam_detection/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Save the DataFrame to a CSV file
df.to_csv('spam_ham_dataset.csv', index=False)
print(f'Dataset saved with {len(df)} emails.')

# Step 3: Preprocess the emails
nltk.download('stopwords')

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

# Step 4: Vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['cleaned_email']).toarray()
y = df['label']

# Step 5: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: Evaluate the model
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Step 8: Test the model with a new email
def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same preprocessing order used during training: lowercase, strip punctuation, drop stopwords
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, model, tfidf)
print(result)

Read more -> Save a trained model for later use

Code with model saving and loading from disk

import os
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib

# Step 1: Load and process email files, create the dataset

# Define paths to the directories containing the spam and ham emails
spam_dir = 'D:/internship/supervised_learning/email_spam/datasets/spam'
ham_dir = 'D:/internship/supervised_learning/email_spam/datasets/easy_ham'

# Function to read all email files and store them in a list
def load_emails_from_directory(directory, label):
    emails = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='latin-1') as file:
            email_content = file.read()
            emails.append((email_content, label))  # Tuple (email_content, label)
    return emails

# Load spam and ham emails
spam_emails = load_emails_from_directory(spam_dir, 1)  # 1 for spam
ham_emails = load_emails_from_directory(ham_dir, 0)    # 0 for ham

# Combine spam and ham into a single list
all_emails = spam_emails + ham_emails

# Create a DataFrame with two columns: 'email' and 'label'
df = pd.DataFrame(all_emails, columns=['email', 'label'])

# Step 2: Preprocess the emails
nltk.download('stopwords')

# Define preprocessing functions
def to_lowercase(text):
    return text.lower()

def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply preprocessing
df['cleaned_email'] = df['email'].apply(lambda x: to_lowercase(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_punctuation(x))
df['cleaned_email'] = df['cleaned_email'].apply(lambda x: remove_stopwords(x))

# Step 3: Vectorize the text data using TF-IDF
tfidf = TfidfVectorizer(max_features=3000)
X = tfidf.fit_transform(df['cleaned_email']).toarray()
y = df['label']

# Step 4: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Step 7: Save the model and TF-IDF vectorizer
joblib.dump(model, 'spam_classifier_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
print("Model and vectorizer saved.")

# Step 8: Load the model and vectorizer from file
loaded_model = joblib.load('spam_classifier_model.pkl')
loaded_tfidf = joblib.load('tfidf_vectorizer.pkl')
print("Model and vectorizer loaded.")

# Step 9: Test the model with a new email
def check_spam(email_text, model, tfidf_vectorizer):
    # Apply the same preprocessing order used during training: lowercase, strip punctuation, drop stopwords
    processed_email = remove_stopwords(remove_punctuation(to_lowercase(email_text)))
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)
    return "SPAM" if prediction[0] == 1 else "NOT SPAM"

# Test the loaded model with a new email
new_email = "Congratulations! You've won a free iPhone. Click here to claim your prize."
result = check_spam(new_email, loaded_model, loaded_tfidf)
print(result)

Conclusion

Congratulations! You’ve successfully built a supervised learning model for email spam detection. Here’s a recap of what we’ve covered:

  • Set up the project environment.

  • Created a dataset from email files and saved it as CSV.

  • Preprocessed the email data by cleaning the text.

  • Converted text into numerical features using TF-IDF vectorization.

  • Trained a Naive Bayes model.

  • Evaluated the model’s performance and tested it with new email data.

This project can be expanded further by experimenting with different algorithms, tuning hyperparameters, or adding more advanced text preprocessing steps like stemming and lemmatization.
