Phishing Detection - Dissertation

System Overview
This project implements a multi-agent email analysis and phishing detection system, using LangGraph as the orchestration framework. The architecture is modular: specialized agents each handle one stage of the pipeline, covering content classification, data extraction, user interaction, and email management.


Supervisor Structure
The core of the system is defined in the supervisor_structure.py file. It contains the full workflow logic and integrates the following agents:


Content Analysis Bot

Uses an ensemble of two fine-tuned models to classify email content. The models are listed below (a free Kaggle account is needed to access them):

Phish3 (LLaMA 3.2 3B, QLoRA)
BERTPhish (BERT Fine-Tuned)


Extractor Bot

Parses and extracts relevant metadata and contextual signals from incoming emails.


ChatSupervisor

Manages user-facing conversational flow, including clarifications and guided actions based on analysis results.


Supervisor Agent

Coordinates all sub-agents and maintains the global logic of the system through LangGraph.
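
As a rough illustration of this coordination, the sketch below routes a shared state through sub-agents in plain Python. It deliberately does not use the LangGraph API, and it is not taken from supervisor_structure.py: the state keys, the agent functions, and the routing rule are all hypothetical stand-ins, and the "analysis" step is a trivial placeholder rather than a model call.

```python
from typing import Callable, Dict

State = Dict[str, object]


def extractor_bot(state: State) -> State:
    # Hypothetical signal extraction: only checks whether the text has a link.
    text = str(state["email"])
    return {**state, "has_link": "http" in text.lower()}


def content_analysis_bot(state: State) -> State:
    # Placeholder for the model ensemble; no real classifier is called here.
    label = "phishing" if state.get("has_link") else "legitimate"
    return {**state, "label": label}


def chat_supervisor(state: State) -> State:
    # Turns the analysis result into a user-facing message.
    return {**state, "reply": f"This email looks {state['label']}."}


AGENTS: Dict[str, Callable[[State], State]] = {
    "extractor": extractor_bot,
    "content_analysis": content_analysis_bot,
    "chat": chat_supervisor,
}


def supervisor(state: State) -> str:
    """Pick the next agent name, or 'END' when nothing is left to do."""
    if "has_link" not in state:
        return "extractor"
    if "label" not in state:
        return "content_analysis"
    if "reply" not in state:
        return "chat"
    return "END"


def run(email: str) -> State:
    # The supervisor is consulted after every step, as in a routed graph.
    state: State = {"email": email}
    while (step := supervisor(state)) != "END":
        state = AGENTS[step](state)
    return state
```

Calling run("...") returns the final state dictionary. In a LangGraph implementation, the supervisor function would typically correspond to a routing node or conditional edge, and each agent to a graph node.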


Additional Components


Outlook Analysis Bot

A separate agent script that handles analysis tasks specific to Microsoft Outlook emails. It is implemented as a standalone module and imported into the supervisor flow.


Frontend Pages

Front-end interfaces enabling users to interact with the system visually. These may include email viewing, classification results, and actionable prompts.


Backend Server (backend.py)

Acts as the core API and runtime environment for the system. Key responsibilities include:

Handling user queries and routing them through the LangGraph supervisor flow
Communicating with the Microsoft Graph API to retrieve user emails
Executing delete actions on user emails when requested
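
The two Microsoft Graph operations above could be expressed roughly as follows. This is a minimal sketch rather than the contents of backend.py: it only builds the HTTP requests against the Graph v1.0 /me/messages endpoints and does not send them. The helper names are hypothetical, and how the backend obtains the access token is not described here, so the token is simply passed in.

```python
from urllib.parse import quote
from urllib.request import Request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def list_messages_request(access_token: str, top: int = 10) -> Request:
    """Build a GET request for the signed-in user's most recent messages."""
    url = f"{GRAPH_BASE}/me/messages?$top={top}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="GET")


def delete_message_request(access_token: str, message_id: str) -> Request:
    """Build a DELETE request for one message, identified by its Graph id."""
    # Percent-encode the id so it is safe to place in the URL path.
    url = f"{GRAPH_BASE}/me/messages/{quote(message_id, safe='')}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="DELETE")
```

Either request could then be sent with urllib.request.urlopen or an equivalent HTTP client. Keeping request construction separate from sending makes the delete path easy to unit-test without touching a real mailbox.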


System Usage
Users can interact with the system either by sending text-based prompts or by uploading files for automated analysis. The system supports the following file types:

.json
.txt
.eml
.msg

Uploaded files are parsed and routed through the appropriate agent flow for content extraction, classification, and user feedback.
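
One way to dispatch on the supported file types is sketched below using only the Python standard library. The function name and the JSON schema (an object with a "body" field) are assumptions made for illustration. The .msg branch is left unimplemented because Outlook's binary format needs a third-party parser, and which parser the project uses is not stated.

```python
import json
from email import message_from_bytes, policy
from pathlib import Path


def load_email_text(filename: str, data: bytes) -> str:
    """Return the text to analyse from an uploaded file, chosen by extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".txt":
        return data.decode("utf-8", errors="replace")
    if suffix == ".json":
        # Assumed schema: a JSON object with a "body" field.
        payload = json.loads(data)
        return str(payload["body"])
    if suffix == ".eml":
        msg = message_from_bytes(data, policy=policy.default)
        # Prefer the plain-text part, falling back to HTML if needed.
        body = msg.get_body(preferencelist=("plain", "html"))
        content = body.get_content() if body is not None else ""
        return f"Subject: {msg['subject']}\n\n{content}"
    if suffix == ".msg":
        raise NotImplementedError("Outlook .msg files need a third-party parser")
    raise ValueError(f"unsupported file type: {suffix!r}")
```

The returned text would then be handed to the extraction and classification agents.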
Within the Content Analysis Bot, classification relies on an ensemble of two fine-tuned models:


Phish3: A fine-tuned LLaMA 3.2 3B model using QLoRA, optimized for phishing email classification.

BERTPhish: A BERT-based phishing classifier trained on the same dataset.

The two models are used in combination to improve accuracy; both were fine-tuned on a diverse, balanced corpus of phishing and legitimate emails. The system then passes results and recommendations to downstream agents for display, user interaction, or follow-up action.
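
How the two model outputs are combined is not specified above. One common option is weighted soft voting, sketched here under that assumption: each model is taken to emit a phishing probability, and the two are averaged. The function names, the equal default weighting, and the 0.5 threshold are illustrative choices, not the project's actual configuration.

```python
def ensemble_phishing_score(p_phish3: float, p_bertphish: float,
                            weight_phish3: float = 0.5) -> float:
    """Weighted average of the two models' phishing probabilities."""
    for value in (p_phish3, p_bertphish, weight_phish3):
        if not 0.0 <= value <= 1.0:
            raise ValueError("probabilities and weight must lie in [0, 1]")
    return weight_phish3 * p_phish3 + (1.0 - weight_phish3) * p_bertphish


def classify(p_phish3: float, p_bertphish: float,
             threshold: float = 0.5) -> str:
    """Label an email from the combined score (threshold is an assumption)."""
    score = ensemble_phishing_score(p_phish3, p_bertphish)
    return "phishing" if score >= threshold else "legitimate"
```

A learned combiner (for example, a small classifier trained on both models' outputs) is another plausible design; the averaging above is only the simplest instance of the idea.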


Dataset Overview
The fine-tuning dataset was carefully curated to ensure broad coverage, balance, and real-world diversity. It contains an equal number of phishing and legitimate emails to prevent class imbalance, and sources were selected to reflect a wide range of phishing tactics and legitimate email structures.
To prevent overfitting and evaluation bias, the dataset used for fine-tuning was kept entirely separate from the dataset used for ensemble model training. Because the two sets do not overlap, the fine-tuned models cannot have “memorized” the emails seen at the ensemble stage, so measured performance is more realistic.

Data Sources


Phishing Honeypot

A publicly available GitHub repository that aggregates phishing emails submitted by users and institutions for academic study. It provides a reliable stream of real-world phishing attempts used in the wild.


Phish Bowls

Collections of phishing emails published by universities (e.g., Cornell University), often based on actual threats received by their campus communities. These are particularly valuable for capturing up-to-date tactics.


CEAS 2008 Dataset

A well-known dataset originating from the Conference on Email and Anti-Spam (CEAS), a conference sponsored by companies such as Microsoft, Google, and IBM. Although the dataset is older, its volume and structure provide a useful baseline when paired with more recent sources.


Nahmias et al. (2024) Dataset

From the study “Prompted Contextual Vectors for Spear Phishing Detection”, this dataset includes both real phishing emails and ones generated by large language models (LLMs). The inclusion of synthetic data helps improve the model’s generalization to novel attacks.


ENRON Email Dataset

A comprehensive and widely used dataset of legitimate corporate emails from the Enron Corporation. These emails serve as high-quality, realistic examples of non-malicious correspondence, forming the legitimate half of the dataset.


All datasets were preprocessed to remove duplicates, standardize formatting, and anonymize any sensitive content.
Phishing and legitimate emails were evenly sampled to maintain class balance throughout training.
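
The deduplication and balanced-sampling steps might look roughly like the sketch below. It assumes emails are plain strings and treats two emails as duplicates when they match after lowercasing and whitespace normalization. That rule, the function names, and the fixed seed are illustrative assumptions, and anonymization is not covered.

```python
import random
from typing import List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())


def deduplicate(emails: List[str]) -> List[str]:
    """Keep the first occurrence of each email, compared after normalization."""
    seen = set()
    unique = []
    for email in emails:
        key = normalize(email)
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique


def balanced_sample(phishing: List[str], legitimate: List[str],
                    seed: int = 0) -> List[Tuple[str, int]]:
    """Return shuffled (text, label) rows with equal class counts (1 = phishing)."""
    phishing, legitimate = deduplicate(phishing), deduplicate(legitimate)
    n = min(len(phishing), len(legitimate))
    rng = random.Random(seed)
    rows = [(text, 1) for text in rng.sample(phishing, n)]
    rows += [(text, 0) for text in rng.sample(legitimate, n)]
    rng.shuffle(rows)
    return rows
```

Downsampling the larger class to the size of the smaller one is the simplest way to get an exact 50/50 split; it trades away some data in exchange for balance.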


Summary
This agent system provides an end-to-end pipeline for analyzing, classifying, and managing phishing emails. With a scalable and modular design, it can be extended to support additional LLMs, logic layers, and enterprise-specific integrations.