Commit 65901ffd authored by b2-scannell

Final commit of the T3 notebook before handin

parent 6ee4bff1
%% Cell type:markdown id: tags:
<img src="http://www.cems.uwe.ac.uk/~pa-legg/images/uwe_banner.png">
# UFCFEL-15-3 Security Data Analytics and Visualisation
# Portfolio Assignment 3: Large-Scale Data Exploration for Insider Threat Detection (2022)
---
The completion of this worksheet is worth a **maximum of 45 marks** towards your portfolio assignment for the UFCFEL-15-3 Security Data Analytics and Visualisation (SDAV) module.
### Brief
---
In this task, you have been asked to investigate a potential security threat within an organisation. Building on your previous worksheet expertise, you will need to apply your skills and knowledge of data analytics and visualisation to examine and explore the datasets methodically to uncover which employee is acting as a threat and why. The company has provided you with activity logs covering various user interactions over the past 6 months, resulting in a large volume of data that they need your expertise to decipher. They want a report that details the investigation you have carried out, identifies the suspected individual, and gives a clear rationale as to why this suspect is flagged. You will need to document your investigation, giving clear justification for your process using Markdown annotation within your notebook. You will need to provide a clear rationale for why you suspect a given individual to be acting as a threat, based on the pattern of activity that you identify.
<i>This coursework is specifically designed to challenge your critical thinking and creativity, and is designed as an open problem. Examine the data and try to think how an individual user may appear as an anomaly against the remainder of the data. This could be an anomaly compared to a group of users, or an anomaly as compared over time.</i>
### Assessment and Marking
---
Marks will be allocated within the following criteria:
* **Identification and justification of the suspicious behaviour (15)**
* **Analytical process and reasoning to deduce the suspicious behaviour (15)**
* **Use of informative visualisation and data exploration techniques (10)**
* **Clarity and professional presentation (5)**
To achieve the higher end of the grade scale, you need to demonstrate creativity in how you approach the problem of identifying malicious behaviours, and ensure that you have accounted for multiple anomalies across the set of data available.
This assignment should be submitted as a PDF to your Blackboard portfolio submission as per the instructions in the assignment specification available on Blackboard. A copy of your work should also be provided via a UWE Gitlab repository, with an accessible link provided with your portfolio.
### Contact
---
Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.
---
%% Cell type:markdown id: tags:
## Load in the data
%% Cell type:code id: tags:
``` python
# DO NOT MODIFY THIS CELL - this cell is splitting the data to provide a suitable subset of data to work with for this task.
# If you change this cell your output will differ from that expected and could impact your mark.
import random
import string
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import datetime

dataset_list = ['onlinebargains']
DATASET = dataset_list[0]

def load_data(DATASET):
    if DATASET in dataset_list:
        email_data = pd.read_csv('./T3_data/' + DATASET + '/email_data.csv', parse_dates=True, index_col=0)
        file_data = pd.read_csv('./T3_data/' + DATASET + '/file_data.csv', parse_dates=True, index_col=0)
        web_data = pd.read_csv('./T3_data/' + DATASET + '/web_data.csv', parse_dates=True, index_col=0)
        login_data = pd.read_csv('./T3_data/' + DATASET + '/login_data.csv', parse_dates=True, index_col=0)
        usb_data = pd.read_csv('./T3_data/' + DATASET + '/usb_data.csv', parse_dates=True, index_col=0)
        employee_data = pd.read_csv('./T3_data/' + DATASET + '/employee_data.csv', index_col=0)
        email_data['datetime'] = pd.to_datetime(email_data['datetime'])
        file_data['datetime'] = pd.to_datetime(file_data['datetime'])
        web_data['datetime'] = pd.to_datetime(web_data['datetime'])
        login_data['datetime'] = pd.to_datetime(login_data['datetime'])
        usb_data['datetime'] = pd.to_datetime(usb_data['datetime'])
    else:
        print ("DATASET variable not defined")
        return
    return employee_data, login_data, usb_data, web_data, file_data, email_data

employee_data, login_data, usb_data, web_data, file_data, email_data = load_data(DATASET)
employee_data
```
%% Output
user role email pc
0 usr-uda Security usr-uda@onlinebargains.com pc0
1 usr-hhe Security usr-hhe@onlinebargains.com pc1
2 usr-vxr Finance usr-vxr@onlinebargains.com pc2
3 usr-nba Finance usr-nba@onlinebargains.com pc3
4 usr-hqt Finance usr-hqt@onlinebargains.com pc4
.. ... ... ... ...
244 usr-jwo Finance usr-jwo@onlinebargains.com pc244
245 usr-hiz Security usr-hiz@onlinebargains.com pc245
246 usr-svz Services usr-svz@onlinebargains.com pc246
247 usr-ndr HR usr-ndr@onlinebargains.com pc247
248 usr-eie Finance usr-eie@onlinebargains.com pc248
[249 rows x 4 columns]
%% Cell type:markdown id: tags:
The cell above is creating a set of DataFrames to work with. The set of tables are named as follows:
* employee_data
* login_data
* usb_data
* web_data
* file_data
* email_data
%% Cell type:markdown id: tags:
# 1. Begin investigation
To start, I will investigate who has been accessing files related to security and then cross-reference this with the employee data to see whether anyone outside the Security and Technical roles has been attempting to access data they should not have access to.
%% Cell type:code id: tags:
``` python
# ANSWER
# To find anyone accessing files they should not, filter file_data for filenames
# containing "security" (a case-sensitive match) and collect the users involved.
import numpy as np
file_data_security = file_data[file_data["filename"].str.contains("security")]
file_data_security_unique = pd.unique(file_data_security["user"])
# Look up those users in the employee table.
filtered_employee_data = employee_data[employee_data['user'].isin(file_data_security_unique)]
# Drop users whose role contains "Security" or "Technical"; anyone left is out of role.
filtered_file_data = filtered_employee_data[~((filtered_employee_data['role'].str.contains("Security")) | (filtered_employee_data['role'].str.contains("Technical")))]
filtered_file_data
# Result: no users outside the Security or Technical roles had accessed these files.
```
%% Output
Empty DataFrame
Columns: [user, role, email, pc]
Index: []
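%% Cell type:markdown id: tags:
The check above starts from the employee table, so the individual file accesses are lost once the users are looked up. As a minimal sketch of the same idea expressed as a join, the cell below keeps each matching access alongside the user's role. The toy frames, the helper name `out_of_role_security_access`, and the choice of a case-insensitive match with exact role names are assumptions for illustration, not part of the provided dataset.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy stand-ins for file_data and employee_data (all values are hypothetical).
file_data_demo = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc", "usr-bbb"],
    "filename": ["/security/policy.doc", "/Security/keys.txt",
                 "/finance/q1.xls", "/hr/leave.doc"],
})
employee_demo = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc"],
    "role": ["Security", "Finance", "Finance"],
})

def out_of_role_security_access(files, employees, allowed_roles=("Security", "Technical")):
    """Return accesses to security-related files by users outside allowed_roles."""
    # case=False also catches "Security"; na=False treats missing filenames as no match.
    hits = files[files["filename"].str.contains("security", case=False, na=False)]
    # Attach each accessing user's role to the access record.
    merged = hits.merge(employees[["user", "role"]], on="user", how="left")
    # Users with no known role are kept, since isin() is False for missing values.
    return merged[~merged["role"].isin(allowed_roles)]

suspicious_access = out_of_role_security_access(file_data_demo, employee_demo)
print(suspicious_access)
```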
%% Cell type:markdown id: tags:
For the next step of my investigation, I decided to check who was logging on at suspicious times, which gives a good starting point by flagging certain users. I started by taking the mean, across the entire dataset, of how many logins occurred each day between 12 AM and 6 AM, assuming that these are not working hours. A spike in activity could imply that malicious activity was being performed around those hours; a drop in activity could be the direct result of someone acting maliciously (for example, taking down login servers).
%% Cell type:code id: tags:
``` python
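# Index by timestamp so that time-of-day filtering and resampling work.
login_data.index = login_data['datetime']
# Keep only events between midnight and 06:00 (both endpoints inclusive).
filtered_df = login_data.between_time("00:00", "06:00")
# Count "login" actions per calendar day.
daily_sum = filtered_df['action'].resample('D').apply(lambda x: (x == "login").sum())
sum_mean = daily_sum.mean()
# Keep only the days whose count is more than 10 above or below the mean.
filtered_daily_sum = daily_sum.apply(lambda x: x if (x < sum_mean - 10) | (x > sum_mean + 10) else None)
filtered_daily_sum = filtered_daily_sum.dropna()
print(filtered_daily_sum)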
```
%% Output
datetime
2020-01-03 102.0
2020-01-28 100.0
2020-01-29 78.0
2020-02-09 100.0
2020-02-13 78.0
2020-02-14 103.0
2020-02-27 76.0
2020-03-04 75.0
2020-03-29 102.0
2020-04-03 104.0
2020-04-06 101.0
2020-06-02 76.0
2020-06-13 100.0
2020-06-23 79.0
2020-06-28 78.0
2020-08-12 77.0
2020-08-23 100.0
2020-09-15 75.0
2020-09-22 77.0
2020-10-06 75.0
2020-10-07 77.0
2020-10-27 101.0
2020-11-09 75.0
2020-11-21 79.0
Name: action, dtype: float64
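%% Cell type:markdown id: tags:
The band of 10 logins either side of the mean is a fixed choice. One alternative, sketched below under the assumption that a spread-based threshold is acceptable here, is to flag days that lie more than a chosen number of standard deviations from the mean. The function name `flag_unusual_days`, the default of two standard deviations, and the demo counts are all hypothetical.
%% Cell type:code id: tags:
```python
import pandas as pd

def flag_unusual_days(daily_counts, n_std=2.0):
    """Return the days whose count is more than n_std standard deviations from the mean."""
    mean = daily_counts.mean()
    std = daily_counts.std()
    # Boolean mask: True where the absolute deviation exceeds the threshold.
    mask = (daily_counts - mean).abs() > n_std * std
    return daily_counts[mask]

# Hypothetical daily out-of-hours login counts with one obvious spike.
days = pd.date_range("2020-01-01", periods=10, freq="D")
daily_demo = pd.Series([90, 88, 91, 89, 90, 120, 90, 89, 91, 90], index=days)

flagged_days = flag_unusual_days(daily_demo)
print(flagged_days)
```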
%% Cell type:markdown id: tags:
The output above lists every date on which the number of out-of-hours logins was more than 10 above or below the daily mean. This has narrowed down my investigation to 24 days as opposed to six months. To continue the investigation, I will check the specific dates with a higher number of logins to see whether anyone is consistently causing them. In the end I was unable to get this to work; however, I believe that if the information were properly checked, it would point to a single user who could be the subject of further analysis.
%% Cell type:code id: tags:
``` python
# Select the login records that fall on the flagged dates.
# Note: filtered_daily_sum.index holds one midnight timestamp per flagged day,
# while login_data['datetime'] holds full timestamps, so isin() only matches
# logins recorded at exactly 00:00:00. That is why the result below is empty.
selected_rows = login_data[login_data['datetime'].isin(filtered_daily_sum.index)]
selected_rows
```
%% Output
Empty DataFrame
Columns: [datetime, user, action, pc]
Index: []
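%% Cell type:markdown id: tags:
The empty result above comes from comparing full timestamps against dates stamped at midnight. A minimal sketch of one way around this is to truncate each timestamp to its day with `dt.normalize()` before calling `isin()`. The demo records and the `flagged_index` stand-in for `filtered_daily_sum.index` are hypothetical; only the column names follow the real `login_data`.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical login records using the same column names as login_data.
login_demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-01-03 01:15:00", "2020-01-03 14:30:00",
                                "2020-01-04 02:00:00"]),
    "user": ["usr-aaa", "usr-bbb", "usr-aaa"],
    "action": ["login", "login", "login"],
    "pc": ["pc0", "pc1", "pc0"],
})
# Stand-in for filtered_daily_sum.index: flagged days stamped at midnight.
flagged_index = pd.DatetimeIndex(["2020-01-03"])

# Exact comparison finds nothing, because no login happened at 00:00:00.
exact_match = login_demo[login_demo["datetime"].isin(flagged_index)]

# Truncating each timestamp to midnight lets whole days be compared.
day_match = login_demo[login_demo["datetime"].dt.normalize().isin(flagged_index)]

print(len(exact_match), len(day_match))
```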
%% Cell type:markdown id: tags:
The unusually high login counts on some of the dates listed above may indicate malicious activity on those dates. Further investigation, specifically of who was logging in on those dates, is likely to reveal the party responsible for these discrepancies.
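As a rough sketch of that further investigation, the cell below counts out-of-hours logins per user on a set of flagged days and ranks the users by that count. The helper `rank_users_on_days`, the demo records, and the "logoff" action value are assumptions for illustration; a high count on its own would still need comparing against each user's normal behaviour.
%% Cell type:code id: tags:
```python
import pandas as pd

def rank_users_on_days(logins, flagged_days, start="00:00", end="06:00"):
    """Count out-of-hours logins per user on the flagged days, most active first."""
    indexed = logins.set_index("datetime")
    # Restrict to the out-of-hours window, then to login events only.
    window = indexed.between_time(start, end)
    window = window[window["action"] == "login"]
    # Compare whole days by truncating each timestamp to midnight.
    on_days = window[window.index.normalize().isin(flagged_days)]
    return on_days["user"].value_counts()

# Hypothetical records: one flagged day (3 January) and one ordinary day.
logins_demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-03 01:00:00", "2020-01-03 03:00:00", "2020-01-03 02:00:00",
        "2020-01-03 12:00:00", "2020-01-04 01:00:00", "2020-01-03 04:00:00",
    ]),
    "user": ["usr-aaa", "usr-aaa", "usr-bbb", "usr-bbb", "usr-aaa", "usr-ccc"],
    "action": ["login", "login", "login", "login", "login", "logoff"],
})

user_counts = rank_users_on_days(logins_demo, pd.DatetimeIndex(["2020-01-03"]))
print(user_counts)
```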