Final commit of the T3 notebook before handin

Commit 65901ffd authored 2 years ago by b2-scannell
--- a/SDAV-T3-Notebook-2022-STUDENT.ipynb
+++ b/SDAV-T3-Notebook-2022-STUDENT.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img src=\"http://www.cems.uwe.ac.uk/~pa-legg/images/uwe_banner.png\">\n",
+    "\n",
+    "# UFCFEL-15-3 Security Data Analytics and Visualisation\n",
+    "# Portfolio Assignment 3: Large-Scale Data Exploration for Insider Threat Detection  (2022)\n",
+    "---\n",
+    "\n",
+    "The completion of this worksheet is worth a **maximum of 45 marks** towards your portfolio assignment for the UFCFEL-15-3 Security Data Analytics and Visualisation (SDAV) module.\n",
+    "\n",
+    "### Brief\n",
+    "---\n",
+    "\n",
+    "In this task, you have been asked to investigate a potential security threat within an organisation. Building on your previous worksheet expertise, you will need to apply your skills and knowledge of data analytics and visualisation to examine and explore the datasets methodically to uncover which employee is acting as a threat and why. The company have provided you with activity logs for various user interactions for the past 6 months, resulting in a large volume of data that they need your expertise to decipher. They want to have a report that details the investigation that you have carried out, details of the suspected individual, and a clear rationale as to why this suspect is flagged. You will need to document your investigation, giving clear justification for your process using Markdown annotation within your notebook. You will need to provide a clear rationale for why you suspect a given individual to be acting as a threat, based on the pattern of activity that you identify.\n",
+    "\n",
+    "<i>This coursework is specifically designed to challenge your critical thinking and creativity, and is designed as an open problem. Examine the data and try to think how an individual user may appear as an anomaly against the remainder of the data. This could be an anomaly compared to a group of users, or an anomaly as compared over time.</i>\n",
+    "\n",
+    "### Assessment and Marking\n",
+    "---\n",
+    "\n",
+    "Marks will be allocated within the following criteria:\n",
+    "\n",
+    "* **Identification and justification of the suspicious behaviour (15)**\n",
+    "* **Analytical process and reasoning to deduce the suspicious behaviour (15)**\n",
+    "* **Use of informative visualisation and data exploration techniques (10)**\n",
+    "* **Clarity and professional presentation (5)**\n",
+    "\n",
+    "To achieve the higher end of the grade scale, you need to demonstrate creativity in how you approach the problem of identifying malicious behaviours, and ensure that you have accounted for multiple anomalies across the set of data available.\n",
+    "\n",
+    "This assignment should be submitted as a PDF to your Blackboard portfolio submission as per the instructions in the assignment specification available on Blackboard. A copy of your work should also be provided via a UWE Gitlab repository, with an accessible link provided with your portfolio.\n",
+    "\n",
+    "### Contact\n",
+    "---\n",
+    "\n",
+    "Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.\n",
+    "\n",
+    "---\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load in the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user</th>\n",
+       "      <th>role</th>\n",
+       "      <th>email</th>\n",
+       "      <th>pc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>usr-uda</td>\n",
+       "      <td>Security</td>\n",
+       "      <td>usr-uda@onlinebargains.com</td>\n",
+       "      <td>pc0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>usr-hhe</td>\n",
+       "      <td>Security</td>\n",
+       "      <td>usr-hhe@onlinebargains.com</td>\n",
+       "      <td>pc1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>usr-vxr</td>\n",
+       "      <td>Finance</td>\n",
+       "      <td>usr-vxr@onlinebargains.com</td>\n",
+       "      <td>pc2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>usr-nba</td>\n",
+       "      <td>Finance</td>\n",
+       "      <td>usr-nba@onlinebargains.com</td>\n",
+       "      <td>pc3</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>usr-hqt</td>\n",
+       "      <td>Finance</td>\n",
+       "      <td>usr-hqt@onlinebargains.com</td>\n",
+       "      <td>pc4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>244</th>\n",
+       "      <td>usr-jwo</td>\n",
+       "      <td>Finance</td>\n",
+       "      <td>usr-jwo@onlinebargains.com</td>\n",
+       "      <td>pc244</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>245</th>\n",
+       "      <td>usr-hiz</td>\n",
+       "      <td>Security</td>\n",
+       "      <td>usr-hiz@onlinebargains.com</td>\n",
+       "      <td>pc245</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>246</th>\n",
+       "      <td>usr-svz</td>\n",
+       "      <td>Services</td>\n",
+       "      <td>usr-svz@onlinebargains.com</td>\n",
+       "      <td>pc246</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>247</th>\n",
+       "      <td>usr-ndr</td>\n",
+       "      <td>HR</td>\n",
+       "      <td>usr-ndr@onlinebargains.com</td>\n",
+       "      <td>pc247</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>248</th>\n",
+       "      <td>usr-eie</td>\n",
+       "      <td>Finance</td>\n",
+       "      <td>usr-eie@onlinebargains.com</td>\n",
+       "      <td>pc248</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>249 rows × 4 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "        user      role                       email     pc\n",
+       "0    usr-uda  Security  usr-uda@onlinebargains.com    pc0\n",
+       "1    usr-hhe  Security  usr-hhe@onlinebargains.com    pc1\n",
+       "2    usr-vxr   Finance  usr-vxr@onlinebargains.com    pc2\n",
+       "3    usr-nba   Finance  usr-nba@onlinebargains.com    pc3\n",
+       "4    usr-hqt   Finance  usr-hqt@onlinebargains.com    pc4\n",
+       "..       ...       ...                         ...    ...\n",
+       "244  usr-jwo   Finance  usr-jwo@onlinebargains.com  pc244\n",
+       "245  usr-hiz  Security  usr-hiz@onlinebargains.com  pc245\n",
+       "246  usr-svz  Services  usr-svz@onlinebargains.com  pc246\n",
+       "247  usr-ndr        HR  usr-ndr@onlinebargains.com  pc247\n",
+       "248  usr-eie   Finance  usr-eie@onlinebargains.com  pc248\n",
+       "\n",
+       "[249 rows x 4 columns]"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# DO NOT MODIFY THIS CELL - this cell is splitting the data to provide a suitable subset of data to work with for this task.\n",
+    "# If you change this cell your output will differ from that expected and could impact your mark.\n",
+    "\n",
+    "import random\n",
+    "import string\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn\n",
+    "import datetime\n",
+    "\n",
+    "dataset_list = ['onlinebargains']\n",
+    "DATASET = dataset_list[0]\n",
+    "\n",
+    "def load_data(DATASET):\n",
+    "    if DATASET in dataset_list:\n",
+    "        email_data = pd.read_csv('./T3_data/' + DATASET + '/email_data.csv', parse_dates=True, index_col=0)\n",
+    "        file_data = pd.read_csv('./T3_data/' + DATASET + '/file_data.csv', parse_dates=True, index_col=0)\n",
+    "        web_data = pd.read_csv('./T3_data/' + DATASET + '/web_data.csv', parse_dates=True, index_col=0)\n",
+    "        login_data = pd.read_csv('./T3_data/' + DATASET + '/login_data.csv', parse_dates=True, index_col=0)\n",
+    "        usb_data = pd.read_csv('./T3_data/' + DATASET + '/usb_data.csv', parse_dates=True, index_col=0)\n",
+    "        employee_data = pd.read_csv('./T3_data/' + DATASET + '/employee_data.csv', index_col=0)\n",
+    "        \n",
+    "        email_data['datetime'] = pd.to_datetime(email_data['datetime'])\n",
+    "        file_data['datetime'] = pd.to_datetime(file_data['datetime'])\n",
+    "        web_data['datetime'] = pd.to_datetime(web_data['datetime'])\n",
+    "        login_data['datetime'] = pd.to_datetime(login_data['datetime'])\n",
+    "        usb_data['datetime'] = pd.to_datetime(usb_data['datetime'])\n",
+    "    else:\n",
+    "        print (\"DATASET variable not defined\")\n",
+    "        return\n",
+    "    return employee_data, login_data, usb_data, web_data, file_data, email_data\n",
+    "\n",
+    "employee_data, login_data, usb_data, web_data, file_data, email_data = load_data(DATASET)\n",
+    "employee_data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The cell above is creating a set of DataFrames to work with. The set of tables are named as follows:\n",
+    "\n",
+    "* employee_data\n",
+    "* login_data\n",
+    "* usb_data\n",
+    "* web_data\n",
+    "* file_data\n",
+    "* email_data\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Begin investigation\n",
+    "\n",
+    "To start, I will identify who has attempted to access security-related folders and then cross-reference those users with the employee data to see whether anyone outside the Security role has tried to access data they should not have access to."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>user</th>\n",
+       "      <th>role</th>\n",
+       "      <th>email</th>\n",
+       "      <th>pc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "Empty DataFrame\n",
+       "Columns: [user, role, email, pc]\n",
+       "Index: []"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# ANSWER\n",
+    "# To find anyone attempting to access a folder they should not, I created a filter that lists the users who have attempted to access \"security\" files.\n",
+    "import numpy as np\n",
+    "file_data_security = file_data[file_data[\"filename\"].str.contains(\"security\")]\n",
+    "file_data_security_unique = pd.unique(file_data_security[\"user\"])\n",
+    "filtered_employee_data = employee_data[employee_data['user'].isin(file_data_security_unique)]\n",
+    "filtered_file_data = filtered_employee_data[~((filtered_employee_data['role'].str.contains(\"Security\")) | (filtered_employee_data['role'].str.contains(\"Technical\")))]\n",
+    "filtered_file_data\n",
+    "# In the end, no users outside the Security or Technical roles had attempted to access these files."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the next step of my investigation, I checked who was logging on at suspicious times, which gives a good starting point by flagging particular users. I started by counting, for each day, the logins that occurred between 12AM and 6AM, assuming that these are not working hours, and comparing each day's count with the average across the entire dataset. A spike in activity could imply that malicious activity was being performed during those hours; a drop in activity could be the direct result of someone acting maliciously (for example, taking down login servers)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "datetime\n",
+      "2020-01-03    102.0\n",
+      "2020-01-28    100.0\n",
+      "2020-01-29     78.0\n",
+      "2020-02-09    100.0\n",
+      "2020-02-13     78.0\n",
+      "2020-02-14    103.0\n",
+      "2020-02-27     76.0\n",
+      "2020-03-04     75.0\n",
+      "2020-03-29    102.0\n",
+      "2020-04-03    104.0\n",
+      "2020-04-06    101.0\n",
+      "2020-06-02     76.0\n",
+      "2020-06-13    100.0\n",
+      "2020-06-23     79.0\n",
+      "2020-06-28     78.0\n",
+      "2020-08-12     77.0\n",
+      "2020-08-23    100.0\n",
+      "2020-09-15     75.0\n",
+      "2020-09-22     77.0\n",
+      "2020-10-06     75.0\n",
+      "2020-10-07     77.0\n",
+      "2020-10-27    101.0\n",
+      "2020-11-09     75.0\n",
+      "2020-11-21     79.0\n",
+      "Name: action, dtype: float64\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
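+    "# Index by timestamp so that time-of-day and daily resampling operations work.\n",
+    "login_data.index = login_data['datetime']\n",
+    "# Keep only events between midnight and 6AM (assumed to be outside working hours).\n",
+    "filtered_df = login_data.between_time(\"00:00\", \"06:00\")\n",
+    "# Count the off-hours login events for each calendar day.\n",
+    "daily_sum = filtered_df['action'].resample('D').apply(lambda x: (x == \"login\").sum())\n",
+    "sum_mean = daily_sum.mean()\n",
+    "# Keep only the days whose count is more than 10 above or below the daily mean.\n",
+    "filtered_daily_sum = daily_sum.apply(lambda x: x if (x < sum_mean - 10) | (x > sum_mean + 10) else None)\n",
+    "filtered_daily_sum = filtered_daily_sum.dropna()\n",
+    "print(filtered_daily_sum)"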
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The output above lists every date on which the number of off-hours logins was more than 10 above or below the daily mean. This narrows the investigation from the full period covered by the logs to 24 days. To continue, I wanted to check the dates with a higher number of logins to see whether anyone was consistently responsible for them. In the end I was unable to get this to work; however, I believe that if the logins on these dates were properly checked, a single user would stand out who could be the focus of further analysis."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>datetime</th>\n",
+       "      <th>user</th>\n",
+       "      <th>action</th>\n",
+       "      <th>pc</th>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>datetime</th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "      <th></th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "Empty DataFrame\n",
+       "Columns: [datetime, user, action, pc]\n",
+       "Index: []"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "#filtered_employee_data = login_data['datetime'].isin(filtered_daily_sum)\n",
+    "#filtered_employee_data = filtered_employee_data.bool.contains(True)\n",
+    "#selected_rows = filtered_employee_data.loc[filtered_employee_data == True]\n",
+    "#selected_rows = filtered_employee_data[filtered_employee_data == True]\n",
+    "#filtered_employee_data = login_data['datetime'].isin(filtered_daily_sum)\n",
+    "#print(filtered_employee_data)\n",
+    "#selected_rows = login_data[filtered_employee_data]\n",
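+    "# NOTE: this comparison returns no rows, most likely because 'datetime' holds full timestamps\n",
+    "# while filtered_daily_sum.index holds dates at midnight, so no value matches exactly.\n",
+    "selected_rows = login_data[login_data['datetime'].isin(filtered_daily_sum.index)]\n",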
+    "selected_rows"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The unusually high levels of off-hours activity on the dates listed above suggest that malicious activity may have taken place on those dates. Further investigation, specifically of who was logging in on those dates, is likely to reveal the party responsible for these discrepancies."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  },
+  "widgets": {
+   "state": {},
+   "version": "1.1.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
+%% Cell type:markdown id: tags:
+
+<img src="http://www.cems.uwe.ac.uk/~pa-legg/images/uwe_banner.png">
+
+# UFCFEL-15-3 Security Data Analytics and Visualisation
+# Portfolio Assignment 3: Large-Scale Data Exploration for Insider Threat Detection  (2022)
+---
+
+The completion of this worksheet is worth a **maximum of 45 marks** towards your portfolio assignment for the UFCFEL-15-3 Security Data Analytics and Visualisation (SDAV) module.
+
+### Brief
+---
+
+In this task, you have been asked to investigate a potential security threat within an organisation. Building on your previous worksheet expertise, you will need to apply your skills and knowledge of data analytics and visualisation to examine and explore the datasets methodically to uncover which employee is acting as a threat and why. The company have provided you with activity logs for various user interactions for the past 6 months, resulting in a large volume of data that they need your expertise to decipher. They want to have a report that details the investigation that you have carried out, details of the suspected individual, and a clear rationale as to why this suspect is flagged. You will need to document your investigation, giving clear justification for your process using Markdown annotation within your notebook. You will need to provide a clear rationale for why you suspect a given individual to be acting as a threat, based on the pattern of activity that you identify.
+
+<i>This coursework is specifically designed to challenge your critical thinking and creativity, and is designed as an open problem. Examine the data and try to think how an individual user may appear as an anomaly against the remainder of the data. This could be an anomaly compared to a group of users, or an anomaly as compared over time.</i>
+
+### Assessment and Marking
+---
+
+Marks will be allocated within the following criteria:
+
+* **Identification and justification of the suspicious behaviour (15)**
+* **Analytical process and reasoning to deduce the suspicious behaviour (15)**
+* **Use of informative visualisation and data exploration techniques (10)**
+* **Clarity and professional presentation (5)**
+
+To achieve the higher end of the grade scale, you need to demonstrate creativity in how you approach the problem of identifying malicious behaviours, and ensure that you have accounted for multiple anomalies across the set of data available.
+
+This assignment should be submitted as a PDF to your Blackboard portfolio submission as per the instructions in the assignment specification available on Blackboard. A copy of your work should also be provided via a UWE Gitlab repository, with an accessible link provided with your portfolio.
+
+### Contact
+---
+
+Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You can use the Blackboard Q&A feature to ask questions related to this module and this assignment, as well as the on-site teaching sessions.
+
+---
+
+%% Cell type:markdown id: tags:
+
+## Load in the data
+
+%% Cell type:code id: tags:
+
+``` python
+# DO NOT MODIFY THIS CELL - this cell is splitting the data to provide a suitable subset of data to work with for this task.
+# If you change this cell your output will differ from that expected and could impact your mark.
+
+import random
+import string
+import pandas as pd
+import matplotlib.pyplot as plt
+import seaborn
+import datetime
+
+dataset_list = ['onlinebargains']
+DATASET = dataset_list[0]
+
+def load_data(DATASET):
+    if DATASET in dataset_list:
+        email_data = pd.read_csv('./T3_data/' + DATASET + '/email_data.csv', parse_dates=True, index_col=0)
+        file_data = pd.read_csv('./T3_data/' + DATASET + '/file_data.csv', parse_dates=True, index_col=0)
+        web_data = pd.read_csv('./T3_data/' + DATASET + '/web_data.csv', parse_dates=True, index_col=0)
+        login_data = pd.read_csv('./T3_data/' + DATASET + '/login_data.csv', parse_dates=True, index_col=0)
+        usb_data = pd.read_csv('./T3_data/' + DATASET + '/usb_data.csv', parse_dates=True, index_col=0)
+        employee_data = pd.read_csv('./T3_data/' + DATASET + '/employee_data.csv', index_col=0)
+
+        email_data['datetime'] = pd.to_datetime(email_data['datetime'])
+        file_data['datetime'] = pd.to_datetime(file_data['datetime'])
+        web_data['datetime'] = pd.to_datetime(web_data['datetime'])
+        login_data['datetime'] = pd.to_datetime(login_data['datetime'])
+        usb_data['datetime'] = pd.to_datetime(usb_data['datetime'])
+    else:
+        print ("DATASET variable not defined")
+        return
+    return employee_data, login_data, usb_data, web_data, file_data, email_data
+
+employee_data, login_data, usb_data, web_data, file_data, email_data = load_data(DATASET)
+employee_data
+```
+
+%% Output
+
+            user      role                       email     pc
+    0    usr-uda  Security  usr-uda@onlinebargains.com    pc0
+    1    usr-hhe  Security  usr-hhe@onlinebargains.com    pc1
+    2    usr-vxr   Finance  usr-vxr@onlinebargains.com    pc2
+    3    usr-nba   Finance  usr-nba@onlinebargains.com    pc3
+    4    usr-hqt   Finance  usr-hqt@onlinebargains.com    pc4
+    ..       ...       ...                         ...    ...
+    244  usr-jwo   Finance  usr-jwo@onlinebargains.com  pc244
+    245  usr-hiz  Security  usr-hiz@onlinebargains.com  pc245
+    246  usr-svz  Services  usr-svz@onlinebargains.com  pc246
+    247  usr-ndr        HR  usr-ndr@onlinebargains.com  pc247
+    248  usr-eie   Finance  usr-eie@onlinebargains.com  pc248
+    
+    [249 rows x 4 columns]
+
+%% Cell type:markdown id: tags:
+
+The cell above is creating a set of DataFrames to work with. The set of tables are named as follows:
+
+* employee_data
+* login_data
+* usb_data
+* web_data
+* file_data
+* email_data
+
+%% Cell type:markdown id: tags:
+
+# 1. Begin investigation
+
+To start, I will identify who has attempted to access security-related folders and then cross-reference those users with the employee data to see whether anyone outside the Security role has tried to access data they should not have access to.
+
+%% Cell type:code id: tags:
+
+``` python
+# ANSWER
+# To find anyone attempting to access a folder they should not, I created a filter that lists the users who have attempted to access "security" files.
+import numpy as np
+file_data_security = file_data[file_data["filename"].str.contains("security")]
+file_data_security_unique = pd.unique(file_data_security["user"])
+filtered_employee_data = employee_data[employee_data['user'].isin(file_data_security_unique)]
+filtered_file_data = filtered_employee_data[~((filtered_employee_data['role'].str.contains("Security")) | (filtered_employee_data['role'].str.contains("Technical")))]
+filtered_file_data
+# In the end, no users outside the Security or Technical roles had attempted to access these files.
+```
+
+%% Output
+
+    Empty DataFrame
+    Columns: [user, role, email, pc]
+    Index: []
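
The cross-reference above can also be written as a join, which keeps each flagged file access next to the role of the user who made it. This is a minimal sketch, not the notebook's method: it assumes `file_data` has `user` and `filename` columns and `employee_data` has `user` and `role` columns, as the cells above suggest. The helper name `unauthorised_security_access`, the toy rows, the exact-match role check, and the case-insensitive filename match (`case=False`) are all assumptions made for illustration.

```python
import pandas as pd

def unauthorised_security_access(file_data, employee_data,
                                 allowed_roles=("Security", "Technical")):
    """Return security-related file accesses made by users outside allowed_roles."""
    # Case-insensitive match so 'Security' and 'security' paths are both caught
    # (an assumption; the original filter matched lowercase 'security' only).
    mask = file_data["filename"].str.contains("security", case=False, na=False)
    security_hits = file_data[mask]
    # Attach each user's role to their file accesses.
    merged = security_hits.merge(employee_data[["user", "role"]], on="user", how="left")
    # Keep only accesses by users whose role is not in the allowed list.
    return merged[~merged["role"].isin(allowed_roles)]

# Toy data for illustration only (not taken from the real dataset).
toy_files = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc"],
    "filename": ["/security/policy.doc", "/finance/q1.xls", "/Security/keys.txt"],
})
toy_staff = pd.DataFrame({
    "user": ["usr-aaa", "usr-bbb", "usr-ccc"],
    "role": ["Security", "Finance", "HR"],
})

flagged = unauthorised_security_access(toy_files, toy_staff)
print(flagged)
```

On the real data this would be called as `unauthorised_security_access(file_data, employee_data)`. Because the match is case-insensitive, it may return rows that the original lowercase-only filter missed, so an empty result here would be a stronger check than the one above.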
+
+%% Cell type:markdown id: tags:
+
+For the next step of my investigation, I checked who was logging on at suspicious times, which gives a good starting point by flagging particular users. I started by counting, for each day, the logins that occurred between 12AM and 6AM, assuming that these are not working hours, and comparing each day's count with the average across the entire dataset. A spike in activity could imply that malicious activity was being performed during those hours; a drop in activity could be the direct result of someone acting maliciously (for example, taking down login servers).
+
+%% Cell type:code id: tags:
+
+``` python
+
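+# Index by timestamp so that time-of-day and daily resampling operations work.
+login_data.index = login_data['datetime']
+# Keep only events between midnight and 6AM (assumed to be outside working hours).
+filtered_df = login_data.between_time("00:00", "06:00")
+# Count the off-hours login events for each calendar day.
+daily_sum = filtered_df['action'].resample('D').apply(lambda x: (x == "login").sum())
+sum_mean = daily_sum.mean()
+# Keep only the days whose count is more than 10 above or below the daily mean.
+filtered_daily_sum = daily_sum.apply(lambda x: x if (x < sum_mean - 10) | (x > sum_mean + 10) else None)
+filtered_daily_sum = filtered_daily_sum.dropna()
+print(filtered_daily_sum)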
+```
+
+%% Output
+
+    datetime
+    2020-01-03    102.0
+    2020-01-28    100.0
+    2020-01-29     78.0
+    2020-02-09    100.0
+    2020-02-13     78.0
+    2020-02-14    103.0
+    2020-02-27     76.0
+    2020-03-04     75.0
+    2020-03-29    102.0
+    2020-04-03    104.0
+    2020-04-06    101.0
+    2020-06-02     76.0
+    2020-06-13    100.0
+    2020-06-23     79.0
+    2020-06-28     78.0
+    2020-08-12     77.0
+    2020-08-23    100.0
+    2020-09-15     75.0
+    2020-09-22     77.0
+    2020-10-06     75.0
+    2020-10-07     77.0
+    2020-10-27    101.0
+    2020-11-09     75.0
+    2020-11-21     79.0
+    Name: action, dtype: float64
+
+%% Cell type:markdown id: tags:
+
+The output above lists every date on which the number of off-hours logins was more than 10 above or below the daily mean. This narrows the investigation from the full period covered by the logs to 24 days. To continue, I wanted to check the dates with a higher number of logins to see whether anyone was consistently responsible for them. In the end I was unable to get this to work; however, I believe that if the logins on these dates were properly checked, a single user would stand out who could be the focus of further analysis.
+
+%% Cell type:code id: tags:
+
+``` python
+#filtered_employee_data = login_data['datetime'].isin(filtered_daily_sum)
+#filtered_employee_data = filtered_employee_data.bool.contains(True)
+#selected_rows = filtered_employee_data.loc[filtered_employee_data == True]
+#selected_rows = filtered_employee_data[filtered_employee_data == True]
+#filtered_employee_data = login_data['datetime'].isin(filtered_daily_sum)
+#print(filtered_employee_data)
+#selected_rows = login_data[filtered_employee_data]
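+# NOTE: this comparison returns no rows, most likely because 'datetime' holds full timestamps
+# while filtered_daily_sum.index holds dates at midnight, so no value matches exactly.
+selected_rows = login_data[login_data['datetime'].isin(filtered_daily_sum.index)]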
+selected_rows
+```
+
+%% Output
+
+    Empty DataFrame
+    Columns: [datetime, user, action, pc]
+    Index: []
+
+%% Cell type:markdown id: tags:
+
+The unusually high levels of off-hours activity on the dates listed above suggest that malicious activity may have taken place on those dates. Further investigation, specifically of who was logging in on those dates, is likely to reveal the party responsible for these discrepancies.
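
One way the follow-up described above might be carried out is to compare calendar dates rather than full timestamps, then count logins per user on the flagged days. This is a sketch under stated assumptions: `login_data` has `datetime`, `user`, `action` and `pc` columns as shown in the output above, and the flagged days are midnight timestamps such as those in `filtered_daily_sum.index`. The helper name `logins_on_flagged_days`, the toy rows, and the `"logoff"` action value are hypothetical.

```python
import pandas as pd

def logins_on_flagged_days(login_data, flagged_days):
    """Select login events whose calendar date is one of flagged_days."""
    # Strip the time component so each timestamp compares equal to its day.
    event_day = login_data["datetime"].dt.normalize()
    on_flagged_day = event_day.isin(pd.DatetimeIndex(flagged_days))
    is_login = login_data["action"] == "login"
    return login_data[on_flagged_day & is_login]

# Toy data for illustration only (not taken from the real dataset).
toy_logins = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-01-03 01:15:00", "2020-01-03 02:40:00",
        "2020-01-04 03:00:00", "2020-01-28 04:30:00",
        "2020-01-28 05:10:00",
    ]),
    "user": ["usr-aaa", "usr-bbb", "usr-aaa", "usr-aaa", "usr-bbb"],
    # "logoff" is a hypothetical value for a non-login event.
    "action": ["login", "login", "login", "login", "logoff"],
    "pc": ["pc0", "pc1", "pc0", "pc0", "pc1"],
})
toy_flagged = pd.to_datetime(["2020-01-03", "2020-01-28"])

selected = logins_on_flagged_days(toy_logins, toy_flagged)
# Users who log in most often on flagged days are candidates for closer review.
per_user = selected["user"].value_counts()
print(per_user)
```

On the real data the call would be `logins_on_flagged_days(login_data, filtered_daily_sum.index)`, optionally restricted to the 12AM-6AM window first. A user who appears near the top of `per_user` across many flagged days would be a reasonable starting point, though a high count alone does not establish malicious intent.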