From b2f6fc07519f0e305fb81e7594024902b1cdd7b0 Mon Sep 17 00:00:00 2001
From: zoonalink <zoonalink@gmail.com>
Date: Tue, 27 Dec 2022 23:56:16 +0000
Subject: [PATCH] T2 - draft reflection

---
 T2 - Draft reflection.ipynb | 110 ++++++++++++++++++++++++++++++++++++
 Task2 reflect notes.ipynb | 72 +++++++++++++++++++++++
 Task2 supplementary.ipynb | 2 +-
 3 files changed, 183 insertions(+), 1 deletion(-)
 create mode 100644 T2 - Draft reflection.ipynb

diff --git a/T2 - Draft reflection.ipynb b/T2 - Draft reflection.ipynb
new file mode 100644
index 0000000..7388307
--- /dev/null
+++ b/T2 - Draft reflection.ipynb
@@ -0,0 +1,110 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Task 2 Reflection"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "The purpose of this report is to reflect on my code development process for Task 2 of the coursework for `UFCFVQ-15-M Programming_for_Data_Science`.\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Code Description\n",
+ "\n",
+ "Task_2 requires undertaking a short 'data science' project, making use of Python libraries such as `pandas`, `numpy`, `matplotlib`, `seaborn` and `scipy`. The project is to import two datasets, merge them, clean the data and then analyse it, including visualisation and appropriate statistical testing. The work is presented in a Jupyter notebook.\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Development Process\n",
+ "\n",
+ "As with Task_1, my development process roughly followed an iterative CRISP-DM approach. The initial requirements (FRs) were prescriptive and straightforward, so I was able to get started quickly.\n",
+ "\n",
+ "Following the import, merge, filter and clean tasks, I started the analysis stage. This was very much in the <i>exploratory</i> spirit of Exploratory Data Analysis (EDA).\n",
+ "\n",
+ "I spent a significant amount of time in this stage, exploring and visualising the data and making regular discoveries about it. Being new to Python, the learning curve was steep: I knew what I wanted to achieve but was not always able to do so in a timely manner. I also needed to spend time refreshing my knowledge of statistics, as I have not studied it for many years."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Code Evaluation\n",
+ "\n",
+ "The initial tasks (FR7-9) were fairly straightforward in that it was largely a case of following the instructions, and I did not encounter any problems with them. That said, I always wonder whether there is a <i>more efficient</i> or <i>standard</i> way of doing things.\n",
+ "\n",
+ "I made use of what I have encountered so far in terms of Python libraries, in particular `pandas` for dataframe manipulation.\n",
+ "\n",
+ "FR10-13 were much more challenging (and interesting) because they are not prescriptive. Whilst undertaking EDA, I found that I was simultaneously considering all four requirements, and that I would need to unpick this for the purposes of the coursework later.\n",
+ "\n",
+ "I found that I wanted to further 'clean' and 'prepare' the data before presenting any visualisations for FR10 - but this was a result of having spent a lot of time exploring the dataset, visually as well as statistically.\n",
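+ "\n",
+ "For illustration only (the file and column names below are invented placeholders, not the coursework data), this is the kind of further preparation and quick distribution check I mean:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "import seaborn as sns\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Hypothetical merged dataset - the file and column names are placeholders.\n",
+ "df = pd.read_csv('merged_data.csv')\n",
+ "\n",
+ "# Further preparation before visualising: remove duplicates,\n",
+ "# drop rows missing the variable of interest and fix its dtype.\n",
+ "df = df.drop_duplicates()\n",
+ "df = df.dropna(subset=['age'])\n",
+ "df['age'] = df['age'].astype(float)\n",
+ "\n",
+ "# One of several distribution views compared during EDA.\n",
+ "sns.histplot(data=df, x='age', kde=True)\n",
+ "plt.title('Distribution of age (illustrative)')\n",
+ "plt.show()\n",
+ "```\n",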
+ "\n",
+ "My code for FR10 therefore begins by processing the data further before exploring it visually. I produced a lot of visuals using both `matplotlib` and `seaborn`, experimenting with the many options, occasionally getting lost in the process but always learning something new and discovering how powerful these libraries can be.\n",
+ "\n",
+ "I believe that my code is very thorough and shows relevant, helpful output. I have made it as readable as possible, commenting where necessary and adding markdown cells to present the story of the data.\n",
+ "\n",
+ "I would say that I may be presenting too many similar visualisations - e.g. distributions. This is partly because this is coursework (and the audience is not typical) but also because I am not yet sure which visualisation is preferable.\n",
+ "\n",
+ "FR11-13 are more succinct as I settled into an approach and stylistic conventions (comments, naming, etc.), but they are still very thorough - i.e., I spent time on different statistical tests, visualised the variables and their relationships, and practised Git, Markdown and Jupyter notebooks.\n",
+ "\n",
+ "### Strengths\n",
+ "\n",
+ "* use of different libraries\n",
+ "* attempt to find a style / approach / format\n",
+ "* thoroughness\n",
+ "\n",
+ "### Weaknesses\n",
+ "\n",
+ "* too many similar visualisations?\n",
+ "* duplication - i.e. multiple similar visualisations or statistical tests (although in my defence, I want to learn and practise...)\n",
+ "\n",
+ "### Future Improvements\n",
+ "\n",
+ "* learning standards, best practice, efficiencies, more libraries\n",
+ "* balance code / comment / markdown\n",
+ "* \n",
+ "\n",
+ "## Conclusion / Summary\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]"
+ },
+ "orig_nbformat": 4,
+ "vscode": {
+ "interpreter": {
+ "hash": "3a85823825384e2f260493b9b35c69d8eaac198ff59bb0d6c0e72fffbde301e2"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/Task2 reflect notes.ipynb b/Task2 reflect notes.ipynb
index 3c9b920..3f935c2 100644
--- a/Task2 reflect notes.ipynb
+++ b/Task2 reflect notes.ipynb
@@ -26,6 +26,78 @@
 "* not enough time"
 ]
 },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### <b> Introduction </b>\n",
+ "\n",
+ "The purpose of this report is to provide a short, critical self-assessment of my code development process for Task 1 of the coursework for `UFCFVQ-15-M Programming_for_Data_Science`.\n",
+ "\n",
+ "### <b> Code Description </b>\n",
+ "Task_1 requires writing a series of functions, in order, to ultimately calculate Pearson’s Correlation Coefficients (PCCs) for pairs of variables in a given data file, without using imported Python libraries, and to print a well-formatted table.\n",
+ "\n",
+ "Functional requirements (FRs):\n",
+ "\n",
+ "| FR | Description |\n",
+ "|-----|-----------------------|\n",
+ "| FR1 | Arithmetic mean |\n",
+ "| FR2 | Read column from file |\n",
+ "| FR3 | Read file |\n",
+ "| FR4 | PCC for two lists |\n",
+ "| FR5 | PCC for file |\n",
+ "| FR6 | Print table |\n",
+ "\n",
+ "The code was developed in a Jupyter notebook using a Python 3.11 kernel.\n",
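+ "\n",
+ "As a minimal sketch of the FR1/FR4 idea (the function names and sample data below are hypothetical, and my actual functions also include assertions and error-handling), PCC can be computed with built-ins only:\n",
+ "\n",
+ "```python\n",
+ "def mean(values):\n",
+ "    # FR1: arithmetic mean using only built-ins.\n",
+ "    return sum(values) / len(values)\n",
+ "\n",
+ "def pearson(xs, ys):\n",
+ "    # FR4: PCC for two equal-length lists of numbers.\n",
+ "    mx, my = mean(xs), mean(ys)\n",
+ "    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))\n",
+ "    var_x = sum((x - mx) ** 2 for x in xs)\n",
+ "    var_y = sum((y - my) ** 2 for y in ys)\n",
+ "    return cov / (var_x * var_y) ** 0.5\n",
+ "\n",
+ "print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 for perfectly correlated lists\n",
+ "```\n",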
+ "\n",
+ "\n",
+ "### <b> Development Process </b>\n",
+ "\n",
+ "My development process made use of the task’s inherent structure, allowing me to plan, develop and test each FR independently before combining them as needed. This was especially useful for the more complex FRs, which required significant iteration and testing before achieving the desired results.\n",
+ "\n",
+ "I used a modified CRISP-DM approach: understanding the requirements, then cycling through iterations of pseudocode, Python code and testing until achieving the desired results. I found it very effective, but I also found that I can occasionally go “off-piste” in the iterations, which can be time-consuming, frustrating and ultimately less productive.\n",
+ "\n",
+ "I made conscious use of “new-to-me” tools and techniques such as Git, VS Code, Jupyter notebooks and Markdown.\n",
+ "\n",
+ "### <b> Code Evaluation </b>\n",
+ "Overall, I am pleased with my code - the functions achieve the requirements (as interpreted) and they <i>feel</i> efficient and robust.\n",
+ "\n",
+ "Principles I kept in mind when writing functions:\n",
+ "\n",
+ "* Future-proofed: generic, flexible and adaptable, to allow reusability\n",
+ "* User-friendly, by adding assertions and error-handling\n",
+ "* Unambiguous, self-explanatory naming of functions and variables\n",
+ "* Helpful comments/docstrings, balancing approaches like DRY (Don’t Repeat Yourself), WET (Write Everything Twice) and KISS (Keep It Simple, Stupid)\n",
+ "\n",
+ "#### <b> Strengths </b>\n",
+ "* Well-commented, functioning code\n",
+ "* Consistent Git use for version control\n",
+ "* Kept working notes\n",
+ "\n",
+ "#### <b> Improvements / To-do </b>\n",
+ "* Perhaps over-commented; erred on the side of caution\n",
+ "* Establish a preferred naming convention – camelCase, snake_case\n",
+ "* Learn Python conventions\n",
+ "* Don’t get side-tracked when testing\n",
+ "\t* Update pseudocode\n",
+ "\t\n",
+ "[Archived reflective notes by task](archived/Task1_FR_reflections.md)\n",
+ "\n",
+ "\n",
+ "#### <b> Summary </b>\n",
+ "I found this task both appealing and beneficial. It allowed me to build a useful function from the ground up, making use of different Python coding techniques and data structures whilst also employing version control and applying appropriate metadata to the code.\n",
+ "\n",
+ "I am super-keen to keep learning for my personal and professional development, picking up best practice and standard approaches, and avoiding pitfalls. This task allowed me to practise all of this.\n",
+ "\n",
+ "When it comes to Python, I am amazed at the many possible ways of solving the same problem – this can make it challenging to identify the ‘best approach’, if one exists. This is something I will need to get used to and embrace.\n",
+ "\n",
+ "\n"
+ ]
+ },
 {
 "attachments": {},
 "cell_type": "markdown",
diff --git a/Task2 supplementary.ipynb b/Task2 supplementary.ipynb
index 12f2e7b..38f9454 100644
--- a/Task2 supplementary.ipynb
+++ b/Task2 supplementary.ipynb
@@ -681,7 +681,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.11.0"
+ "version": "3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]"
 },
 "orig_nbformat": 4,
 "vscode": {
--
GitLab