Commit f963367c authored by zoonalink

Task1 - updated formatting

parent c6505e7e
%% Cell type:markdown id: tags:
# UFCFVQ-15-M Programming for Data Science (Autumn 2022)
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">OVERALL COURSEWORK MARK: ___%</p>
%% Cell type:markdown id: tags:
### GitLab link submission, README.md file and Git commit messages
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
# Programming Task 1
## Student Id: 05976423
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR1</font> - Develop a function to find the arithmetic mean
%% Cell type:code id: tags:
``` python
# list of numbers (named `numbers` so the built-in `list` is not shadowed)
numbers = [85, 29, 35, 55, 82, 45, 42, 21, 42, 60, 56, 30, 72, 56, 37, 65, 29, 14, 66, 43, 23, 39, 81, 56, 74, 29, 22, 27, 14, 66, 55, 33, 31, 66, 63, 41, 30, 48, 68, 58, 51, 44, 66, 34, 20, 71, 59, 57, 43, 48]

def FR1_mean(numbers):
    '''
    Function to calculate arithmetic mean - i.e. sum of data divided by number of data points.
    '''
    try:
        #print(sum(numbers) / len(numbers))
        return sum(numbers) / len(numbers) # sum of list divided by number of elements in list
    except ZeroDivisionError:
        print("Error: Division by zero. List is empty")
    except TypeError:
        print("Error: Invalid type in list. List must contain only numbers")
    except Exception:
        print("Error with list of numbers. Please check list")

FR1_mean(numbers)
```
%% Output
47.62
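`FR1_mean` prints a message and implicitly returns `None` when the input is empty or non-numeric, which a caller can easily miss. A minimal sketch of an alternative, assuming it is acceptable to raise an exception instead; the name `mean_strict` is hypothetical and not part of the notebook:

```python
def mean_strict(values):
    """Return the arithmetic mean of a non-empty sequence of numbers.

    Hypothetical variant of FR1_mean: raises instead of printing, so the
    caller cannot silently receive None.
    """
    if len(values) == 0:
        raise ValueError("cannot compute the mean of an empty sequence")
    return sum(values) / len(values)


print(mean_strict([1, 2, 3, 4]))  # 2.5
```

A non-numeric element still surfaces as the `TypeError` raised by `sum`, so no extra handling is needed for that case.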
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR2</font> - Develop a function to read a single column from a CSV file
%% Cell type:code id: tags:
``` python
def FR2_data_from_column(filename, columnNum, delimiter = ',', header = True):
    '''
    Function with two mandatory parameters (filename and column number (0 to n-1)) and two optional parameters (delimiter and header). The function will return a list of data from a specified column and the column name (if header is True). If header is False, the function will return only the list of data.
    '''
    try:
        with open(filename) as openFile:
            if header:
                # first line holds the column names; strip the line ending so the last column is clean
                variable = next(openFile).rstrip('\n').split(delimiter)[columnNum]
                data = [line.rstrip('\n').split(delimiter)[columnNum] for line in openFile]
                return variable, data
            else:
                return [line.rstrip('\n').split(delimiter)[columnNum] for line in openFile]
    except FileNotFoundError:
        print("Error: File not found. Please check file name, extension and path")
    except Exception:
        print("Error with file. Please check file")
```
%% Cell type:code id: tags:
``` python
# Test function specifying task1.csv and first column. Header is True by default, so data and variable (column name) are returned.
variable, data = FR2_data_from_column('task1.csv', 0)
print(variable)
print(data[0:10]) # print first 10 elements of data list
```
%% Output
age
['16', '27', '26', '25', '29', '29', '22', '35', '44', '31']
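The FR2 function returns either a tuple or a bare list depending on `header`, so callers have to unpack it differently in each case. A minimal sketch of a variant with a single return shape, assuming a `(name, values)` pair with `name` set to `None` when there is no header; `read_column` and the demo file are hypothetical, and `tempfile`/`os` are used only to build a throwaway CSV for the demonstration:

```python
import os
import tempfile


def read_column(filename, column_num, delimiter=",", header=True):
    """Return (name, values) for one column; name is None when header is False."""
    with open(filename) as f:
        # strip line endings and skip blank lines before splitting
        rows = [line.rstrip("\n").split(delimiter) for line in f if line.strip()]
    name = rows.pop(0)[column_num] if header else None
    return name, [row[column_num] for row in rows]


# demonstration on a small temporary file (not the coursework data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w") as f:
        f.write("age,pop\n16,100\n27,200\n")
    demo_name, demo_values = read_column(path, 1)

print(demo_name, demo_values)  # pop ['100', '200']
```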
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR3</font> - Develop a function to read CSV data from a file into memory
%% Cell type:code id: tags:
``` python
def FR3_read_csv_into_dictionary(filename, delimiter = ','):
    '''
    Function with one mandatory parameter (filename) to read csv file and return a dictionary with column names as keys and data as values. Default delimiter is comma. Assumes that the first line of the file contains the column names (i.e. header) which become the dictionary keys.
    '''
    try:
        with open(filename) as openFile: # open file
            # read first line and split into list - to get dictionary keys
            # (the last key keeps its trailing newline, as the 'uni_education_25+\n' key in the FR5 output shows)
            variable = next(openFile).split(delimiter)
            # read remaining lines and split into list of lists - to get corresponding dictionary values
            data = [line.split(delimiter) for line in openFile]
            # create dictionary with keys and values (as float) by iterating through variable list and data list
            variable_data_dict = {variable[i]: [float(row[i]) for row in data] for i in range(len(variable))}
            return variable_data_dict
    except FileNotFoundError:
        print("Error: File not found. Please check file name, extension and path")
```
%% Cell type:code id: tags:
``` python
my_dict = FR3_read_csv_into_dictionary('task1.csv')
print(my_dict['age'][0:10]) # print first 10 elements of 'age' column
```
%% Output
[16.0, 27.0, 26.0, 25.0, 29.0, 29.0, 22.0, 35.0, 44.0, 31.0]
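Because the header line is split without removing its line ending, the last dictionary key carries a trailing newline (visible as `'uni_education_25+\n'` in the FR5 output). A minimal sketch of one way to avoid that, assuming every data row has the same number of fields as the header; `read_csv_columns` and the demo file are hypothetical, and `tempfile`/`os` only create a throwaway CSV:

```python
import os
import tempfile


def read_csv_columns(filename, delimiter=","):
    """Read a CSV with a header row into {column name: list of floats}."""
    with open(filename) as f:
        # drop line endings up front so neither keys nor values carry '\n'
        lines = [line.rstrip("\n") for line in f if line.strip()]
    headers = lines[0].split(delimiter)
    rows = [line.split(delimiter) for line in lines[1:]]
    columns = zip(*rows)  # transpose rows into columns
    return {name: [float(v) for v in col] for name, col in zip(headers, columns)}


# demonstration on a small temporary file (not the coursework data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    with open(path, "w") as f:
        f.write("age,uni_education_25+\n16,0.5\n27,0.25\n")
    demo_dict = read_csv_columns(path)

print(demo_dict)  # {'age': [16.0, 27.0], 'uni_education_25+': [0.5, 0.25]}
```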
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR4</font> - Develop a function to calculate the Pearson Correlation Coefficient for two named columns
%% Cell type:code id: tags:
``` python
def FR4_pearsonCorrCoef(x, y):
    '''
    Function to calculate the Pearson Correlation Coefficient (PCC), often represented by the letter 'r'. PCC is a measure of linear correlation between two variables between -1 and 1. A value of 1 indicates a perfect positive linear relationship; a value of -1 indicates a perfect negative linear relationship; and a value of 0 indicates no linear relationship.
    The function takes two lists of numbers as input and returns a single value - the Pearson Correlation Coefficient. The function will return None if the lists are not the same length or if the lists contain non-numerical values.
    '''
    # Check that x and y are non-empty lists of the same length
    try:
        assert isinstance(x, list)
        assert isinstance(y, list)
        assert len(x) == len(y)
        assert len(x) > 0
    except AssertionError:
        print("Error: x and y MUST be same-length lists of only numbers in order to calculate Pearson's Correlation Coefficient")
        return None
    # Calculate mean of x and y
    avg_x = FR1_mean(x)
    avg_y = FR1_mean(y)
    # Calculate (population) standard deviation of x and y
    stdx = (sum([(xi - avg_x)**2 for xi in x]) / len(x)) ** 0.5
    stdy = (sum([(yi - avg_y)**2 for yi in y]) / len(y)) ** 0.5
    # returns list of tuples with x, y and PCC values if required
    #PCCs = [(x[i] - avg_x) * (y[i] - avg_y) / (stdx * stdy) for i in range(len(x))]
    #return [(x[i],y[i],PCCs[i]) for i in range(len(x))]
    # Calculate Pearson Correlation Coefficient for lists x and y
    r = FR1_mean([(x[i] - avg_x) * (y[i] - avg_y) for i in range(len(x))]) / (stdx * stdy)
    return r
```
%% Cell type:code id: tags:
``` python
#testing FR4_pearsonCorrCoef function against numpy corrcoef
x = [1, 2, 3, 5]
y = [1, 5, 7, 8]
print(FR4_pearsonCorrCoef(x, y))
import numpy as np
print(np.corrcoef(x, y))
```
%% Output
0.8984458631125747
[[1.         0.89844586]
 [0.89844586 1.        ]]
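The FR4 calculation divides the mean of the products of deviations by the product of two population standard deviations. The `1/n` factors cancel, so the same value can be computed from three sums. A minimal sketch of that form, with an explicit guard for a constant column (which FR4 would hit as a division by zero); `pearson_r` is a hypothetical name:

```python
def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-empty numeric lists."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]  # deviations from the mean
    dy = [yi - mean_y for yi in y]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sum_xy / (sum_xx * sum_yy) ** 0.5


print(round(pearson_r([1, 2, 3, 5], [1, 5, 7, 8]), 5))  # 0.89845
```

For the test lists used above, the sums are 14.25, 8.75 and 28.75, giving 14.25 / sqrt(251.5625), which agrees with the value printed by `FR4_pearsonCorrCoef` and `np.corrcoef`.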
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR5</font> - Develop a function to generate a set of Pearson Correlation Coefficients for a given data file
%% Cell type:code id: tags:
``` python
def FR5_PCCs_from_csv(filename):
    '''
    Function to calculate Pearson Correlation Coefficient (PCC) for all combinations of columns in a csv file, where each column is a variable, with column header as variable name.
    '''
    # Read csv file into variable as dictionary
    my_dict = FR3_read_csv_into_dictionary(filename)
    # Iterate through dictionary to calculate PCC for all combinations of variables, using FR4_pearsonCorrCoef function
    PCC_list_of_tuples = [(variable, variable2, round(FR4_pearsonCorrCoef(my_dict[variable], my_dict[variable2]), 5))
                          for variable in my_dict
                          for variable2 in my_dict]
    #print(PCC_list_of_tuples)
    return PCC_list_of_tuples
```
%% Cell type:code id: tags:
``` python
FR5_output = FR5_PCCs_from_csv('task1.csv')
print(FR5_output)
```
%% Output
[('age', 'age', 1.0), ('age', 'pop', -0.02671), ('age', 'share_white', 0.19961), ('age', 'share_black', -0.08807), ('age', 'share_hispanic', -0.13679), ('age', 'personal_income', 0.03248), ('age', 'household_income', 0.07123), ('age', 'poverty_rate', -0.11502), ('age', 'unemployment_rate', -0.08924), ('age', 'uni_education_25+\n', -0.01555), ('pop', 'age', -0.02671), ('pop', 'pop', 1.0), ('pop', 'share_white', 0.07551), ('pop', 'share_black', -0.1562), ('pop', 'share_hispanic', 0.06195), ('pop', 'personal_income', 0.20486), ('pop', 'household_income', 0.30517), ('pop', 'poverty_rate', -0.29133), ('pop', 'unemployment_rate', -0.21784), ('pop', 'uni_education_25+\n', 0.11698), ('share_white', 'age', 0.19961), ('share_white', 'pop', 0.07551), ('share_white', 'share_white', 1.0), ('share_white', 'share_black', -0.54497), ('share_white', 'share_hispanic', -0.57744), ('share_white', 'personal_income', 0.35839), ('share_white', 'household_income', 0.32212), ('share_white', 'poverty_rate', -0.49771), ('share_white', 'unemployment_rate', -0.38967), ('share_white', 'uni_education_25+\n', 0.33416), ('share_black', 'age', -0.08807), ('share_black', 'pop', -0.1562), ('share_black', 'share_white', -0.54497), ('share_black', 'share_black', 1.0), ('share_black', 'share_hispanic', -0.26242), ('share_black', 'personal_income', -0.28248), ('share_black', 'household_income', -0.34674), ('share_black', 'poverty_rate', 0.43067), ('share_black', 'unemployment_rate', 0.48363), ('share_black', 'uni_education_25+\n', -0.21296), ('share_hispanic', 'age', -0.13679), ('share_hispanic', 'pop', 0.06195), ('share_hispanic', 'share_white', -0.57744), ('share_hispanic', 'share_black', -0.26242), ('share_hispanic', 'share_hispanic', 1.0), ('share_hispanic', 'personal_income', -0.22313), ('share_hispanic', 'household_income', -0.13596), ('share_hispanic', 'poverty_rate', 0.20829), ('share_hispanic', 'unemployment_rate', 0.01475), ('share_hispanic', 'uni_education_25+\n', -0.29098), ('personal_income', 'age', 0.03248), ('personal_income', 'pop', 0.20486), ('personal_income', 'share_white', 0.35839), ('personal_income', 'share_black', -0.28248), ('personal_income', 'share_hispanic', -0.22313), ('personal_income', 'personal_income', 1.0), ('personal_income', 'household_income', 0.83196), ('personal_income', 'poverty_rate', -0.69592), ('personal_income', 'unemployment_rate', -0.50493), ('personal_income', 'uni_education_25+\n', 0.71661), ('household_income', 'age', 0.07123), ('household_income', 'pop', 0.30517), ('household_income', 'share_white', 0.32212), ('household_income', 'share_black', -0.34674), ('household_income', 'share_hispanic', -0.13596), ('household_income', 'personal_income', 0.83196), ('household_income', 'household_income', 1.0), ('household_income', 'poverty_rate', -0.75418), ('household_income', 'unemployment_rate', -0.51), ('household_income', 'uni_education_25+\n', 0.6729), ('poverty_rate', 'age', -0.11502), ('poverty_rate', 'pop', -0.29133), ('poverty_rate', 'share_white', -0.49771), ('poverty_rate', 'share_black', 0.43067), ('poverty_rate', 'share_hispanic', 0.20829), ('poverty_rate', 'personal_income', -0.69592), ('poverty_rate', 'household_income', -0.75418), ('poverty_rate', 'poverty_rate', 1.0), ('poverty_rate', 'unemployment_rate', 0.59169), ('poverty_rate', 'uni_education_25+\n', -0.46034), ('unemployment_rate', 'age', -0.08924), ('unemployment_rate', 'pop', -0.21784), ('unemployment_rate', 'share_white', -0.38967), ('unemployment_rate', 'share_black', 0.48363), ('unemployment_rate', 'share_hispanic', 0.01475), ('unemployment_rate', 'personal_income', -0.50493), ('unemployment_rate', 'household_income', -0.51), ('unemployment_rate', 'poverty_rate', 0.59169), ('unemployment_rate', 'unemployment_rate', 1.0), ('unemployment_rate', 'uni_education_25+\n', -0.46639), ('uni_education_25+\n', 'age', -0.01555), ('uni_education_25+\n', 'pop', 0.11698), ('uni_education_25+\n', 'share_white', 0.33416), ('uni_education_25+\n', 'share_black', -0.21296), ('uni_education_25+\n', 'share_hispanic', -0.29098), ('uni_education_25+\n', 'personal_income', 0.71661), ('uni_education_25+\n', 'household_income', 0.6729), ('uni_education_25+\n', 'poverty_rate', -0.46034), ('uni_education_25+\n', 'unemployment_rate', -0.46639), ('uni_education_25+\n', 'uni_education_25+\n', 1.0)]
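The FR5 output lists every pair twice, because the coefficient is symmetric: `('age', 'pop')` and `('pop', 'age')` carry the same value. A minimal sketch of computing each unordered pair once and mirroring it, assuming the correlation function is passed in as a parameter; `pairwise_pccs` and `demo_corr` are hypothetical names, and the diagonal is set to 1.0 on the assumption that no column is constant:

```python
def demo_corr(x, y):
    """Small stand-in for a Pearson correlation function, used only in this demo."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / (sum_xx * sum_yy) ** 0.5


def pairwise_pccs(columns, corr):
    """Return (name_a, name_b, r) for all ordered pairs, calling corr once per unordered pair."""
    names = [name for name in columns]
    results = {}
    for i, a in enumerate(names):
        results[(a, a)] = 1.0  # assumption: column a is not constant
        for b in names[i + 1:]:
            r = round(corr(columns[a], columns[b]), 5)
            results[(a, b)] = r
            results[(b, a)] = r  # mirror instead of recomputing
    return [(a, b, results[(a, b)]) for a in names for b in names]


demo_pairs = pairwise_pccs({"x": [1, 2, 3, 5], "y": [1, 5, 7, 8]}, demo_corr)
print(demo_pairs)
# [('x', 'x', 1.0), ('x', 'y', 0.89845), ('y', 'x', 0.89845), ('y', 'y', 1.0)]
```

For the ten columns in the FR5 output this would mean 45 correlation calls instead of 100, while returning a list of the same shape.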
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
### <font color = 'orange'>Requirement FR6</font> - Develop a function to print a custom table
%% Cell type:code id: tags:
``` python
# Function to calculate max column width, used in FR6
def max_col_width(tup_list):
    '''
    Function to calculate the maximum column width for a list of tuples'''
    max_cols = 0
    for row in tup_list:
        max_cols = max(max_cols, len(row))
    col_widths = [0] * max_cols
    for row in tup_list:
        for col, value in enumerate(row):
            col_widths[col] = max(col_widths[col], len(str(value)))
    return max(col_widths)
```
%% Cell type:code id: tags:
``` python
def FR6_print_table(tup_list, *col_headers, pad_char = '*'):
    '''
    Function which takes a list of tuples, columns to include (as *arguments) and optional single padding character (defaulted to '*') as parameters. The padding character is used to create a table with a border.
    '''
    # if no column headers are provided, use all unique column headers from tup_list
    if not col_headers:
        col_headers = sorted(set([x[0] for x in tup_list]))
    # create list of unique row headers (same as cols)
    row_headers = col_headers
    # calculate maximum column width in the data
    max_width = int(max_col_width(tup_list) * 1.9)
    # create table string with top border based on padding character and maximum column width
    table_str = ' ' * int(max_width/2) + pad_char * (max_width * (len(col_headers))) + pad_char * (len(col_headers)+1)+ '\n'
    table_str += ' ' * int(max_width/2)
    # add column headers to table string, using padding character and maximum column width
    for col in col_headers:
        table_str += f"{col:^{(max_width)+1}}"
    table_str += '\n'
    table_str += ' ' * int(max_width/2) + pad_char * (max_width * (len(col_headers))) + pad_char * (len(col_headers)+1)+'\n'
    # add row headers and values to table string, using padding character and maximum column width
    for row in row_headers:
        table_str += f"{row:<{int(max_width/2)}}"+pad_char
        # Get the corresponding value (3rd element of tuple) for the current row and column; if no value, use '-'
        for col in col_headers:
            r_val = next((x[2] for x in tup_list if x[0] == col and x[1] == row), '-')
            # if value is positive, add a space to the left of the value to keep the table aligned
            # (the '-' placeholder is a string, so it must not be compared with 0)
            if r_val != '-' and r_val >= 0:
                table_str += f" {r_val:^{max_width-1}}" + pad_char
            else:
                table_str += f"{r_val:^{max_width}}" + pad_char
        table_str += '\n'
    # add bottom border to table string, using padding character and maximum column width
    table_str += ' ' * int(max_width/2) + pad_char * max_width * len(col_headers) + pad_char * (len(col_headers)+1)+ '\n\n'
    # add caption for table
    table_str += ' ' * int(max_width/2) + "Pearson's Correlation Coefficient for %s" % (col_headers,)
    # print table string
    print(table_str)
```
%% Cell type:code id: tags:
``` python
FR6_print_table(FR5_output, 'age', 'poverty_rate', 'household_income', pad_char = '-')
```
%% Output
                 ----------------------------------------------------------------------------------------------------------
                                 age                           poverty_rate                     household_income
                 ----------------------------------------------------------------------------------------------------------
age              -                1.0               -             -0.11502             -              0.07123             -
poverty_rate     -             -0.11502             -                1.0               -             -0.75418             -
household_income -              0.07123             -             -0.75418             -                1.0               -
                 ----------------------------------------------------------------------------------------------------------

                 Pearson's Correlation Coefficient for ('age', 'poverty_rate', 'household_income')
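`FR6_print_table` sizes every cell from one scaled maximum width (`max_col_width(...) * 1.9`) and scans the whole tuple list for each cell. A minimal sketch of a simpler layout, assuming the table is returned as a string rather than printed, that cell width is taken from the longest header or value, and that lookups go through a dictionary; `format_table` is a hypothetical name and its layout is not identical to the output above:

```python
def format_table(triples, headers, pad_char="*"):
    """Build a bordered table of r values for the given headers; missing pairs show '-'."""
    lookup = {(a, b): r for a, b, r in triples}
    label_width = max(len(h) for h in headers)
    texts = [str(r) for r in lookup.values()] + [h for h in headers]
    cell_width = max(len(t) for t in texts) + 2  # one space of padding each side
    border = " " * (label_width + 1) + pad_char * ((cell_width + 1) * len(headers) + 1)
    lines = [border]
    lines.append(" " * (label_width + 2) + " ".join(f"{h:^{cell_width}}" for h in headers))
    lines.append(border)
    for row in headers:
        cells = []
        for col in headers:
            value = str(lookup.get((col, row), "-"))  # '-' when the pair is absent
            cells.append(f"{value:^{cell_width}}")
        lines.append(f"{row:<{label_width}} {pad_char}" + pad_char.join(cells) + pad_char)
    lines.append(border)
    return "\n".join(lines)


demo_table = format_table([("a", "a", 1.0), ("a", "b", -0.5), ("b", "a", -0.5)], ["a", "b"])
print(demo_table)
```

Converting each value to a string before centring it means positive numbers, negative numbers and the `'-'` placeholder all follow the same formatting path.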
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
# Coding Standards
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>
%% Cell type:markdown id: tags:
# Process Development Report for Task 1
%% Cell type:markdown id: tags:
### <font color = 'gold'><b> Introduction </b></font>
The purpose of this report is to provide a short, critical self-assessment of my code development process for Task 1 of the coursework for `UFCFVQ-15-M Programming_for_Data_Science`.
### <b><font color = 'gold'> Code Description </font></b>
Task 1 requires writing a sequence of functions that ultimately calculate Pearson’s Correlation Coefficients (PCCs) for pairs of variables in a given data file, without using imported Python libraries, and print the results as a readable table.
Functional requirements (FRs):
| FR | Description |
|-----|-----------------------|
| FR1 | Arithmetic mean |
| FR2 | Read column from file |
| FR3 | Read file |
| FR4 | PCC for two lists |
| FR5 | PCC for file |
| FR6 | Print table |
The code was developed in a Jupyter notebook using a Python 3.11 kernel.
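To illustrate how FR1 and FR4 fit together, here is a minimal sketch of an arithmetic mean feeding a Pearson's Correlation Coefficient calculation using only built-ins. It is not the notebook's actual implementation: the function names `mean` and `pearson` and the specific error checks are assumptions made for this example.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers (hypothetical helper)."""
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson's Correlation Coefficient for two equal-length lists (illustrative only)."""
    if len(x) != len(y):
        raise ValueError("both lists must have the same length")
    if len(x) < 2:
        raise ValueError("at least two paired values are required")
    mean_x, mean_y = mean(x), mean(y)
    # Numerator: sum of the products of paired deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations for each list
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (var_x * var_y) ** 0.5


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated lists
```

Building the coefficient from a separate `mean` helper mirrors the ordering of the FRs, where each function can be tested on its own before being combined.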
### <b><font color = 'gold'>Development Process</font></b>
My development process made use of the task’s inherent structure, allowing me to plan, develop and test each FR independently, before combining them as needed. This was especially useful for the more complex FRs, which required significant iteration and testing before achieving the desired results.
I used a modified CRISP-DM approach: understanding the requirements, then cycling through iterations of pseudocode, Python code and testing until achieving the desired results. I found it very effective, but also that I can occasionally go “off-piste” in the iterations, which can be time-consuming, frustrating and ultimately less productive.
![](2022-12-18-23-11-02.png)
I made conscious use of “new-to-me” tools and techniques such as Git, VS Code, Jupyter notebooks and Markdown.
### <b><font color = 'gold'>Code Evaluation </font></b>
Overall, I am pleased with my code: the functions meet the requirements (as I interpreted them) and they <i>feel</i> efficient and robust.
Principles I kept in mind when writing functions:
* Future-proofed: generic, flexible, adaptable to allow reusability
* User-friendly, by adding assertions and error-handling
* Unambiguous, self-explanatory naming of functions and variables
* Helpful comments/docstrings, balancing approaches such as DRY (Don’t Repeat Yourself), WET (Write Everything Twice) and KISS (Keep it Simple, Stupid)
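As a rough illustration of the “user-friendly” and “self-explanatory naming” principles, the sketch below reads one numeric column from comma-separated text and raises descriptive errors on bad input. It is a hypothetical example rather than my FR2 code: the name `read_column`, its parameters and the comma delimiter are all assumptions.

```python
def read_column(lines, column_number, has_header=True):
    """Return (header, values) for one column of comma-separated text.

    `lines` is any iterable of strings, such as an open file object.
    `column_number` is 1-based. All names here are hypothetical.
    """
    if not isinstance(column_number, int) or column_number < 1:
        raise ValueError("column_number must be a positive integer")
    header = None
    values = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        fields = line.split(",")
        if column_number > len(fields):
            raise IndexError(f"line {line_number} has only {len(fields)} column(s)")
        field = fields[column_number - 1].strip()
        if has_header and header is None:
            header = field  # first non-blank line supplies the column name
            continue
        try:
            values.append(float(field))
        except ValueError:
            # Report where the bad value is instead of a bare conversion error
            raise ValueError(f"line {line_number}: {field!r} is not a number") from None
    return header, values


sample = ["age,poverty_rate\n", "30,12.5\n", "\n", "41,9.0\n"]
print(read_column(sample, 2))
```

Because the function accepts any iterable of lines, it can be tested with a plain list as above, or given an open file object when reading real data.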
#### <b><font color = 'gold'> Strengths</font> </b>
* Well-commented, functioning code
* Consistent Git use for version control
* Kept working notes
#### <b> <font color = 'gold'>Improvements / To-do </font></b>
* Perhaps over-commented; erred on the side of caution
* Establish a preferred naming convention: camelCase or snake_case
* Learn Python conventions
* Don’t get side-tracked when testing
* Update pseudocode
[Archived reflective notes by task](archived/Task1_FR_reflections.md)
#### <b> <font color = 'gold'>Summary</font> </b>
I found this task both appealing and beneficial. It allowed me to build a useful function from the ground up, making use of different Python coding techniques and data structures whilst also employing version control and applying appropriate metadata to the code.
I am super-keen to keep learning for my personal and professional development, picking up best practice and standard approaches and avoiding pitfalls. This task allowed me to practise all of this.
When it comes to Python, I am amazed at the many ways of solving the same problem; this can make it challenging to identify the ‘best approach,’ if it exists. This is something I will need to get used to and embrace.
%% Cell type:markdown id: tags:
<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>