diff --git a/.DS_Store b/.DS_Store index 3d87cccbdea95b096307a99f5b8ed77a23c1e8ad..201306903d594c5815e4f6d32f437ef9cf45f97d 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/.gitingore b/.gitignore similarity index 100% rename from .gitingore rename to .gitignore diff --git a/Activity1/.DS_Store b/Activity1/.DS_Store index 793481ebcccf36d2fa2d0669735671d9392a39df..97a85b5ad211acd81c2c0a7d9a8772405a0c2638 100644 Binary files a/Activity1/.DS_Store and b/Activity1/.DS_Store differ diff --git a/Activity1/ACTIVITY1.md b/Activity1/ACTIVITY1.md new file mode 100644 index 0000000000000000000000000000000000000000..20e47b7fc6ed3b199a11b6145a041b100c2c053a --- /dev/null +++ b/Activity1/ACTIVITY1.md @@ -0,0 +1,366 @@ +# Advanced Databases UFCFU3-15-3 Activity 1 +## Activity 1 [Design](#design), [Implementation](#implementation), & [Queries](#queries) + + +## Technical Specification + +This section outlines the technologies implemented in the development of Activity 1, an SQL-based database management system for a football agent managing their clients. + +### Visual Paradigm + +**Category:** Modelling and Design Tool + +**Rationale for Selection:** VP’s comprehensive design and modelling features, supporting Entity Relationship Diagrams. + +**Role in the Project:** Utilised for the design of the database. +# +### Python + +**Category:** Programming Language + +**Rationale for Selection:** Simplicity, versatility, and extensive support of data manipulation and database interaction libraries. + +**Role in the Project:** As the core programming language for data imports, scripting, and automating data operations. +# +### PostgreSQL + +**Category:** Database Management System + +**Rationale for Selection:** Reliability, scalability, and strong feature set, including support for complex queries and transactional integrity. + +**Role in the Project:** Serves as the foundational database engine, responsible for the secure storage, retrieval, and integrity of data. 
+# +### psycopg2 + +**Category:** PostgreSQL Database Adapter + +**Rationale for Selection:** High performance, full features, and compatibility with the PostgreSQL database. + +**Role in the Project:** Facilitates direct interaction with the PostgreSQL database, enabling efficient execution of SQL queries, data insertion, and retrieval. +# +### SQLAlchemy + +**Category:** Toolkit and Object-Relational Mapping (ORM) Library + +**Rationale for Selection:** Suite of tools for SQL database interaction using Python. + +**Role in the Project:** Manages database schema and operations through ORM, allowing for streamlined interaction with PostgreSQL. +# +### pandas + +**Category:** Data Manipulation Library + +**Rationale for Selection:** Powerful data structures and functions for efficient data manipulation and analysis. + +**Role in the Project:** Utilised for data cleaning, transformation, and analysis, facilitating data preprocessing. + +## Design + +### Entity Relationship Diagram (ERD) + + +## Implementation + +The implementation of Activity 1 sets up a PostgreSQL database with a focus on optimisation, data integrity, and efficient data management. SQLAlchemy provides the object-relational mapping, and the tables follow the database schema from the Design section to ensure data normalisation and integrity. Configuration and database setup are split into separate modules for scalability and maintainability. Scripts for data import and normalisation use pandas for data manipulation and cleaning, streamlining the integration of player data into the database. 
+ +### Activity 1 Directory Tree: +``` +Activity1 +├── manage.py +└── src + ├── __init__.py + ├── config.py + ├── data_import.py + ├── database_setup.py + ├── db.py + ├── drop_database.py + ├── models.py + ├── normalisation.py + └── queries.py +data +└── players_data.csv +``` + +### *[config.py](src/config.py)* : + +Contains a single configuration variable, DATABASE_URI, which defines the connection string for the PostgreSQL database. The string names the database system (PostgreSQL) and psycopg2 as the database adapter. + +### *[db.py](src/db.py)* : + +Crucial for the database system's operation, facilitating database connection and table creation. It establishes a database connection using SQLAlchemy's create_engine function, utilising the DATABASE_URI from config.py. `create_all_tables()` uses the Base object from models.py, calling Base.metadata.create_all(engine) to create all tables defined in the models module according to the metadata. + + +### *[database_setup.py](src/database_setup.py)* : + +Responsible for initiating the database setup, specifically for creating tables. It does this by importing a function named create_all_tables from the db module and calling this function. After successfully creating the tables, it prints a confirmation message. + + +### *[data_import.py](src/data_import.py)* : + +Imports data into the database, leveraging SQLAlchemy for database interactions and pandas for data manipulation. It uses SQLAlchemy's sessionmaker to create a database session bound to the engine provided by `db.get_engine()`. This session facilitates transactions with the database. `import_data()` accepts a session and pandas DataFrames (players_df, teams_df, contracts_df) as input. DataFrames are appended to the corresponding tables in the database using the to_sql method with `method='multi'`, which passes multiple rows in each INSERT statement to speed up insertion. It reads the teams and players tables back into pandas DataFrames to get the database-assigned IDs. 
Merges the contracts_df DataFrame with teams and players DataFrames on names to replace the names with their respective IDs. Cleans the contracts_df DataFrame by dropping unnecessary columns and renaming ID columns to player_id and team_id. Inserts the cleaned contracts_df DataFrame into the contracts table in the database. The session is rolled back in case of an exception during the data import process. The main function creates a session, reads and normalises data (presumably from a CSV file) using read_and_normalise_data function from normalisation.py, and then imports the data using import_data. + +### *[normalisation.py](src/normalisation.py)* : + +Handles data normalisation and preparation for database import. `read_and_normalise_data()` serves as the primary mechanism for reading data from a CSV file and transforming it into a format suitable for database insertion. It employs pandas to read the CSV file, specifying windows-1252 encoding to accommodate special characters. Data is then segmented into three separate DataFrames: + +`teams_df`: Extracts team-related information such as team name, location, and manager. It removes duplicate entries and renames columns to match the database schema (name, location, manager). + +`players_df`: Focuses on player-specific information including name, date of birth, gender, and the date they signed up. Ensures uniqueness by considering both name and date of birth, and formats dates to a standard SQL format (YYYY-MM-DD). This DataFrame also handles potential date parsing errors by setting invalid or misformatted dates to null. + +`contracts_df`: Extracts contract details like player name, team name, salary, start date, contract duration, and commission percentage. Converts salary to a numeric format, handling string to float conversion and multiplying by 1000 for correct scale. Dates are normalised similarly to `players_df`. 
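The salary and date handling described for `contracts_df` can be sketched as follows. This is a minimal illustration, not the code in normalisation.py: the function name `normalise_contracts`, the raw column names (`salary_k`, `start_date`), and the `DD/MM/YYYY` input format are assumptions made for the example.

```python
import pandas as pd


def normalise_contracts(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clean salary and start-date columns."""
    out = pd.DataFrame()
    out["player_name"] = raw["player_name"]

    # Assumption: salary is stored as text in thousands (e.g. "12.5").
    # Unparseable values become NaN instead of raising.
    out["salary_per_week"] = pd.to_numeric(raw["salary_k"], errors="coerce") * 1000

    # Assumption: source dates are DD/MM/YYYY. Invalid dates become NaT,
    # which strftime then emits as a missing value (NULL on insert).
    parsed = pd.to_datetime(raw["start_date"], format="%d/%m/%Y", errors="coerce")
    out["start_date"] = parsed.dt.strftime("%Y-%m-%d")
    return out


sample = pd.DataFrame(
    {
        "player_name": ["A", "B"],
        "salary_k": ["12.5", "n/a"],
        "start_date": ["01/07/2022", "not a date"],
    }
)
result = normalise_contracts(sample)
print(result)
```

Coercing instead of raising keeps a single malformed row from aborting the whole import; the missing values can then be filled or rejected before insertion.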
+ +After processing, data frames are filled with default values or corrected to prevent integrity issues when inserting into the database. Exception handling is implemented to manage errors during file reading or data processing, with feedback provided via console messages. + +### *[models.py](src/models.py)* : + +Defines the ORM (Object-Relational Mapping) models for the SQLAlchemy-based database schema, using a declarative base provided by SQLAlchemy's `declarative_base()` function. This module outlines the structure of the database tables through the following classes: + +`Player`: Maps to the players table with fields for player ID, name, date of birth, gender, and the date they signed up. It includes unique and non-unique indices to enhance query performance and enforce data uniqueness on critical attributes. + +`Team`: Represents the teams table, containing fields for team ID, name, location, and manager. A unique constraint combines name, location, and manager to prevent duplicate entries, ensuring that each team is uniquely identified by these three attributes. + +`Contract`: Corresponds to the contracts table, linking players to teams via foreign keys (`player_id` and `team_id`). It includes fields for contract specifics such as salary per week, start date, duration in years, and commission percentage. Indices on foreign keys and start date facilitate efficient querying, particularly for relationships and contract management. + +Relationships between tables are defined using SQLAlchemy's relationship function, allowing seamless navigation between connected data points. This setup ensures that the database schema is not only optimised for performance but also adheres to best practices in database design. + +### *[drop_database.py](src/drop_database.py)* : + +Manages the removal of database tables, specifically designed to facilitate a clean slate by dropping the contracts, players, and teams tables from the database. 
The module utilises SQLAlchemy's engine and connection methods to directly interact with the database. + +## Queries + +The *[queries.py](src/queries.py)* file serves as a central script for executing the 5 required SQL queries leveraging SQLAlchemy to execute and manage database queries securely. + +- Database Engine Access: + - The file imports `get_engine` from the *[db.py](src/db.py)* module, which provides the necessary database engine configured for accessing the database. +- Query Execution: + - Each function within this file corresponds to a query that targets specific data retrieval from the database. + - SQL queries are written in a textual format using SQLAlchemy's text function to ensure that they are safely constructed and executed against the database. This prevents SQL injection risks and enhances security. + - The engine connection is opened and managed using a context manager (with statement), ensuring that connections are properly closed after executing the queries, managing resources efficiently. +- Data Retrieval and Processing: + - The results of each query are fetched and processed within the same function, converting the raw results into a list of names before returning. +- Main Execution Block: + - The file contains an executable block that, when run directly, executes all the defined functions and prints out their results. + +## Query Implementations + +#### 1. List all the players that have a contract expiring within the next 12 months + +``` +def list_players_contract_expiring_next_12_months(): + + query = text(""" + SELECT p.name + FROM players p + JOIN contracts c ON p.id = c.player_id + WHERE c.start_date + (c.duration_years * INTERVAL '1 year') < (CURRENT_DATE + INTERVAL '1 year') + AND c.start_date + (c.duration_years * INTERVAL '1 year') > CURRENT_DATE + """) + + engine = get_engine() + with engine.connect() as connection: + result = connection.execute(query) + return [row[0] for row in result.fetchall()] +``` + +1. 
**SELECT Clause**: + - `SELECT p.name`: Specifies that the output should include the name of the player. It fetches the `name` field from the `players` table, which is aliased as `p`. +2. **FROM Clause**: + - `FROM players p`: This specifies the primary table, `players`, from which the data is to be retrieved. The table is given the alias p to simplify referencing its columns in other parts of the query. +3. **JOIN Clause**: + - `JOIN contracts c ON p.id = c.player_id`: This joins the `players` table with the `contracts` table, which is aliased as c. The join condition `p.id = c.player_id` ensures that the data is merged based on the player's ID matching the player ID stored in the contracts. This is a typical foreign key relationship, linking players to their contracts. +4. **WHERE Clause**: + - The `WHERE` clause filters records based on two conditions concerning the contract's start date and duration: + - `c.start_date + (c.duration_years * INTERVAL '1 year') < (CURRENT_DATE + INTERVAL '1 year')`: This condition checks if the end date of the contract is before the date one year from the current date. It calculates the end date by adding the duration of the contract, expressed in years, to the start date of the contract. + - `c.start_date + (c.duration_years * INTERVAL '1 year') > CURRENT_DATE`: This ensures that the calculated end date of the contract is also after the current date. This condition is crucial to avoid selecting contracts that have already expired. +5. **Calculation of Contract Expiry**: + - The key to this query is calculating the contract's expiry date with `c.start_date + (c.duration_years * INTERVAL '1 year')`. This takes the start date of the contract and adds the duration in years. The use of the `INTERVAL '1 year'` is crucial for correctly interpreting the addition in terms of years rather than some other unit. +6. 
**Logical Flow**: + - The query effectively filters for players whose contracts will expire within the next year but have not yet expired. This is particularly useful for management or coaching staff to review upcoming contract renewals or negotiations. + +In summary, this query helps in identifying players who are nearing the end of their contractual commitments within the upcoming year, providing a proactive tool for managing player contracts efficiently. + +#### 2. List all the female players that have had at least three years of service with the same team + +``` +def list_female_players_three_years_service(): + query = text(""" + SELECT p.name + FROM players p + JOIN contracts c ON p.id = c.player_id + WHERE p.gender = 'F' + AND EXTRACT(year FROM age(CURRENT_DATE, c.start_date)) >= 3 + GROUP BY p.name + HAVING COUNT(DISTINCT c.team_id) >= 1 + """) + + engine = get_engine() + with engine.connect() as connection: + result = connection.execute(query) + return [row[0] for row in result.fetchall()] +``` + +1. **SELECT Clause**: + - `SELECT p.name`: Specifies that the output should include the name of the player. It fetches the `name` field from the `players` table, which is aliased as `p`. + +2. **FROM Clause**: + - `FROM players p`: This specifies the primary table, `players`, from which the data is to be retrieved. The table is given the alias `p` to simplify referencing its columns in other parts of the query. + +3. **JOIN Clause**: + - `JOIN contracts c ON p.id = c.player_id`: This joins the `players` table with the `contracts` table, which is aliased as `c`. The join condition `p.id = c.player_id` ensures that the data is merged based on the player's ID matching the player ID stored in the contracts. This is a typical foreign key relationship, linking players to their contracts. + +4. 
**WHERE Clause**: + - The `WHERE` clause filters records based on gender and the time elapsed since each contract's start date: + - `p.gender = 'F'`: Filters to include only female players. + - `EXTRACT(year FROM age(CURRENT_DATE, c.start_date)) >= 3`: Uses the `EXTRACT` function to calculate the number of years from the contract start date to the current date, ensuring that it is at least three years. This is calculated using the `age` function which returns an interval representing the time between the current date and the contract start date. + +5. **GROUP BY Clause**: + - `GROUP BY p.name`: Groups the results by player name. This is necessary for the aggregate function used in the `HAVING` clause to evaluate conditions on a per-player basis. + +6. **HAVING Clause**: + - `HAVING COUNT(DISTINCT c.team_id) >= 1`: Ensures that the grouped results (individual players) have been associated with at least one distinct team ID in the contracts table. This clause is used to filter out any players who do not meet this criterion after the initial group by operation. + +In summary, this query identifies female players who have been with the organization for at least three years, regardless of whether they have switched teams within the organization. It ensures that only those who meet both the criteria of being with the organization for three years and having played in at least one team are included in the results. + +#### 3. List the top 5 players in terms of income generation for the agency + +``` +def list_top_5_income_generating_players(): + query = text(""" + SELECT p.name, SUM(c.salary_per_week * 52 * c.duration_years * c.commission_percentage / 100) AS income + FROM players p + JOIN contracts c ON p.id = c.player_id + GROUP BY p.name + ORDER BY income DESC + LIMIT 5 + """) + + engine = get_engine() + with engine.connect() as connection: + result = connection.execute(query) + return [row[0] for row in result.fetchall()] +``` + +1. 
**SELECT Clause**: + - `SELECT p.name, SUM(c.salary_per_week * 52 * c.duration_years * c.commission_percentage / 100) AS income`: Specifies that the output should include the name of the player and their total income calculated from their contracts. The income is calculated by multiplying the weekly salary (`salary_per_week`) by 52 (the number of weeks in a year), the duration of the contract in years (`duration_years`), and the commission percentage (`commission_percentage`), with the result divided by 100 to adjust for the percentage. + +2. **FROM Clause**: + - `FROM players p`: This specifies the primary table, `players`, from which the data about players is to be retrieved. The table is given the alias `p` to simplify referencing its columns in other parts of the query. + +3. **JOIN Clause**: + - `JOIN contracts c ON p.id = c.player_id`: This joins the `players` table with the `contracts` table, which is aliased as `c`. The join condition `p.id = c.player_id` ensures that the data is merged based on the player's ID matching the player ID stored in the contracts. This is a typical foreign key relationship, linking players to their financial contracts. + +4. **GROUP BY Clause**: + - `GROUP BY p.name`: Groups the results by player name. This grouping is essential for the aggregate function (`SUM`) used in the `SELECT` clause to calculate total income per player. + +5. **ORDER BY Clause**: + - `ORDER BY income DESC`: Orders the results by the calculated income in descending order. This means the players with the highest calculated income appear first in the result set. + +6. **LIMIT Clause**: + - `LIMIT 5`: Limits the results to the top 5 entries. This is used to fetch only the top five players based on the calculated income. + +In summary, this query identifies the top 5 earning players based on the sum of their contracted income over the duration of their contracts, taking into account their weekly salary, the number of years, and their commission percentage. 
It provides a powerful tool for assessing the financial impact of players in terms of their contractual agreements. + +#### 4. List the top 5 players (irrespective of gender) that have the longest association with the agency + +``` +def list_top_5_longest_associated_players(): + query = text(""" + SELECT p.name, MIN(p.date_signed_up) AS signup_date + FROM players p + GROUP BY p.name + ORDER BY signup_date + LIMIT 5 + """) + + engine = get_engine() + with engine.connect() as connection: + result = connection.execute(query) + return [row[0] for row in result.fetchall()] +``` + +1. **SELECT Clause**: + - `SELECT p.name, MIN(p.date_signed_up) AS signup_date`: Specifies that the output should include the name of the player and the earliest signup date among the entries for each player. The `MIN` function is used to find the minimum date on which each player signed up, providing insight into their initial engagement with the organization. + +2. **FROM Clause**: + - `FROM players p`: This specifies the primary table, `players`, from which the data about players is to be retrieved. The table is given the alias `p` to simplify referencing its columns in other parts of the query. + +3. **GROUP BY Clause**: + - `GROUP BY p.name`: Groups the results by player name. This is necessary because the aggregate function `MIN` is used in the `SELECT` clause, which requires grouping to determine the earliest signup date for each player. + +4. **ORDER BY Clause**: + - `ORDER BY signup_date`: Orders the results by the earliest signup date in ascending order. This sorting allows us to see which players were the first to sign up. + +5. **LIMIT Clause**: + - `LIMIT 5`: Limits the results to the top 5 entries. This is used to fetch only the first five players who signed up with the organization. + +In summary, this query identifies the first five players who signed up with the organization by determining the earliest signup dates among all entries. 
It provides valuable insights into the longevity and early adopters within the player roster. + + + +#### 5. Show those players that are at risk of no contract renewal, i.e. less than six months contract remaining + +``` +def list_players_at_risk_no_contract_renewal(): + query = text(""" + SELECT p.name + FROM players p + JOIN contracts c ON p.id = c.player_id + WHERE c.start_date + (c.duration_years * INTERVAL '1 year') < (CURRENT_DATE + INTERVAL '6 months') + AND c.start_date + (c.duration_years * INTERVAL '1 year') > CURRENT_DATE + """) + + engine = get_engine() + with engine.connect() as connection: + result = connection.execute(query) + return [row[0] for row in result.fetchall()] +``` + +1. **SELECT Clause**: + - `SELECT p.name`: Specifies that the output should include the name of the player. It fetches the `name` field from the `players` table, which is aliased as `p`. + +2. **FROM Clause**: + - `FROM players p`: This specifies the primary table, `players`, from which the data is to be retrieved. The table is given the alias `p` to simplify referencing its columns in other parts of the query. + +3. **JOIN Clause**: + - `JOIN contracts c ON p.id = c.player_id`: This joins the `players` table with the `contracts` table, which is aliased as `c`. The join condition `p.id = c.player_id` ensures that the data is merged based on the player's ID matching the player ID stored in the contracts. This is a typical foreign key relationship, linking players to their contracts. + +4. **WHERE Clause**: + - The `WHERE` clause filters records based on the contract's end date relative to the current date and a six-month future window: + - `c.start_date + (c.duration_years * INTERVAL '1 year') < (CURRENT_DATE + INTERVAL '6 months')`: This condition checks if the end date of the contract, calculated by adding the duration of the contract in years to the start date, is before the date six months from now. 
+ - `c.start_date + (c.duration_years * INTERVAL '1 year') > CURRENT_DATE`: This condition ensures that the calculated end date of the contract is also after the current date. This filters out any contracts that have already expired. + +In summary, this query identifies players whose contracts are currently active but will expire within the next six months. It is useful for flagging contracts that need attention for renewal discussions or termination preparations. + +## Query Results + +``` +Players with contracts expiring in the next 12 months: +Name: OB001 +Name: OB022 +Name: OG005 +Name: NG032 +Name: NG222 + +Female players with at least three years of service with the same team: +Name: NG032 +Name: OG005 + +Top 5 income-generating players: +Name: NB009 +Name: NB311 +Name: OB124 +Name: OB022 +Name: NB212 + +Top 5 longest associated players: +Name: OG005 +Name: NG032 +Name: OB022 +Name: NB337 +Name: NG001 + +Players at risk of no contract renewal: +Name: OB022 +Name: NG032 +``` \ No newline at end of file diff --git a/Activity1/images/Football-Agent.png b/Activity1/images/Football-Agent.png new file mode 100644 index 0000000000000000000000000000000000000000..566dd61c83fc317ad0e52bb58e4405ad0768ec19 Binary files /dev/null and b/Activity1/images/Football-Agent.png differ diff --git a/Activity1/src/.DS_Store b/Activity1/src/.DS_Store index 3a3c7b2fef2fb11f1f292c9007c86b7f74ecf3f5..e6b86e05d14cf599624456d4b115dfe74b5e3509 100644 Binary files a/Activity1/src/.DS_Store and b/Activity1/src/.DS_Store differ diff --git a/Activity2/ACTIVITY2.md b/Activity2/ACTIVITY2.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/README.md b/README.md index d07d3cc82503a4178b92e27cebe0f00aeb4cc3a0..94d286c5b1aeb31dddc0549b8e191c9b92596480 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Advanced Databases Activity 1 & 2 +# Advanced Databases UFCFU3-15-3 Activity 1 & 2 ## Description @@ -27,13 +27,26 @@ 
This project uses both PostgreSQL and MongoDB databases. Follow the steps below 3. Update the `DATABASE_URI` configuration in [Activity2/src/config.py](Activity2/src/config.py) with your database name and host. ## Database Setup -1. Run the project using `python manage.py` +1. Run the project using `python manage.py` from the root directory. -## Activities -The project contains the following activities: - -- Activity1: SQL Implementation and required queries -- Activity2: NoSQL Implementation and required queries ## Data The project uses data from `players_data.csv` which contains all data regarding players, teams, and contracts. + + + +## Activities + +### Activity 1: (SQL) + +#### Contents: +SQL implementation and required queries. + +**[Activity 1 Implementation Documentation](Activity1/ACTIVITY1.md)** + +### Activity 2: (NoSQL) + +#### Contents: +NoSQL implementation and required queries. + +**[Activity 2 Implementation Documentation](Activity2/ACTIVITY2.md)** \ No newline at end of file