Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

Overview

PremiershipPlayerAnalysis

Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. Note : My understanding is the squad data on this site can change at any time so your results might be different

Improvement : Calculate age to finer degree than just years

The was developed in Jupyter Notebook and this walkthrough willl assume you are doing the same

Once you have ran the scraping

original = pd.DataFrame(playersList) # Convert the data scraped into a Pandas DataFrame 

original.to_csv('premiershipplayers.csv') # Keep a back up of the data to save time later if required 

df2 = original.copy() # Working copy of the DataFrame (just in case) 


df2.info()


   
    
RangeIndex: 578 entries, 0 to 577
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   club         578 non-null    object
 1   name         578 non-null    object
 2   shirtNo      572 non-null    object
 3   nationality  562 non-null    object
 4   dob          562 non-null    object
 5   height       500 non-null    object
 6   weight       474 non-null    object
 7   appearances  578 non-null    object
 8   goals        578 non-null    object
 9   wins         578 non-null    object
 10  losses       578 non-null    object
dtypes: object(11)
memory usage: 49.8+ KB

   

*** A total of 578 player. ***

6 without shirt number

16 without nationality listed

16 without dob listed

78 without height listed

104 without weight listed

Cleanup Data

  1. Remove spaces and newline from dob, appearances, goals, wins and losses columns

  2. Change type of dob to date

  3. change type of appearances, goals, wins, losses to int

     df2['dob'] = df2['dob'].str.replace('\n','').str.strip(' ')
     df2['appearances'] = df2['appearances'].str.replace('\n','').str.strip(' ')
     df2['goals'] = df2['goals'].str.replace('\n','').str.strip(' ')
     df2['wins'] = df2['wins'].str.replace('\n','').str.strip(' ')
     df2['losses'] = df2['losses'].str.replace('\n','').str.strip(' ')
    
     # change type of dob, appearances, goals, wins, losses
     from datetime import  date
    
     df2['dob'] = pd.to_datetime(df2['dob'],format='%d/%m/%Y').dt.date
     df2["appearances"] = pd.to_numeric(df2["appearances"])
     df2["goals"] = pd.to_numeric(df2["goals"])
     df2["wins"] = pd.to_numeric(df2["wins"])
     df2["losses"] = pd.to_numeric(df2["losses"])
     df2['height'] = df2['height'].str[:-2]
     df2["height"] = pd.to_numeric(df2["height"])
     
     
     # Create age column
    
     today = date.today()
    
     def age(born):
         if born:
             return today.year - born.year - ((today.month, 
                                           today.day) < (born.month, 
                                                         born.day))
         else:
             return np.nan
    
     df2['age'] = df2['dob'].apply(age)
    

10 Oldest Players

    df2.sort_values('age',ascending=False).head(10)

image

10 Youngest Players

    df2.sort_values('age',ascending=True).head(10)

image

Squad Sizes

    df2.groupby(['club'])['club'].count().sort_values(ascending=False)

image

Team's Average Player Age

    plt.ylim([20, 30])
    df2.groupby(['club'])['age'].mean().sort_values(ascending=False).plot.bar()

image

Burnley appear to not only have one of the highest average player ages but also the owest number of registered players

Top 10 Premiership Appearances

    df2.sort_values('appearances',ascending=False).head(10)

image

Collective Premiership Appearances per Club

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False)

image

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False).plot.bar()

image

10 Tallest Playes

    df2.sort_values('height',ascending=False).head(10)

image

10 Shortest Playes

    df2.sort_values('height',ascending=True).head(10)

image

Nationality totals of Players

    pd.set_option('display.max_rows', 100)
    df.groupby(['nationality'])['club'].count().sort_values(ascending=False)

Nationality totals per club

    pd.set_option('display.max_rows', 500)
    df.groupby(['club','nationality'])['nationality'].count()
Additional tools for particle accelerator data analysis and machine information

PyLHC Tools This package is a collection of useful scripts and tools for the Optics Measurements and Corrections group (OMC) at CERN. Documentation Au

PyLHC 3 Apr 13, 2022
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

WhiteBox 3 Oct 03, 2022
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan PyStan is a Python interface to Stan, a package for Bayesian inference. Stan® is a state-of-the-art platform for statistical modeling and high-

Stan 229 Dec 29, 2022
Exploring the Top ML and DL GitHub Repositories

This repository contains my work related to my project where I scraped data on the most popular machine learning and deep learning GitHub repositories in order to further visualize and analyze it.

Nico Van den Hooff 17 Aug 21, 2022
MS in Data Science capstone project. Studying attacks on autonomous vehicles.

Surveying Attack Models for CAVs Guide to Installing CARLA and Collecting Data Our project focuses on surveying attack models for Connveced Autonomous

Isabela Caetano 1 Dec 09, 2021
An Aspiring Drop-In Replacement for NumPy at Scale

Legate NumPy is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime. Using Legate NumPy you do things like run the f

Legate 502 Jan 03, 2023
Maximum Covariance Analysis in Python

xMCA | Maximum Covariance Analysis in Python The aim of this package is to provide a flexible tool for the climate science community to perform Maximu

Niclas Rieger 39 Jan 03, 2023
Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022
My solution to the book A Collection of Data Science Take-Home Challenges

DS-Take-Home Solution to the book "A Collection of Data Science Take-Home Challenges". Note: Please don't contact me for the dataset. This repository

Jifu Zhao 1.5k Jan 03, 2023
VevestaX is an open source Python package for ML Engineers and Data Scientists.

VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.

Vevesta 24 Dec 14, 2022
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
Office365 (Microsoft365) audit log analysis tool

Office365 (Microsoft365) audit log analysis tool The header describes it all WHY?? The first line of code was written long time before other colleague

Anatoly 1 Jul 27, 2022
An ETL framework + Monitoring UI/API (experimental project for learning purposes)

Fastlane An ETL framework for building pipelines, and Flask based web API/UI for monitoring pipelines. Project structure fastlane |- fastlane: (ETL fr

Dan Katz 2 Jan 06, 2022
This module is used to create Convolutional AutoEncoders for Variational Data Assimilation

VarDACAE This module is used to create Convolutional AutoEncoders for Variational Data Assimilation. A user can define, create and train an AE for Dat

Julian Mack 23 Dec 16, 2022
Powerful, efficient particle trajectory analysis in scientific Python.

freud Overview The freud Python library provides a simple, flexible, powerful set of tools for analyzing trajectories obtained from molecular dynamics

Glotzer Group 195 Dec 20, 2022
Time ranges with python

timeranges Time ranges. Read the Docs Installation pip timeranges is available on pip: pip install timeranges GitHub You can also install the latest v

Micael Jarniac 2 Sep 01, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
peptides.py is a pure-Python package to compute common descriptors for protein sequences

peptides.py Physicochemical properties and indices for amino-acid sequences. 🗺️ Overview peptides.py is a pure-Python package to compute common descr

Martin Larralde 32 Dec 31, 2022