Using Python to scrape some basic player information from www.premierleague.com, then using Pandas to analyse that data.

Last update: Sep 06, 2021

Overview

PremiershipPlayerAnalysis

Using Python to scrape some basic player information from www.premierleague.com, then using Pandas to analyse that data. Note: as I understand it, the squad data on this site can change at any time, so your results may differ.

Improvement: calculate age to a finer degree than whole years.
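One possible way to approach this improvement, as a minimal sketch only: count whole years the same way the walkthrough does, then add the days since the last birthday as a fraction of the current birthday-to-birthday year. The names `precise_age` and `_birthday_in` and the sample dates are hypothetical and not part of the original notebook.

```python
from datetime import date

def _birthday_in(year, born):
    """Birthday in the given year; 29 Feb falls back to 28 Feb in non-leap years."""
    try:
        return born.replace(year=year)
    except ValueError:
        return born.replace(year=year, day=28)

def precise_age(born, today=None):
    """Age in years as a float (hypothetical helper)."""
    today = today or date.today()
    # Whole years: subtract one if this year's birthday has not happened yet
    years = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    last = _birthday_in(born.year + years, born)      # most recent birthday
    nxt = _birthday_in(born.year + years + 1, born)   # next birthday
    # Fraction of the current birthday-to-birthday year already elapsed
    return years + (today - last).days / (nxt - last).days

# Made-up date of birth, evaluated on a fixed day so the result is repeatable
print(round(precise_age(date(1990, 3, 15), date(2021, 9, 6)), 2))
```

Applied to the DataFrame, this would replace the whole-year function in a call such as `df2['dob'].apply(...)`, with missing dates still needing to be handled separately.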

This was developed in Jupyter Notebook, and this walkthrough will assume you are doing the same.

Once you have run the scraping:

original = pd.DataFrame(playersList) # Convert the data scraped into a Pandas DataFrame 

original.to_csv('premiershipplayers.csv') # Keep a back up of the data to save time later if required 

df2 = original.copy() # Working copy of the DataFrame (just in case) 


df2.info()


   
    
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578 entries, 0 to 577
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   club         578 non-null    object
 1   name         578 non-null    object
 2   shirtNo      572 non-null    object
 3   nationality  562 non-null    object
 4   dob          562 non-null    object
 5   height       500 non-null    object
 6   weight       474 non-null    object
 7   appearances  578 non-null    object
 8   goals        578 non-null    object
 9   wins         578 non-null    object
 10  losses       578 non-null    object
dtypes: object(11)
memory usage: 49.8+ KB

*** A total of 578 players. ***

6 without shirt number

16 without nationality listed

16 without dob listed

78 without height listed

104 without weight listed
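The counts above follow from subtracting each non-null count from 578. They can also be computed directly; the sketch below uses a small made-up frame in place of `df2`, so the names and values are illustrative only.

```python
import pandas as pd

# Small made-up frame standing in for df2; the values are illustrative only
sample = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "shirtNo": ["7", None, "10"],
    "height": ["180cm", None, None],
})

missing = sample.isna().sum()   # number of missing values per column
print(missing[missing > 0])     # show only the columns that have gaps
```

On the real data the same call would be `df2.isna().sum()`.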

Cleanup Data

Remove spaces and newlines from the dob, appearances, goals, wins and losses columns
Change the type of dob to date

Change the type of appearances, goals, wins and losses to int
Strip the two-character unit suffix from height and convert it to a number

 df2['dob'] = df2['dob'].str.replace('\n','').str.strip(' ')
 df2['appearances'] = df2['appearances'].str.replace('\n','').str.strip(' ')
 df2['goals'] = df2['goals'].str.replace('\n','').str.strip(' ')
 df2['wins'] = df2['wins'].str.replace('\n','').str.strip(' ')
 df2['losses'] = df2['losses'].str.replace('\n','').str.strip(' ')

 # change type of dob, appearances, goals, wins, losses
 import numpy as np  # needed for np.nan in age() below
 from datetime import date

 df2['dob'] = pd.to_datetime(df2['dob'],format='%d/%m/%Y').dt.date
 df2["appearances"] = pd.to_numeric(df2["appearances"])
 df2["goals"] = pd.to_numeric(df2["goals"])
 df2["wins"] = pd.to_numeric(df2["wins"])
 df2["losses"] = pd.to_numeric(df2["losses"])
 df2['height'] = df2['height'].str[:-2]
 df2["height"] = pd.to_numeric(df2["height"])
 
 
 # Create age column

 today = date.today()

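 def age(born):
     # Players without a listed dob have a missing value here, so test for it explicitly
     if pd.notna(born):
         # Subtract one if this year's birthday has not happened yet
         return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
     else:
         return np.nan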

 df2['age'] = df2['dob'].apply(age)
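The conversions above raise an error if the site returns a value in an unexpected format. A more defensive variant, sketched here on made-up rows, passes `errors="coerce"` so that unparseable values become missing instead. This is an assumption about what might be wanted, not the original notebook's behaviour; the sketch also assumes the height suffix is `cm` and keeps `dob` as a pandas datetime rather than converting it with `.dt.date`.

```python
import pandas as pd

# Made-up raw values imitating the scraped strings; illustrative only
raw = pd.DataFrame({
    "dob": ["\n  15/03/1990 \n", None, "not listed"],
    "appearances": ["\n 120 \n", " 3", "\n0\n"],
    "height": ["180cm", None, "175cm"],
})

clean = raw.copy()
# str.strip() with no argument removes spaces and newlines from both ends
clean["dob"] = pd.to_datetime(clean["dob"].str.strip(),
                              format="%d/%m/%Y", errors="coerce")
clean["appearances"] = pd.to_numeric(clean["appearances"].str.strip(),
                                     errors="coerce")
# Assumes the unit suffix is "cm"
clean["height"] = pd.to_numeric(clean["height"].str.replace("cm", "", regex=False),
                                errors="coerce")
print(clean.dtypes)
```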

10 Oldest Players

    df2.sort_values('age',ascending=False).head(10)

10 Youngest Players

    df2.sort_values('age',ascending=True).head(10)

Squad Sizes

    df2.groupby(['club'])['club'].count().sort_values(ascending=False)

Team's Average Player Age

    ax = df2.groupby(['club'])['age'].mean().sort_values(ascending=False).plot.bar()
    ax.set_ylim([20, 30])  # zoom the y-axis to ages 20-30

Burnley appear to have not only one of the highest average player ages but also the lowest number of registered players.
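An observation like this is easier to check when squad size and average age sit side by side. A minimal sketch using named aggregation on made-up rows follows; the club names, ages and the column names `players` and `mean_age` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for df2; clubs and ages are illustrative only
sample = pd.DataFrame({
    "club": ["Club A", "Club A", "Club A", "Club B", "Club B"],
    "name": ["P1", "P2", "P3", "P4", "P5"],
    "age": [22, 24, 26, 30, 32],
})

# One row per club: number of players and their mean age, oldest squad first
summary = (
    sample.groupby("club")
          .agg(players=("name", "count"), mean_age=("age", "mean"))
          .sort_values("mean_age", ascending=False)
)
print(summary)
```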

Top 10 Premiership Appearances

    df2.sort_values('appearances',ascending=False).head(10)

Collective Premiership Appearances per Club

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False)

    df2.groupby(['club'])['appearances'].sum().sort_values(ascending=False).plot.bar()

10 Tallest Players

    df2.sort_values('height',ascending=False).head(10)

10 Shortest Players

    df2.sort_values('height',ascending=True).head(10)

Nationality totals of Players

    pd.set_option('display.max_rows', 100)
    df2.groupby(['nationality'])['club'].count().sort_values(ascending=False)

Nationality totals per club

    pd.set_option('display.max_rows', 500)
    df2.groupby(['club','nationality'])['nationality'].count()
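The grouped count above produces one long list. If a club-by-nationality grid is easier to read, `pd.crosstab` is one option; the sketch below uses made-up rows rather than the scraped data.

```python
import pandas as pd

# Made-up rows standing in for df2; illustrative only
sample = pd.DataFrame({
    "club": ["Club A", "Club A", "Club B", "Club B", "Club B"],
    "nationality": ["England", "France", "England", "England", "Brazil"],
})

# Rows are clubs, columns are nationalities, cells are player counts
table = pd.crosstab(sample["club"], sample["nationality"])
print(table)
```

Combinations that do not occur show as 0 in the grid, whereas the grouped count simply omits them.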
