Soccerdata - Efficiently scrape soccer data from various sources

Last update: Jan 04, 2023

Overview

SoccerData is a collection of wrappers over soccer data from Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, SoFIFA and WhoScored. You get Pandas DataFrames with sensible, matching column names and identifiers across datasets. Data is downloaded when needed and cached locally.

import soccerdata as sd

# Create scraper class instance for the Premier League
five38 = sd.FiveThirtyEight('ENG-Premier League', '1819')

# Fetch dataframes
games = five38.read_games()

To learn how to install, configure and use SoccerData, see the Quickstart guide. For documentation on each of the supported data sources, see the API reference.

Disclaimer: As soccerdata relies on web scraping, any changes to the scraped websites will break the package. Hence, do not expect that all code will work all the time. If you spot any bugs, then please fork it and start a pull request.

Comments

[FBref] 403 error when downloading data

Which Python version are you using?

Python 3.8.13

Which version of soccerdata are you using?

1.0.1

What did you do?

fbref = sd.FBref(leagues="NED-Eredivisie", seasons="2021-2022", proxy='tor')
team_season_stats = fbref.read_schedule()

What did you expect to see?

Downloaded team stats

What did you see instead?

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://fbref.com/en/comps/

opened by koenklomps 9

[General] Selenium fails with SOCKS proxy (for tor) with `WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED`

Which Python version are you using?
Which version of soccerdata are you using?

import soccerdata as sd
import sys
print(sd.__version__)
print(sys.version)

0.0.2
3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]

What did you do?
What did you expect to see?
What did you see instead?

I tried to set use_tor=True for downloading events for a match with tor running in the background, but read_events ended with an error indicating that the proxy connection failed.

ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", use_tor=True)
events = ws.read_events(match_id=1485185)

[03/19/22 09:54:01] INFO     Saving cached data to C:\Users\antho\soccerdata\data\WhoScored    _common.py:59
[03/19/22 09:54:04] INFO     Retrieving game schedule of ENG-Premier League - 2021 from the cache    whoscored.py:314
                    INFO     [2/1] Retrieving game with id=1485185    whoscored.py:499
                    INFO     Scraping https://www.whoscored.com/Matches/1485185/Live    whoscored.py:577
---------------------------------------------------------------------------
WebDriverException                        Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_27592\4024154899.py in <module>
      1 ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", use_tor=True, path_to_browser="c:/users/antho/downloads/chromedriver.exe")
----> 2 events = ws.read_events(match_id=1485185)

~\anaconda3\envs\soccerdata\lib\site-packages\soccerdata\whoscored.py in read_events(self, match_id, force_cache, live)
    507                 filepath,
    508                 var="requirejs.s.contexts._.config.config.params.args.matchCentreData",
--> 509                 no_cache=live,
    510             )
    511             json_data = json.load(reader)

~\anaconda3\envs\soccerdata\lib\site-packages\soccerdata\whoscored.py in _download_and_save(self, url, filepath, max_age, no_cache, var)
    576         if cache_invalid or filepath is None or not filepath.exists():
    577             logger.info("Scraping %s", url)
--> 578             self.driver.get(url)
    579             time.sleep(5 + random.random() * 5)
    580             if "Incapsula incident ID" in self.driver.page_source:

~\anaconda3\envs\soccerdata\lib\site-packages\undetected_chromedriver\__init__.py in get_wrapped(*args, **kwargs)
    495                     },
    496                 )
--> 497             return orig_get(*args, **kwargs)
    498 
    499         self.get = get_wrapped

~\anaconda3\envs\soccerdata\lib\site-packages\undetected_chromedriver\__init__.py in get(self, url)
    533         if self._get_cdc_props():
    534             self._hook_remove_cdc_props()
--> 535         return super().get(url)
    536 
    537     def add_cdp_listener(self, event_name, callback):

~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\webdriver.py in get(self, url)
    435         Loads a web page in the current browser session.
    436         """
--> 437         self.execute(Command.GET, {'url': url})
    438 
    439     @property

~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    423         response = self.command_executor.execute(driver_command, params)
    424         if response:
--> 425             self.error_handler.check_response(response)
    426             response['value'] = self._unwrap_value(
    427                 response.get('value', None))

~\anaconda3\envs\soccerdata\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    245                 alert_text = value['alert'].get('text')
    246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 247         raise exception_class(message, screen, stacktrace)
    248 
    249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:

WebDriverException: Message: unknown error: net::ERR_PROXY_CONNECTION_FAILED
  (Session info: headless chrome=99.0.4844.74)
Stacktrace:
Backtrace:
	Ordinal0 [0x00509943+2595139]
	Ordinal0 [0x0049C9F1+2148849]
	Ordinal0 [0x00394528+1066280]
	Ordinal0 [0x00390DB4+1052084]
	Ordinal0 [0x003863BD+1008573]
	Ordinal0 [0x00386F7C+1011580]
	Ordinal0 [0x003865CA+1009098]
	Ordinal0 [0x00385BC6+1006534]
	Ordinal0 [0x00384AD0+1002192]
	Ordinal0 [0x00384FAD+1003437]
	Ordinal0 [0x00395C4A+1072202]
	Ordinal0 [0x003EC19D+1425821]
	Ordinal0 [0x003DB9EC+1358316]
	Ordinal0 [0x003EBAF2+1424114]
	Ordinal0 [0x003DB806+1357830]
	Ordinal0 [0x003B6086+1204358]
	Ordinal0 [0x003B6F96+1208214]
	GetHandleVerifier [0x006AB232+1658114]
	GetHandleVerifier [0x0076312C+2411516]
	GetHandleVerifier [0x0059F261+560433]
	GetHandleVerifier [0x0059E366+556598]
	Ordinal0 [0x004A286B+2173035]
	Ordinal0 [0x004A75F8+2192888]
	Ordinal0 [0x004A76E5+2193125]
	Ordinal0 [0x004B11FC+2232828]
	BaseThreadInitThunk [0x76106739+25]
	RtlGetFullPathName_UEx [0x76FF8E7F+1215]
	RtlGetFullPathName_UEx [0x76FF8E4D+1165]

Here's what my terminal looks like with tor running (prior to calling read_events()):

/c/Users/antho/soccerdata$ tor

Mar 19 09:53:33.865 [notice] Tor 0.4.2.7 running on Linux with Libevent 2.1.11-stable, OpenSSL 1.1.1f, Zlib 1.2.11, Liblzma 5.2.4, and Libzstd 1.4.4.
Mar 19 09:53:33.865 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
Mar 19 09:53:33.865 [notice] Read configuration file "/etc/tor/torrc".
Mar 19 09:53:33.866 [notice] Opening Socks listener on 127.0.0.1:9050
Mar 19 09:53:33.866 [notice] Opened Socks listener on 127.0.0.1:9050
Mar 19 09:53:33.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Mar 19 09:53:33.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Mar 19 09:53:34.000 [notice] Bootstrapped 0% (starting): Starting
Mar 19 09:53:34.000 [notice] Starting with guard context "default"
Mar 19 09:53:35.000 [notice] Bootstrapped 5% (conn): Connecting to a relay
Mar 19 09:53:35.000 [notice] Bootstrapped 10% (conn_done): Connected to a relay
Mar 19 09:53:35.000 [notice] Bootstrapped 14% (handshake): Handshaking with a relay
Mar 19 09:53:35.000 [notice] Bootstrapped 15% (handshake_done): Handshake with a relay done
Mar 19 09:53:35.000 [notice] Bootstrapped 75% (enough_dirinfo): Loaded enough directory info to build circuits        
Mar 19 09:53:35.000 [notice] Bootstrapped 90% (ap_handshake_done): Handshake finished with a relay to build circuits  
Mar 19 09:53:35.000 [notice] Bootstrapped 95% (circuit_create): Establishing a Tor circuit
Mar 19 09:53:36.000 [notice] Bootstrapped 100% (done): Done

I've opened my browser to the port to verify that something is running, although this is using an HTTP proxy, so the warning here is expected.

opened by tonyelhabr 7

[FBref] Unable to scrape Men's World Cup stats

Hi @probberechts - this looks like a wonderful set of tools. Can't wait to get stuck deeper into it. Thank you!

Objective: To be able to scrape FBref stats for historic World Cups (and the upcoming 2022 World Cup) from the pages below.

World Cup stats landing page -> https://fbref.com/en/comps/1/World-Cup-Stats
Stats page for 2018 World Cup -> https://fbref.com/en/comps/1/2018/2018-FIFA-World-Cup-Stats

1. Adding a new league - Working as expected

Following the "Adding additional leagues" section of the docs (https://soccerdata.readthedocs.io/en/latest/usage.html), I successfully added a new league called "INTL-WorldCup".

Content of league_dict.json

{
  "INTL-WorldCup": {
    "FBref": "World-Cup-Stats",
    "season_start": "Aug",
    "season_end": "May"
  }
}

Note: I had to remove a comma from just after the 2nd last curly bracket.
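
The comma matters because strict JSON parsers reject a trailing comma after the last entry of an object. A minimal sketch using Python's standard json module illustrates this. The helper name is_valid_json is made up for illustration, and it is an assumption that soccerdata reads league_dict.json with a strict parser of this kind.

```python
import json

# The league entry from the report, once with the trailing comma and once without.
WITH_TRAILING_COMMA = """
{
  "INTL-WorldCup": {
    "FBref": "World-Cup-Stats",
    "season_start": "Aug",
    "season_end": "May"
  },
}
"""

WITHOUT_TRAILING_COMMA = """
{
  "INTL-WorldCup": {
    "FBref": "World-Cup-Stats",
    "season_start": "Aug",
    "season_end": "May"
  }
}
"""


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_json(WITH_TRAILING_COMMA))
    print(is_valid_json(WITHOUT_TRAILING_COMMA))
```

Running a check like this on the config file before starting the scraper separates syntax problems from league-lookup problems.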

Result: When I call sd.FBref.available_leagues(), it returns the expected result below:

[
  'Big 5 European Leagues Combined',
  'ENG-Premier League',
  'ESP-La Liga',
  'FRA-Ligue 1',
  'GER-Bundesliga',
  'INTL-WorldCup',
  'ITA-Serie A'
]

2. Can I pull back scraped data?

This line ran without error: fbref = sd.FBref(leagues="INTL-WorldCup", seasons=2018)

However, when I ran the 2 lines below

team_season_stats = fbref.read_team_season_stats(stat_type="standard")
team_season_stats.head()

...I got this error below. What am I doing wrong?

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11984/128004415.py in <module>
----> 1 team_season_stats = fbref.read_team_season_stats(stat_type="standard")
      2 team_season_stats.head()

soccerdata\fbref.py in read_team_season_stats(self, stat_type, opponent_stats)
    252 
    253         # get league IDs
--> 254         seasons = self.read_seasons()
    255 
    256         # collect teams

soccerdata\fbref.py in read_seasons(self)
    169             seasons.append(df_table)
    170 
--> 171         df = pd.concat(seasons).pipe(standardize_colnames)
    172         # A competition name field is not inlcuded in the Big 5 European Leagues Combined
    173         if "competition_name" in df.columns:

~\Miniconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    302         verify_integrity=verify_integrity,
    303         copy=copy,
--> 304         sort=sort,
    305     )
    306 

~\Miniconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    349 
    350         if len(objs) == 0:
--> 351             raise ValueError("No objects to concatenate")
    352 
    353         if keys is None:

ValueError: No objects to concatenate

enhancement

opened by philbywalsh 6

[FBref] Team against stats

Hi! First of all, thank you for the code!

I'd like to ask if it would be possible to also get stats from the "Opponent Stats" table. Thanks!
enhancement

opened by RobiFera 5
Faster scraping of player seasons stats - fbref.
I am having another go at this (previous attempt: #69) because you have updated the FBRef class to use the league pages. I have tried this against all stat_types for 2020-2021 and it seems to work.

Amended the FBRef scraper so it uses the Big 5 pages if all five leagues are requested.

Added some type checks for the stats_type argument.

I am not able to run the tests locally, but I'll try to fix anything that doesn't work after.
opened by andrewRowlinson 3
[General] Tor port is not up to date
(seen on Windows 11)

The Tor port hard-coded for scrapers initialized with the proxy="tor" option is 9050, whereas newer Tor versions seem to use port 9150.

The code should check whether port 9050 works and, if it doesn't, fall back to 9150.

Until that is fixed, anyone affected can pass the proxies explicitly:

return_proxies = lambda: {
    "http": "socks5://127.0.0.1:9150",
    "https": "socks5://127.0.0.1:9150",
}
ws = sd.WhoScored(leagues="ENG-Premier League", seasons="20-21", proxy=return_proxies)
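
The port check suggested in this report could look roughly like the sketch below. It only uses the standard library; the helper names first_open_port and tor_proxies are hypothetical and not part of soccerdata. Note that an accepted TCP connection only shows that something is listening on the port, not that it is Tor.

```python
import socket
from typing import Dict, Iterable, Optional


def first_open_port(
    ports: Iterable[int], host: str = "127.0.0.1", timeout: float = 1.0
) -> Optional[int]:
    """Return the first port that accepts a TCP connection, or None."""
    for port in ports:
        try:
            # create_connection raises OSError if nothing is listening.
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            continue
    return None


def tor_proxies(port: int, host: str = "127.0.0.1") -> Dict[str, str]:
    """Build a proxies dict in the same shape as the workaround in this report."""
    address = f"socks5://{host}:{port}"
    return {"http": address, "https": address}


if __name__ == "__main__":
    # Try the two ports named in the report, in that order.
    port = first_open_port([9050, 9150])
    if port is None:
        print("No listener found on 9050 or 9150")
    else:
        print(tor_proxies(port))
```

The resulting dict could then be wrapped in a callable and handed to the scraper in the same way as return_proxies in the workaround, for example proxy=lambda: tor_proxies(port).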
documentation
opened by david-leconte 3
[FBref] Can't fetch schedule data
If you run:

import soccerdata as sd
fbref = sd.FBref(leagues="ENG-Premier League", seasons=2021)
print(fbref.__doc__)
epl_schedule = fbref.read_schedule()

you will get an error:

frame.py 3832 _set_item value = self._sanitize_column(value)

frame.py 4535 _sanitize_column com.require_length_match(value, self.index)

common.py 557 require_length_match raise ValueError(

ValueError: Length of values (0) does not match length of index (31)
opened by BelkacemB 3
Update dependency Sphinx to v5
This PR contains the following updates:

| Package | Change |
|---|---|
| Sphinx (source) | ^4.3.2 -> ^5.0.0 |
| sphinx (source) | ==4.5.0 -> ==5.0.2 |

Release Notes

sphinx-doc/sphinx

v5.0.2

Features added

#10523: HTML Theme: Expose the Docutils's version info tuple as a template variable, docutils_version_info. Patch by Adam Turner.

Bugs fixed

#10538: autodoc: Inherited class attribute having docstring is documented even if :confval:autodoc_inherit_docstring is disabled

#10509: autosummary: autosummary fails with a shared library

#10497: py domain: Failed to resolve strings in Literal. Patch by Adam Turner.

#10523: HTML Theme: Fix double brackets on citation references in Docutils 0.18+. Patch by Adam Turner.

#10534: Missing CSS for nav.contents in Docutils 0.18+. Patch by Adam Turner.

v5.0.1

Bugs fixed

#10498: gettext: TypeError is raised when sorting warning messages if a node has no line number. Patch by Adam Turner.

#10493: HTML Theme: :rst:dir:topic directive is rendered incorrectly with Docutils 0.18. Patch by Adam Turner.

#10495: IndexError is raised for a :rst:role:kbd role having a separator. Patch by Adam Turner.

v5.0.0

Dependencies

5.0.0 b1

#10164: Support Docutils 0.18 (https://docutils.sourceforge.io/RELEASE-NOTES.html#release-0-18-2021-10-26). Patch by Adam Turner.

Incompatible changes

5.0.0 b1

#10031: autosummary: sphinx.ext.autosummary.import_by_name() now raises ImportExceptionGroup instead of ImportError when it failed to import target object. Please handle the exception if your extension uses the function to import Python object. As a workaround, you can disable the behavior via grouped_exception=False keyword argument until v7.0.

#9962: texinfo: Customizing styles of emphasized text via @definfoenclose command was not supported because the command was deprecated since texinfo 6.8

#2068: :confval:intersphinx_disabled_reftypes has changed default value from an empty list to ['std:doc'] to avoid overly surprising silent intersphinx resolutions. To migrate: either add an explicit inventory name to the references intersphinx should resolve, or explicitly set the value of this configuration variable to an empty list.

#10197: html theme: Reduce body_min_width setting in basic theme to 360px

#9999: LaTeX: separate terms from their definitions by a CR (refs: #9985)

#10062: Change the default language to 'en' if any language is not set in conf.py

5.0.0 final

#10474: :confval:language does not accept None as its value. The default value of language is now 'en'. Patch by Adam Turner and Takeshi KOMIYA.

Deprecated

5.0.0 b1

#10028: jQuery and underscore.js will no longer be automatically injected into themes from Sphinx 6.0. If you develop a theme or extension that uses the jQuery, $, or $u global objects, you need to update your JavaScript or use the mitigation below.

To re-add jQuery and underscore.js, you will need to copy jquery.js and underscore.js from the Sphinx repository (https://github.com/sphinx-doc/sphinx/tree/v4.3.2/sphinx/themes/basic/static) to your static directory, and add the following to your layout.html:

{%- block scripts %}
{{ super() }}
{%- endblock %}

Patch by Adam Turner.

setuptools integration. The build_sphinx sub-command for setup.py is marked as deprecated to follow the policy of setuptools team.

The locale argument of sphinx.util.i18n:babel_format_date() becomes required

The language argument of sphinx.util.i18n:format_date() becomes required

sphinx.builders.html.html5_ready

sphinx.io.read_doc()

sphinx.util.docutils.__version_info__

sphinx.util.docutils.is_html5_writer_available()

sphinx.writers.latex.LaTeXWriter.docclasses

Features added

5.0.0 b1

#9075: autodoc: The default value of :confval:autodoc_typehints_format is changed to 'smart'. It will suppress the leading module names of typehints (ex. io.StringIO -> StringIO).

#8417: autodoc: :inherited-members: option now takes multiple classes. It allows to suppress inherited members of several classes on the module at once by specifying the option to :rst:dir:automodule directive

#9792: autodoc: Add new option for autodoc_typehints_description_target to include undocumented return values but not undocumented parameters.

#10285: autodoc: singledispatch functions having typehints are not documented

autodoc: :confval:autodoc_typehints_format now also applies to attributes, data, properties, and type variable bounds.

#10258: autosummary: Recognize a documented attribute of a module as non-imported

#10028: Removed internal usages of JavaScript frameworks (jQuery and underscore.js) and modernised doctools.js and searchtools.js to ECMAScript 2018. Patch by Adam Turner.

#10302: C++, add support for conditional expressions (?:).

#5157, #10251: Inline code is able to be highlighted via :rst:dir:role directive

#10337: Make sphinx-build faster by caching Publisher object during build. Patch by Adam Turner.

Bugs fixed

5.0.0 b1

#10200: apidoc: Duplicated submodules are shown for modules having both .pyx and .so files. Patch by Adam Turner and Takeshi KOMIYA.

#10279: autodoc: Default values for keyword only arguments in overloaded functions are rendered as a string literal

#10280: autodoc: :confval:autodoc_docstring_signature unexpectedly generates return value typehint for constructors if docstring has multiple signatures

#10266: autodoc: :confval:autodoc_preserve_defaults does not work for mixture of keyword only arguments with/without defaults

#10310: autodoc: class methods are not documented when decorated with mocked function

#10305: autodoc: Failed to extract optional forward-ref'ed typehints correctly via :confval:autodoc_type_aliases

#10421: autodoc: :confval:autodoc_preserve_defaults doesn't work on class methods

#10214: html: invalid language tag was generated if :confval:language contains a country code (ex. zh_CN)

#9974: html: Updated jQuery version from 3.5.1 to 3.6.0

#10236: html search: objects are duplicated in search result

#9962: texinfo: Deprecation message for @definfoenclose command on building texinfo document

#10000: LaTeX: glossary terms with common definition are rendered with too much vertical whitespace

#10188: LaTeX: alternating multiply referred footnotes produce a ? in pdf output

#10363: LaTeX: make 'howto' title page rule use \linewidth for compatibility with usage of a twocolumn class option

#10318: :prepend: option of :rst:dir:literalinclude directive does not work with :dedent: option

5.0.0 final

#9575: autodoc: The annotation of return value should not be shown when autodoc_typehints="description"

#9648: autodoc: *args and **kwargs entries are duplicated when autodoc_typehints="description"

#8180: autodoc: Docstring metadata ignored for attributes

#10443: epub: EPUB builder can't detect the mimetype of .webp file

#10104: gettext: Duplicated locations are shown if 3rd party extension does not provide correct information

#10456: py domain: :meta: fields are displayed if docstring contains two or more meta-field

#9096: sphinx-build: the value of progress bar for parallel build is wrong

#10110: sphinx-build: exit code is not changed when error is raised on builder-finished event

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.

[ ] If you want to rebase/retry this PR, click this checkbox.

This PR has been generated by Mend Renovate. View repository job log here.
opened by renovate[bot] 3
[FBref] Seasons parameter does not work with read_player_season_stats method
The FBref read_player_season_stats method does not seem to consider the seasons parameter. No matter which season is passed to the constructor, it just retrieves the player stats for the most recent season (2021-2022).

Code used is mentioned below

import soccerdata as sd
fbref = sd.FBref(no_cache=False, no_store=False, leagues="ENG-Premier League", seasons='11-12')
pl_player_season_stats = fbref.read_player_season_stats(stat_type='standard')
pl_player_season_stats.head()

Seasons parameter works perfectly with the read_team_season_stats method though.
bug
opened by Suwadith 3
FBRef pulling back "out of date" statistics

When using the API to scrape from FBRef I noticed something odd.

The APIs for each category of stats (e.g. shooting, passing) don't seem to be working off the same baseline of minutes played.

For example: take Emiliano Martínez (Argentina). The "90s" value should be consistent across each of these categories - but across the 5 categories below it varies extensively.

fbref.read_player_season_stats(stat_type="standard") = 2.0
fbref.read_player_season_stats(stat_type="shooting") = 2.0
fbref.read_player_season_stats(stat_type="goal_shot_creation") = 3.0
fbref.read_player_season_stats(stat_type="passing") = 3.0
fbref.read_player_season_stats(stat_type="defense") = 5.3

Note: This didn't seem to be an issue earlier in the tournament. Also, I only recently scraped "defense" statistics for the first time, so I wonder if somehow the results of old queries are being cached?

Also, when I navigate to the specific pages in my browser, the data looks consistent and fully up to date (i.e. a value of 6.3 for the "90s" attribute for Emiliano Martínez).

https://fbref.com/en/comps/1/passing/World-Cup-Stats
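
One way to test the caching suspicion is to look at the age of the files in the local cache. The sketch below lists files older than a given number of hours. The helper name stale_files is hypothetical, and the cache location used in the example (~/soccerdata/data/FBref) is an assumption modelled on the WhoScored cache path shown in a log earlier on this page; adjust it to your setup.

```python
import time
from pathlib import Path
from typing import List


def stale_files(cache_dir: Path, max_age_hours: float) -> List[Path]:
    """Return files under `cache_dir` last modified more than `max_age_hours` ago."""
    cutoff = time.time() - max_age_hours * 3600
    return sorted(
        path
        for path in cache_dir.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Assumed cache location; not confirmed by this report.
    cache_dir = Path.home() / "soccerdata" / "data" / "FBref"
    if cache_dir.exists():
        for path in stale_files(cache_dir, max_age_hours=24):
            print(path)
```

If the stat types with outdated "90s" values correspond to the oldest files, that would support the idea that each stat type is cached separately at the time it was first requested.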

opened by philbywalsh 2
[FBref] NaNs found in 'standard' and 'playing_time' stat_types
Hello,

I have found a small bug when pulling data from FBref.com. NaN values appear in the MP columns for the stat_types standard and playing_time, even for players who played during the season.

I found this problem after I wrote a function to obtain multiple stat_types for multiple seasons and converted the DataFrames from a multiindex to a standard pandas DataFrame. I found a large quantity of NaNs due to this transformation.

To troubleshoot, I did a single pull using the .read_player_season_stats(stat_type = 'standard') call on 2 seasons of data (1718 & 1819) and found NaN values in both the MP and Playing Time MP columns. Players who played and did not play had received NaN values in the aforementioned columns. Under the "Playing Time" section's MP column, I found 890 NaN values and in the standalone 'MP' column, I found 380 NaN values. I am transitioning from R to Python and have always used the flattened-style DataFrame in the past.

Attached is a csv file containing the aforementioned data.

Call:

fbref_test = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])
hold = fbref_test.read_player_season_stats(stat_type='standard')
hold.head()

I greatly appreciate your assistance. fbref_nan_bug_df.csv
opened by spartanovo 2
[WhoScored] Issue in scraping if game has no goal

Hello everyone,

If you try to scrape a game that ended 0-0 from WhoScored using the package, a KeyError occurs. At line 676 of whoscored.py, the code tries to access the feature 'is_goal', which does not seem to exist if the final score is 0-0.

Example of a game where this issue happens: https://www.whoscored.com/Matches/1640849/Live/England-Premier-League-2022-2023-Newcastle-Leeds

Python code:

import soccerdata as sd
ws = sd.WhoScored(leagues="ENG-Premier League", seasons="22-23")
events = ws.read_events(match_id=1640849)

Error :

KeyError: ['is_goal'] not in index

A solution could be to check that all the necessary features are present in the dataframe and, if not, to add them with np.nan values. The Python code, placed before line 676, could look as follows:

Python code:

cols = ['event_id', 'expanded_minute', 'is_touch', 'minute', 'outcome_type', 'period', 'qualifiers', 'satisfied_events_types', 'second', 'team_id', 'type', 'x', 'y', 'end_x', 'end_y', 'player_id', 'blocked_x', 'blocked_y', 'goal_mouth_y', 'goal_mouth_z', 'is_shot', 'related_event_id', 'related_player_id', 'is_goal', 'card_type', '$idx', '$len', 'field', 'minute_info', 'satisfiers', 'text', 'game_id', 'player', 'team']

# Add any expected column that is missing, filled with NaN
for col in cols:
    if col not in df.columns:
        df[col] = np.nan

PS: A similar issue exists with the 'card_type' feature if no cards were given during the game.

Thanks, Ben

opened by BenSarfatiDS 0
[WhoScored] Date Format problem

Hello,

I'm trying to pull the schedule for any league, but I keep getting a date format error. Even when I pass the match ID, reading the data still fails because of the date format. How can I solve it?

ValueError: time data 'Jumatatu, Des 26 2022 12:30' does not match format '%A, %b %d %Y %H:%M'
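
The error can be reproduced without the scraper. The format in the message expects weekday and month names that the parser recognises, while 'Jumatatu' and 'Des' appear to be non-English (possibly Swahili) names. The sketch below uses the format string quoted in the error; the helper name try_parse is hypothetical, and whether the localized text comes from the browser or from the operating system is not established in this report. It assumes Python's default time locale, in which English names are recognised.

```python
from datetime import datetime
from typing import Optional

# Format string quoted in the error message above.
WS_DATE_FORMAT = "%A, %b %d %Y %H:%M"


def try_parse(text: str) -> Optional[datetime]:
    """Parse `text` with the WhoScored format, returning None on a mismatch."""
    try:
        return datetime.strptime(text, WS_DATE_FORMAT)
    except ValueError:
        return None


if __name__ == "__main__":
    print(try_parse("Monday, Dec 26 2022 12:30"))
    print(try_parse("Jumatatu, Des 26 2022 12:30"))
```

If the English variant parses and the localized one does not, the likely direction for a fix is to make the page render dates in English, or to parse the date in a way that does not depend on day and month names.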

opened by CBatatinha 2
Update dependency poetry to v1.3.1
This PR contains the following updates:

| Package | Change |
|---|---|
| poetry (source, changelog) | ==1.2.2 -> ==1.3.1 |

Release Notes

python-poetry/poetry

v1.3.1

Fixed

Fix an issue where an explicit dependency on lockfile was missing, resulting in a broken Poetry in rare circumstances (#7169).

v1.3.0

Added

Mark the lock file with an @generated comment as used by common tooling (#2773).

poetry check validates trove classifiers and warns for deprecations (#2881).

Introduce a top level -C, --directory option to set the working path (#6810).

Changed

New lock file format (version 2.0) (#6393).

Path dependency metadata is unconditionally re-locked (#6843).

URL dependency hashes are locked (#7121).

poetry update and poetry lock should now resolve dependencies more similarly (#6477).

poetry publish will report more useful errors when a file does not exist (#4417).

poetry add will check for duplicate entries using canonical names (#6832).

Wheels are preferred to source distributions when gathering metadata (#6547).

Git dependencies of extras are only fetched if the extra is requested (#6615).

Invoke pip with --no-input to prevent hanging without feedback (#6724, #6966).

Invoke pip with --isolated to prevent the influence of user configuration (#6531).

Interrogate environments with Python in isolated (-I) mode (#6628).

Raise an informative error when multiple version constraints overlap and are incompatible (#7098).

Fixed

Fix an issue where concurrent instances of Poetry would corrupt the artifact cache (#6186).

Fix an issue where Poetry can hang after being interrupted due to stale locking in cache (#6471).

Fix an issue where the output of commands executed with --dry-run contained duplicate entries (#4660).

Fix an issue where requests's pool size did not match the number of installer workers (#6805).

Fix an issue where poetry show --outdated failed with a runtime error related to direct origin dependencies (#6016).

Fix an issue where only the last command of an ApplicationPlugin is registered (#6304).

Fix an issue where git dependencies were fetched unnecessarily when running poetry lock --no-update (#6131).

Fix an issue where stdout was polluted with messages that should go to stderr (#6429).

Fix an issue with poetry shell activation and zsh (#5795).

Fix an issue where URL dependencies were shown as outdated (#6396).

Fix an issue where the source field of a dependency with extras was ignored (#6472).

Fix an issue where a package from the wrong source was installed for a multiple-constraints dependency with different sources (#6747).

Fix an issue where dependencies from different sources were merged during dependency resolution (#6679).

Fix an issue where experimental.system-git-client could not be used via environment variable (#6783).

Fix an issue where Poetry fails with an AssertionError due to distribution.files being None (#6788).

Fix an issue where poetry env info did not respect virtualenvs.prefer-active-python (#6986).

Fix an issue where poetry env list does not list the in-project environment (#6979).

Fix an issue where poetry env remove removed the wrong environment (#6195).

Fix an issue where the return code of a script was not relayed as exit code (#6824).

Fix an issue where the solver could silently swallow ValueError (#6790).

Docs

Improve documentation of package sources (#5605).

Correct the default cache path on Windows (#7012).

poetry-core (1.4.0)

The PEP 517 metadata_directory is now respected as an input to the build_wheel hook (#487).

ParseConstraintError is now raised on version and constraint parsing errors, and includes information on the package that caused the error (#514).

Fix an issue where invalid PEP 508 requirements were generated due to a missing space before semicolons (#510).

Fix an issue where relative paths were encoded into package requirements, instead of a file:// URL as required by PEP 508 (#512).

poetry-plugin-export (^1.2.0)

Ensure compatibility with Poetry 1.3.0. No functional changes.

cleo (^2.0.0)

Fix an issue where shell completions had syntax errors (#247).

Fix an issue where not reading all the output of a command resulted in a "Broken pipe" error (#165).

Fix an issue where errors were not shown in non-verbose mode (#166).

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

This PR has been generated by Mend Renovate. View repository job log here.
opened by renovate[bot] 0
Duplicate data after updating to version 1.3.0
After updating from version 1.2.0 to 1.3.0, data from FBref is repeated twice. Sample code:

fb_data = fbref.read_player_match_stats(stat_type='summary', match_id=None, force_cache=False)
fb_data.to_csv('./summary.csv')

log from version 1.3.0

log from version 1.2.0

In version 1.3.0, the log line "Retrieving game with id=****" appears twice.
opened by DonBrowny 2
Update dependency flake8 to v6
This PR contains the following updates:

| Package | Change | Age | Adoption | Passing | Confidence |
|---|---|---|---|---|---|
| flake8 (changelog) | ^5.0.4 -> ^6.0.0 | | | | |

Release Notes

pycqa/flake8

v6.0.0

Compare Source

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

This PR has been generated by Mend Renovate. View repository job log here.
opened by renovate[bot] 2
In-depth tutorial on how to add new leagues?

Hi,

This is an amazing package! I think the docs are mostly very clear. However, is it possible to have a more in-depth tutorial on how to add new leagues to FBref? I'm trying to add the English Championship, which is available on FBref, but wasn't able to. I added a league_dict.json file (with the correct config, I assume) to the "SOCCERDATA_DIR/config/" path, but the code does not seem to pick it up when I call fbref = sd.FBref(leagues="EFL Championship", seasons=2019). It gave me a ValueError noting "Invalid League". Thank you so much!
documentation

opened by cj0121 5

Releases(v1.3.0)

v1.3.0(Nov 26, 2022)
New features

Add support for scraping World Cup data

The World Cup was added to the default available leagues for the WhoScored and FBref readers. Other tournaments can be added by modifying the league_dict.json config file.

from soccerdata import WhoScored, FBref

ws = WhoScored(leagues="INT-World Cup", seasons="2022")
fb = FBref(leagues="INT-World Cup", seasons="2022")
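As a rough sketch of what adding another tournament could look like, the snippet below writes a custom entry to a league_dict.json file. The config folder location follows the "SOCCERDATA_DIR/config/" path mentioned in the issue above. The league ID "ENG-Championship", the per-source name and the season_start/season_end keys are assumptions for illustration and should be checked against the SoccerData documentation. The helper write_league_config is hypothetical and not part of the package.

```python
import json
from pathlib import Path


def write_league_config(config_dir: Path, leagues: dict) -> Path:
    """Write a custom league_dict.json into config_dir and return its path."""
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "league_dict.json"
    path.write_text(json.dumps(leagues, indent=2), encoding="utf-8")
    return path


# Hypothetical entry: the league ID and the per-source values are assumptions.
custom_leagues = {
    "ENG-Championship": {
        "FBref": "Championship",
        "season_start": "Aug",
        "season_end": "May",
    }
}

# Example call (assumed default location; adjust if SOCCERDATA_DIR differs):
# write_league_config(Path.home() / "soccerdata" / "config", custom_leagues)
```

With such an entry in place, the reader would presumably be created with the ID used as the key, for example sd.FBref(leagues="ENG-Championship", seasons=2019), rather than with the display name shown on the website.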

Changes

The WhoScored reader now uses the non-headless mode by default. Scraping in headless mode typically results in getting blocked quickly. The old behaviour can be recovered by initializing the reader as WhoScored(..., headless=True).

Fixes

The WhoScored reader can now deal with an empty match schedule, which can occur before the start of a season or tournament round.

v1.2.0(Oct 23, 2022)
New features

Faster scraping of Big 5 leagues stats (by @andrewRowlinson)

FBref has pages for the big five European leagues that allow you to more efficiently get team and player data from multiple leagues. This commit adds a special "Big 5 European Leagues Combined" league option to get data from these pages.

import soccerdata as sd

fbref = sd.FBref(leagues="Big 5 European Leagues Combined", seasons="20-21")
team_season_stats = fbref.read_team_season_stats(stat_type="standard")
player_season_stats = fbref.read_player_season_stats(stat_type="standard")
v1.1.0(Sep 27, 2022)

New features

FBref

Faster scraping of player season stats (#69)

Previously, the fbref.read_player_season_stats method visited the page of each individual team in a league to obtain stats for the players in that league. FBref now has a single page for each league/season where stats can be obtained for every player in the league (e.g., https://fbref.com/en/comps/9/stats/Premier-League-Stats). Due to this change, the fbref.read_player_season_stats(...) method now uses 15-20x fewer requests, leading to a large speed-up.

Support retrieving "Opponent Stats" (#78)

An "opponent_stats" flag was added to the fbref.read_team_season_stats(...) function, which enables retrieving the "Opponent Stats" table of a team.

Always group "MP" under "Playing Time" (#79)

FBRef is inconsistent in how it displays the "MP" (Matches Played) column. For some seasons, it is displayed as a separate category, while it is grouped under "Playing Time" for other seasons. This results in a column with NaN values when two seasons are merged. Therefore, the "MP" column is now always put under "Playing Time".
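The idea can be illustrated with a minimal sketch that treats two-level column headers as (category, stat) tuples and always files "MP" under "Playing Time". The function name and the two sample layouts are hypothetical; the exact labels FBref uses may differ, and this is not the package's actual implementation.

```python
from typing import List, Tuple

Column = Tuple[str, str]


def group_mp_under_playing_time(columns: List[Column]) -> List[Column]:
    """Return the columns with any "MP" column moved under "Playing Time"."""
    fixed = []
    for category, stat in columns:
        # Treat a column as the matches-played column if either level is "MP".
        if "MP" in (category, stat):
            fixed.append(("Playing Time", "MP"))
        else:
            fixed.append((category, stat))
    return fixed


# Hypothetical header layouts for two seasons (the labels are assumptions).
season_a = [("MP", ""), ("Performance", "Gls")]
season_b = [("Playing Time", "MP"), ("Performance", "Gls")]

normalized_a = group_mp_under_playing_time(season_a)
normalized_b = group_mp_under_playing_time(season_b)
```

After normalization both seasons share the same column keys, so concatenating them no longer yields a separate, partly empty "MP" column. With pandas, the resulting tuples could be turned back into column labels via pd.MultiIndex.from_tuples.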

Docs

Add docs for specifying custom proxy (#83)

Not all Tor distributions use the same default port of 9050. The docs now describe how to configure a custom port.
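A minimal sketch of such a configuration, assuming the reader accepts a requests-style proxy mapping like the one shown in the v0.1.0 notes. The helper tor_proxy is hypothetical, and the socks5 scheme and localhost address are assumptions about a typical local Tor setup.

```python
def tor_proxy(host: str = "127.0.0.1", port: int = 9050) -> dict:
    """Build a requests-style proxy mapping for a local Tor SOCKS proxy."""
    address = f"socks5://{host}:{port}"
    return {"http": address, "https": address}


# Tor Browser typically listens on port 9150 instead of 9050.
proxies = tor_proxy(port=9150)
```

The resulting mapping would then be passed to the reader's proxy argument, which appears as proxy or use_proxy in the snippets on this page depending on the version.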
v1.0.0(Apr 23, 2022)
Breaking Changes

Several columns were renamed, added and dropped in the output dataframes to increase uniformity between data sources.

New features

WhoScored

The WhoScored reader can now return event data in various output formats. The following formats are supported:

A dataframe with all events.

A dict with the original unformatted WhoScored JSON.

A dataframe with the SPADL representation of the original events.

A dataframe with the Atomic-SPADL representation of the original events.

A socceraction.data.opta.OptaLoader instance.

No data. This is useful for caching data.

See https://soccerdata.readthedocs.io/en/latest/datasources/WhoScored.html for examples.
v0.1.0(Apr 22, 2022)
Breaking Changes

The use_tor parameter was replaced by a use_proxy='tor' parameter in all readers.

New features

You can specify a custom proxy using the use_proxy parameter for all readers.

ws = soccerdata.WhoScored(use_proxy={'http': 'http://126.352.12.3:5471'})

Fixes

FBref

FBref has implemented a new rate-limiting policy allowing only one request every two seconds. The FBref reader is now configured to comply with this.
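One way to respect such a limit is sketched below: a small helper that sleeps just long enough so that consecutive requests are at least two seconds apart. The RateLimiter class is hypothetical and only illustrates the technique; it is not the reader's actual code. The clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous permitted call

    def wait(self) -> float:
        """Block until the next call is allowed and return the time slept."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last_call = self._clock()
        return delay
```

A scraper would call limiter.wait() immediately before each HTTP request, so cached pages that need no request incur no delay.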

v0.0.3(Mar 20, 2022)
Bugfixes

WhoScored

The summary tab is now used as a backup for retrieving the schedule when the fixtures tab is empty. This often occurs for multi-stage tournaments. (#15)

Fixed incorrect resolver rules for the Tor proxy. (#23)

MatchHistory

Football-data.co.uk switched from http to https only.

Docs

Added example notebooks for reading data from each supported data source.

v0.0.2(Feb 16, 2022)
Bugfixes

FBref

The FBref reader crashed while scraping match stats, lineups or shots for the current season as it did not handle future games correctly.
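A simple way to avoid this class of error is to skip games that have not been played yet before requesting their match pages. The sketch below assumes a schedule represented as a list of dicts with game_id and date fields; these names and the played_games helper are hypothetical and do not reflect the reader's internals.

```python
from datetime import date


def played_games(schedule, today):
    """Keep only games dated strictly before `today`.

    Games scheduled for today are skipped as well, since they may not
    have finished yet.
    """
    return [game for game in schedule if game["date"] < today]


# Hypothetical schedule for illustration.
schedule = [
    {"game_id": "a1", "date": date(2022, 2, 12)},
    {"game_id": "b2", "date": date(2022, 2, 16)},
    {"game_id": "c3", "date": date(2022, 3, 5)},
]

finished = played_games(schedule, today=date(2022, 2, 16))
```

Only the games returned by played_games would then be passed on to the functions that scrape match stats, lineups or shots.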

Testing Improvements

Sets up CI using GitHub Actions

Sets up automatic dependency updates using Renovate bot


Owner

Pieter Robberechts

CS Engineer, PhD student in sports analytics, Data geek

GitHub Repository https://soccerdata.readthedocs.io/en/latest/

Parser manager for parsing DOC, DOCX, PDF or HTML files

Parser manager Description Parser gets PDF, DOC, DOCX or HTML file via API and saves parsed data to the database. Implemented in Ruby 3.0.1 using Acti

4 Dec 04, 2021

NetBox plugin that stores configuration diffs and checks templates compliance

Config Officer - NetBox plugin NetBox plugin that deals with Cisco device configuration (collects running config from Cisco devices, indicates config

77 Dec 21, 2022

An awesome Data Science repository to learn and apply for real world problems.

AWESOME DATA SCIENCE An open source Data Science repository to learn and apply towards solving real world problems. This is a shortcut path to start s

20.3k Jan 09, 2023

Data science on SDGs - Udemy Online Course Material: Data Science on Sustainable Development Goals

Data Science on Sustainable Development Goals (SDGs) Udemy Online Course Material: Data Science on Sustainable Development Goals https://bit.ly/data_s

1 Jan 04, 2022

A collection of online resources to help you on your Tech journey.

Everything Tech Resources & Projects About The Project Coming from an engineering background and looking to up skill yourself on a new field can be di

396 Dec 31, 2022

Documentation generator for C++ based on Doxygen and mosra/m.css.

mosra/m.css is a Doxygen-based documentation generator that significantly improves on Doxygen's default output by controlling some of Doxygen's more unruly options, supplying its own slick HTML+CSS

109 Dec 07, 2022

EasyMultiClipboard - Python script written to handle more than 1 string in clipboard

1 Jun 18, 2022

Credit EDA Case Study Using Python

This case study aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loa

1 Jan 14, 2022

Collections of Beautiful Latex Snippets

HandyLatex Collections of Beautiful Latex Snippets Table 👉 Succinct table with bold separation line and gray text %################## Dependencies ##

15 Apr 11, 2022

freeCodeCamp Scientific Computing with Python Project for Certification.

Polygon_Area_Calculator freeCodeCamp Python Project freeCodeCamp Scientific Computing with Python Project for Certification. In this project you will

1 Dec 23, 2021

Project - IBM HR Analytics employee attrition and performance

Open the code in Google Colab to view all the visualizations properly and interact with them. Access links: Noteb

1 Jan 31, 2022

API spec validator and OpenAPI document generator for Python web frameworks.

249 Dec 22, 2022

Documentation for the lottie file format

Lottie Documentation This repository contains both human-readable and machine-readable documentation about the Lottie format The documentation is avai

25 Jan 05, 2023

🌱 Complete API wrapper of Seedr.cc

Python API Wrapper of Seedr.cc Table of Contents Installation How I got the API endpoints? Start Guide Getting Token Logging with Username and Passwor

43 Dec 26, 2022

Generating a report CSV and send it to an email - Python / Django Rest Framework

Generating a report in CSV format and sending it to a email How to start project. Create a folder in your machine Create a virtual environment python3

1 Jan 17, 2022

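🍭 EPUB generator for lightnovel.us (轻之国度)

lightnovel_epub This tool generates EPUB novels from lightnovel.us (轻之国度) web pages. Note: this tool is for learning and exchange only; the author takes no responsibility for the content or for how it is used! How it works: it scrapes the HTML directly, downloads the images it contains to local storage, and then packages everything as an EPUB.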

188 Dec 30, 2022

Hasköy is an open-source variable sans-serif typeface family

Hasköy Hasköy is an open-source variable sans-serif typeface family. Designed with powerful opentype features and each weight includes latin-extended

67 Jan 04, 2023

Beta version of a system for tracking book prices at Books to Scrape,

Beta version of a system for tracking book prices at Books to Scrape, an online book retailer. In practice, in this beta version, the program will not perform real surv

1 Jan 06, 2022

In this Github repository I will share my freqtrade files with you. I want to help people with this repository who don't know Freqtrade so much yet.

My Freqtrade stuff In this Github repository I will share my freqtrade files with you. I want to help people with this repository who don't know Freqt

104 Dec 31, 2022

Near Zero-Overhead Python Code Coverage

Slipcover: Near Zero-Overhead Python Code Coverage by Juan Altmayer Pizzorno and Emery Berger at UMass Amherst's PLASMA lab. About Slipcover Slipcover

325 Dec 28, 2022
