R Code to Scrape xG Data From FBref

I have recently talked about the best data sources for xG and the best ways to use xG data. In this post I’m providing the R code I use which:

  • Scrapes non-penalty xG (npxG) and expected assists (xA) data for all the players in the Premier League
  • Scrapes npxG for and against data for each team in the Premier League
  • Uses the above data in weighted averages to calculate npxG & xA baselines for each player (which are then used in my FPL model)
  • Uses the above data in weighted averages to calculate offensive and defensive ratings for each team (which are then used in my FPL model)

The data comes from FBref.

You can use this code to apply the advice from my previous posts in order to:

  • Make better FPL decisions
  • Build your own FPL model

How to Run the Code

To use this code, you need R installed, either on its own or together with RStudio (an editor that runs on top of R). I recommend RStudio as it looks nicer and makes it easier to view the variables the code produces.

I’ve attached two files, one called ‘Alberts_scraping_code.R’ and the other called ‘Alberts_helper_functions.R’. The second script contains the functions used by the first. The first script is the one to run: it is where the key variables are defined, and it is the easier place to see what is happening.

Below are the packages I use in my code. When running this code for the first time, it may ask you to install some of these packages, especially if you haven’t used R before. Go ahead and install the packages.

R packages needed to run the code
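
If you would rather check for missing packages up front, here is a minimal sketch of one way to do it. The names in `required_packages` are placeholders for illustration only (`rvest` and `dplyr` are real CRAN packages, but listing them here is an assumption); replace them with the packages named in the `library()` calls at the top of ‘Alberts_scraping_code.R’. `missing_packages` is a hypothetical helper, not part of the attached files.

```r
# Placeholder package names (assumption): swap in the packages that the
# library() calls at the top of 'Alberts_scraping_code.R' actually list.
required_packages <- c("rvest", "dplyr")

# Return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
if (length(to_install) > 0) {
  # Print the command rather than installing silently.
  message("Run this once: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}
```

The sketch only prints the `install.packages()` command, so nothing gets installed without you seeing it first.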

Below are the 3 main variables which you can change.

Variables to change

‘working_directory’ is the folder where you have saved the ‘Alberts_helper_functions.R’ file and where the output CSV files will be saved.
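
A common stumbling block is pointing ‘working_directory’ at a folder that does not contain the helper file. The snippet below is a rough sketch of a check you could run before the main script. `helper_file_present` is a hypothetical function and the folder path is made up, so change it to your own. The scraping script may well set the directory and load the helper file itself; this only illustrates the idea.

```r
# TRUE when `dir` exists and contains the helper script.
helper_file_present <- function(dir, helper = "Alberts_helper_functions.R") {
  dir.exists(dir) && file.exists(file.path(dir, helper))
}

# Hypothetical path (assumption): replace with your own folder.
working_directory <- "C:/Users/me/Documents/fpl_model"

if (helper_file_present(working_directory)) {
  setwd(working_directory)               # CSV files will be written here
  source("Alberts_helper_functions.R")   # load the helper functions
} else {
  message("Put 'Alberts_helper_functions.R' in ", working_directory, " first.")
}
```

Forward slashes work in R paths on Windows as well as on Mac and Linux, which avoids having to escape backslashes.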

‘window_length’ determines how many games’ worth of data to use when calculating the weighted averages. I recommend setting this variable to a number between 20 and 30.

The final variable, ‘weight_type’, determines whether a weighted mean or a normal mean is used. It’s currently set to ‘w’, which means a weighted mean is used. If you want a normal unweighted mean, set it to ‘s’.
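
To make the effect of ‘window_length’ and ‘weight_type’ concrete, here is a small sketch of a windowed average. It is not the code from the helper file: `windowed_mean` is a hypothetical function, the linear weights (1, 2, ..., n, with the most recent game weighted highest) are an assumption chosen purely for illustration, and the npxG numbers are made up.

```r
window_length <- 25   # number of most recent games to use
weight_type   <- "w"  # "w" = weighted mean, "s" = simple (unweighted) mean

# Average the last `window_length` values of `x` (ordered oldest first).
# The linear weights 1, 2, ..., n are an assumption for illustration only.
windowed_mean <- function(x, window_length, weight_type = "w") {
  recent <- tail(x, window_length)   # keeps everything if x is shorter
  if (weight_type == "s") {
    return(mean(recent))
  }
  weights <- seq_along(recent)       # most recent game gets the largest weight
  weighted.mean(recent, weights)
}

# Toy npxG values for five games, oldest first (made-up numbers).
npxg <- c(0.1, 0.3, 0.2, 0.6, 0.8)
windowed_mean(npxg, window_length = 3, weight_type = "s")  # plain mean of last 3
windowed_mean(npxg, window_length = 3, weight_type = "w")  # recent games count more
```

With these toy numbers the weighted version comes out higher than the simple one, because the two most recent games are the strongest. A longer window smooths out hot and cold streaks, while a shorter one reacts faster to changes in form.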

Once all the code has run, it should output two CSV files in the folder you chose as your working directory. The first file is called ‘all_player_stats’, which contains the npxG & xA averages calculated for each player. It looks like this:

Sample of the ‘all_player_stats’ output

The second file is called ‘team_stats’, which contains the attack and defence ratings for each team. It looks like this:

Sample of the ‘team_stats’ output

Alternatively, the corresponding dataframes ‘all_player_stats’ and ‘team_stats’ can be viewed in RStudio.
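
If you come back in a later session and want the results without re-running the scrape, you can read the saved files back in. This is a minimal sketch: `read_output` is a hypothetical helper, and it assumes the files are saved with a ‘.csv’ extension in your working directory.

```r
# Read one of the output CSVs back into a data frame.
# Assumes the file is saved as "<name>.csv" inside `dir`.
read_output <- function(dir, name) {
  path <- file.path(dir, paste0(name, ".csv"))
  if (!file.exists(path)) {
    stop("No file found at ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Example usage once the scraping script has finished
# (working_directory is the folder you chose earlier):
# all_player_stats <- read_output(working_directory, "all_player_stats")
# head(all_player_stats)   # first few rows
# str(all_player_stats)    # column names and types
```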

Let me know on Twitter or via email (albyedw@yahoo.co.uk) if you have any questions, and I hope you find the code useful.

Google Drive folder with the files to download: https://drive.google.com/drive/folders/1pnumMnD9_Wq4_aVvalhbJlocL27G7kZc?usp=sharing
