How/What/Why? - Player-Level npxG 
Ever see all these "npxG/90 baselines" and wonder what on Earth is going on?
In this very late 2nd installment of my #FPL spready series I explain how I get my baselines AND examine how much weight to put on recent #data.

Intro
As well as being behind on threads, this was my motivation for investigating weighting:
https://twitter.com/FPLRoosta/status/1344726987655569410?s=19
Skip to:
Weighting
Result
Footnotes{1}

npxG = non-penalty expected goals. It's a measure of shot quantity & quality, excluding pens{2}.

What is an npxG baseline?
Suppose Salah's is 0.7npxG/90. This means in an average PL fixture{3} I expect his npxG to be 0.7*M/90 if he plays for M minutes.
Per90 is the most accessible/popular, but in my spready I use exact mins{4} (includes injury time) & npxG/min
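A minimal Python sketch of this arithmetic. The function names are illustrative and not taken from the spready; the 60 minutes is an invented example.

```python
def expected_npxg(baseline_per90: float, minutes: float) -> float:
    """Expected npxG in an average fixture for a player who plays `minutes`."""
    return baseline_per90 * minutes / 90.0


def per90_to_per_min(baseline_per90: float) -> float:
    """The same baseline expressed per minute instead of per 90."""
    return baseline_per90 / 90.0


# A 0.7 npxG/90 baseline over 60 exact minutes (illustrative numbers)
print(round(expected_npxg(0.7, 60), 3))
```

Working per minute and multiplying by exact minutes gives the same answer as the per-90 form, it just avoids rounding every appearance to a notional 90.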

Why exact time?
Often it won't matter, but if a player is usually subbed on or usually subbed off, and their situation then changes so they play more/fewer mins, your data will be out:
https://twitter.com/theFPLkiwi/status/1288400761760686081?s=19
The benefit is ~small but so is the extra effort.

Where is it from?
Baselines are estimates using past data for a player. I pull GW data from @FFScout then put npxG in line with @fbref{5}. Some other data sources:
https://twitter.com/uncertainty_pod/status/1309517194733248517?s=19
(give @uncertainty_pod a listen, especially if you want to do your own spready)

Where exactly?
In "data" is npxG & mins for every player in every PL game since the start of 17/18 (36,021 npxG data points).
In "fix_list" is team DEF strength (npxGA/game v an avg opponent) for both teams in every PL game. They are current beliefs of past strength.
I'll cover these another time. They are similar to e.g. @rogue_wee's (https://twitter.com/rogue_wee/status/1346867567114260481?s=19), except those are past beliefs of past strength, i.e. what he thought at the time from data before that game. Please don't spam him with questions as I believe he's taking a step back.

Let's use Salah as an example.
For each of the 131 LIV PL games since joining{7} I have:
d = fixture difficulty (opponent DEF strength * home/away multiplier)
m = exact mins
x = @StatsBomb npxG via fbref{6}
So I consider performance to be a function of x/(dm).

Getting technical
Let the i'th game's data be d_i, m_i, x_i for i=1,...,131.
Salah's overall performance (npxG/min vs an avg team) is:
Σx_i / Σd_i*m_i {8}.
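To make the ratio of sums concrete, here is a small Python sketch. The three games are invented, `overall_rate` is a hypothetical helper name, and it assumes d is scaled so that 1.0 is an average fixture.

```python
def overall_rate(games):
    """npxG per minute v an average team: sum(x_i) / sum(d_i * m_i).

    `games` is a list of (d, m, x) tuples: fixture difficulty, exact
    minutes and npxG. Games not played (m = 0, x = 0) add nothing.
    """
    total_x = sum(x for _, _, x in games)
    total_dm = sum(d * m for d, m, _ in games)
    return total_x / total_dm


# Invented data: an average fixture, a fixture with d = 0.5, and a game missed
games = [(1.0, 90, 0.9), (0.5, 90, 0.45), (1.2, 0, 0.0)]
print(round(overall_rate(games) * 90, 2))  # per-90 equivalent

# Swapping which game the npxG came in leaves the rate unchanged (see {8})
swapped = [(1.0, 90, 0.45), (0.5, 90, 0.9), (1.2, 0, 0.0)]
print(abs(overall_rate(games) - overall_rate(swapped)) < 1e-12)
```

Because the sums are taken before dividing, only the totals matter, which is the property footnote {8} describes.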
For a baseline we value recent data more highly, e.g. 20/21 > 17/18, so I apply a weighting w_i to each game. This gives:
Σx_i*w_i / Σd_i*m_i*w_i
(w_i>w_j for all i>j)
Many when starting (inc. me) apply the same weight to all games from the same part of a season, e.g. pre-covid w=0.2, post-covid w=0.5, 20/21 w=1.
Now I make w an exponential function of time! But why?
It means the relative weight of 2 games depends only on the time between them.
E.g. the Boxing Day games 2017-19:
w_57/w_20 = w_95/w_57 = c (a constant).
Let t_i be the time in years{9} since game i. The final formula is:
Σx_i*c^-t_i / Σd_i*m_i*c^-t_i
(I use c=2)
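A hedged Python sketch of the weighted formula. `weight`, `weighted_baseline` and the tuple layout are assumptions made for illustration, and the two games are invented.

```python
def weight(t_years, c=2.0):
    """Exponential weight c^-t for a game played t_years ago."""
    return c ** -t_years


def weighted_baseline(games, c=2.0):
    """sum(x_i * c^-t_i) / sum(d_i * m_i * c^-t_i), in npxG per minute.

    `games` is a list of (d, m, x, t) tuples, t = years since the game.
    """
    num = sum(x * weight(t, c) for _, _, x, t in games)
    den = sum(d * m * weight(t, c) for d, m, _, t in games)
    return num / den


# Two games one year apart have weight ratio c, whichever pair is picked
print(weight(1.0) / weight(2.0), weight(2.5) / weight(3.5))

# Invented data: 0.9 npxG today and 0.3 npxG a year ago, both 90 mins, d = 1
games = [(1.0, 90, 0.9, 0.0), (1.0, 90, 0.3, 1.0)]
print(round(weighted_baseline(games) * 90, 2))        # c = 2
print(round(weighted_baseline(games, c=1.0) * 90, 2))  # c = 1: no decay
```

Setting c = 1 recovers the unweighted ratio, so the decay rate is the only extra parameter this scheme introduces.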
Finding the best weighting
The goal now is to find the best c for predicting future npxG.
I'll be using fbref npxG as it's the most predictive{10}, using @fbref scraped data{11} from @FF_Trout, who has been a great sounding board for this work along with @fplreview.
Method
I tried using all the data but the noise resulted in nonsense (-ve weighting) so...
Clean the data
I took the PL data (every match from 17/18 to now) & removed players with <1500 mins (weighting irrelevant) or <0.1 npxG/game (wouldn't pick in FPL).

This leaves 11,717 rows of data, of which 7,236 have a player with >=1500 mins to form a baseline from.
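A rough Python sketch of this kind of filter. The thread does not spell out exactly how the thresholds were applied, so this assumes total minutes and npxG per appearance for the player filter, and minutes played before each row for the "can form a baseline" check. The helper names are hypothetical.

```python
def keep_player(games, min_minutes=1500, min_npxg_per_game=0.1):
    """games: list of (minutes, npxg) rows for one player, in date order."""
    played = [(m, x) for m, x in games if m > 0]
    if not played:
        return False
    total_minutes = sum(m for m, _ in played)
    npxg_per_game = sum(x for _, x in played) / len(played)
    return total_minutes >= min_minutes and npxg_per_game >= min_npxg_per_game


def rows_with_baseline(games, min_minutes=1500):
    """Indices of rows preceded by at least min_minutes of playing time."""
    rows, minutes_so_far = [], 0.0
    for i, (m, _) in enumerate(games):
        if minutes_so_far >= min_minutes:
            rows.append(i)
        minutes_so_far += m
    return rows


# Invented player: twenty full games at 0.3 npxG each
games = [(90, 0.3)] * 20
print(keep_player(games), rows_with_baseline(games))
```

Only rows that already have enough prior minutes can be predicted, which is why the usable count is smaller than the cleaned row count.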
I used my own fixture ratings (DEF strength * home advantage) as I don't know of any public DEF ones, but @FiveThirtyEight do have overall team ratings {12}.
Using the weighted formula above, for each performance I calculated a prior baseline and multiplied it by mins and fixture difficulty to give an npxG prediction.
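As a sketch of what a "prior" prediction could look like in Python: only games before the one being predicted feed the baseline, with time measured back from that game. `predict_npxg`, the tuple layout and the dates are assumptions for illustration.

```python
from datetime import date


def predict_npxg(past_games, game_day, d, m, c=2.0):
    """Prior baseline (npxG/min v an average team) * difficulty * minutes.

    `past_games` is a list of (day, d_i, m_i, x_i) tuples. Games on or
    after `game_day` are ignored so the prediction never sees its target.
    """
    num = den = 0.0
    for day, d_i, m_i, x_i in past_games:
        if day >= game_day:
            continue
        w = c ** (-(game_day - day).days / 365.0)
        num += x_i * w
        den += d_i * m_i * w
    if den == 0:
        raise ValueError("no prior minutes to form a baseline from")
    return (num / den) * d * m


# Invented data: one earlier game, plus the target game itself
history = [
    (date(2020, 1, 1), 1.0, 90, 0.9),
    (date(2021, 1, 1), 1.0, 90, 0.2),  # the game being predicted, skipped
]
print(round(predict_npxg(history, date(2021, 1, 1), d=0.8, m=90), 2))
```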
I did this for many values of c, and for each one calculated the rmse{13} of the 7,236 predictions.
I start with a wide range of c values and home in on the best.
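The search over c can be sketched in Python as below. This is an illustration under assumptions, not the actual spready: each sample is a (history, d, m, actual npxG) tuple, history rows are (d, m, x, t) with t in years before the predicted game, and the toy data is invented.

```python
import math


def prior_baseline(history, c):
    """Weighted npxG per minute from (d, m, x, t) rows."""
    num = sum(x * c ** -t for _, _, x, t in history)
    den = sum(d * m * c ** -t for d, m, _, t in history)
    return num / den


def rmse_for_c(samples, c):
    """Root mean squared error of the predictions for one value of c."""
    squared = [
        (prior_baseline(history, c) * d * m - actual) ** 2
        for history, d, m, actual in samples
    ]
    return math.sqrt(sum(squared) / len(squared))


def best_c(samples, candidates):
    """The candidate c with the lowest rmse."""
    return min(candidates, key=lambda c: rmse_for_c(samples, c))


# Toy sample where the recent game is the better guide to the next one
samples = [([(1.0, 90, 0.9, 0.1), (1.0, 90, 0.1, 2.0)], 1.0, 90, 0.9)]
coarse = [1.0, 2.0, 4.0]
print(best_c(samples, coarse))
```

In practice the candidate list would start coarse and then be refined around the best value, repeating until the rmse curve is flat enough.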
I also considered variations on weighting:
- Extra fake days in between seasons due to big changes such as transfers
- Do this to the lockdown period instead of Aug 2020
- Include both lockdown & Aug
- Subtract days instead (as footballers aren't playing/training)
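One way such variations could be expressed, as a sketch only. The break dates, the 30 extra days and the function names are all assumptions for illustration, since the thread does not give the values tried.

```python
from datetime import date


def effective_days(game_day, today, breaks=(), extra_days=0):
    """Real days elapsed, plus extra_days for every break crossed.

    A negative extra_days gives the "subtract days" variant.
    """
    elapsed = (today - game_day).days
    crossed = sum(1 for b in breaks if game_day < b <= today)
    return elapsed + extra_days * crossed


def weight(game_day, today, c=2.0, breaks=(), extra_days=0):
    """Exponential weight using the adjusted number of days."""
    days = effective_days(game_day, today, breaks, extra_days)
    return c ** (-days / 365.0)


# Hypothetical break marker in Aug 2020 between the two seasons
breaks = [date(2020, 8, 15)]
game, today = date(2020, 7, 26), date(2021, 1, 1)
print(effective_days(game, today))                         # vanilla
print(effective_days(game, today, breaks, extra_days=30))  # extra fake days
print(effective_days(game, today, breaks, extra_days=-30)) # subtract days
```

Adding more break dates to the list covers the "both lockdown & Aug" case without changing the functions.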

The Results
For this method and data, the result is clear - the lowest rmse is achieved with a decay rate of:
c ~= 2.1
This represents a rate of decay of 2.1 per year, or halving the weight of a game every ~340 days.
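The halving time follows from solving c^-t = 1/2 for t, which gives t = ln 2 / ln c years. A quick Python check, assuming a 365-day year (the helper name is illustrative):

```python
import math


def half_life_days(c, days_per_year=365.0):
    """Days for a game's weight to halve when weights decay as c^-t."""
    return days_per_year * math.log(2) / math.log(c)


print(round(half_life_days(2.1)))  # about 341, in line with ~340 days
print(round(half_life_days(2.0)))  # c = 2 halves exactly once a year
```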
You can see this result in the graphs below:
The variations on weight did not make much difference, and all performed worse than the vanilla model.
While this matches my intuition, I'm surprised how closely! I would invite anyone with a different intuition to repeat this as a check against any possible bias.

Interpretation
On average, it is more predictive to use long term player data rather than the last few games or only including this season.
However, @analytic_fpl makes a point here{14} that FPL is a game of identifying outliers, so this result is to be used carefully!
There are more sophisticated ways to weight games, e.g. higher weight when they play:
- in the same position
- in the same formation
- with the same teammates
- for the same coach
- where they have "something to play for"
In general checking all these is difficult & noisy.

Conclusion
In my model I will be setting c = 2.1 from now on. I hope to run a similar experiment later on team strengths, which will then allow me to check again at the player-level for the other top 5 leagues. I can then check assists & other points-scoring actions.

Footnotes
{1} Not notes about feet.
{2} Pens done separately.
{3} Average over all fixtures including v LIV despite this being impossible for Salah.
{4} May be elsewhere - I use #FFScout membership https://www.fantasyfootballscout.co.uk/2020/04/20/new-per-90-stat-available-in-ffscout-members-area/
{5} https://fbref.com/en/comps/9/stats/Premier-League-Stats
{6} I get more precise values than the 1d.p. for each game on fbref by taking snapshots of the 2d.p. per90 player numbers.
{7} Includes games Salah didn't play.
{8} "Σ" means sum. This method values 3 goals vMCI & 0vWBA the same as 3vWBA & 0vMCI (as both score the same pts). Don't know whether this is more predictive, so could be investigated another time.
{9} Due to leap years I use days in my spready, but won't overcomplicate.
{10} Many examples e.g. @thesignigame: https://twitter.com/thesignigame/status/1341050217467142152?s=19
{11} https://twitter.com/FF_Trout/status/1347668718856368130?s=19
{12} https://github.com/fivethirtyeight/data/tree/master/soccer-spi
{13} Rmse = root mean squared error - a standard tool to evaluate predictions. Error = prediction - value.
{14} https://twitter.com/analytic_fpl/status/1344655689944281096?s=19

Thanks to:
@FF_Trout
@fplreview
@analytic_fpl
@FiveThirtyEight
@thesignigame
@fbref
@FFScout
@StatsBomb
@rogue_wee
@uncertainty_pod
@FPLRoosta
Kiwi out