Predicting WR production with generalized linear models, part 2 – Breakouts, but later

Previously, I wrote about using generalized linear models to predict whether rookies will ever have “breakout” seasons. In that case, I used a logistic model, and because logistic models are cool (or, because I am investigating a similar question), I will do that again today. This article’s question, to be as specific as possible, is similar but different: What are the chances that a player who has yet to break out will break out this season?

Note two distinctions: We are asking this question about players who are no longer rookies (and who have yet to have a breakout season), and we are asking whether they will break out this season, regardless of future seasons. I plan to tweak the analysis in a coming article and look at the chances that one of these players will ever break out, but not today.

To repeat my previous article, this is the informal process I used to get these results (updates in bold):

My methodology has been pretty simple: Over time, I’ve accumulated data (dating back to 2000) from Pro Football Reference and collegiate production king Peter Howard to put prospect profile data and NFL production into a single data set. Then, I simply tried many, many different combinations to find what seemed like the best set of variables for predicting whether or not a player breaks out this season.

And finally, some notes before I get started:

  • A breakout season is defined as a season in which the receiver ranks in the top 24 of receivers in total Half-PPR points
  • Undrafted receivers were said to be drafted in the 8th round
  • Non-FBS players did not have college production data available, and were estimated using a simpler version of the model which excludes college production
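
To make the breakout definition concrete, here is a minimal Python sketch of how a season could be labeled. The scoring weights are an assumption on my part (a common half-PPR setup: 0.5 per catch, 0.1 per receiving yard, 6 per receiving touchdown), and the function names, field names, and toy stat lines are all made up for illustration; this is not the code behind the article.

```python
def half_ppr_points(receptions, rec_yards, rec_tds):
    # Assumed half-PPR scoring: 0.5 per catch, 0.1 per yard, 6 per TD.
    return 0.5 * receptions + 0.1 * rec_yards + 6.0 * rec_tds


def label_breakouts(season_rows, top_n=24):
    """Map each player in one season to True if he ranks in the top_n by points."""
    ranked = sorted(
        season_rows,
        key=lambda r: half_ppr_points(r["rec"], r["yds"], r["td"]),
        reverse=True,
    )
    top_players = {r["player"] for r in ranked[:top_n]}
    return {r["player"]: r["player"] in top_players for r in season_rows}


# Toy season with three hypothetical receivers; top_n is shrunk to 2
# so the cutoff is visible with so few rows.
toy_season = [
    {"player": "A", "rec": 100, "yds": 1300, "td": 9},
    {"player": "B", "rec": 60, "yds": 800, "td": 5},
    {"player": "C", "rec": 20, "yds": 250, "td": 1},
]
labels = label_breakouts(toy_season, top_n=2)
```

With the real data, top_n would stay at 24 and the labeling would be run once per season.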

For those unaware, logistic regression is similar to basic linear regression, except you do a little calculation on your fitted values in order to predict binary (0-1) variables. After many iterations and much analysis of deviance and out-of-sample testing, these are the variables I used to fit that regression:

breakout ~ log(Round drafted) +
           Years of NFL experience +
           Fantasy points scored in prev. season +
           Percentage of team's returning 
              receiving yards * whether player is returning +
           Percentage of team's returning pass attempts +
           College games played * Avg. college receiving
                                  yards market share +
           Maximum college Dominator rating

The asterisks represent interaction terms. Here’s the accompanying R printout for those interested:


Last article’s model made strong use of draft capital and college production, with a seemingly minor situational variable (whether the offensive coordinator is returning to the team) to add useful context to the draft capital. This model starts from a similar basis, using draft capital and college production, but layers on much more situational data, in addition to previous NFL experience and production.

To start with what we already know, draft capital is important. Yeah, we get it, receivers have better chances when they’re drafted earlier, cool — done.

(Again, taking the log of the round was the most effective option, when considering draft round versus draft pick, as well as whether or not to take the log of the draft capital.)
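
For a feel of what the log does to draft round, here is a small Python sketch. It only illustrates the transformation itself: the gap between rounds 1 and 2 becomes much larger than the gap between rounds 6 and 7. The helper name is hypothetical, and the round-8 convention for undrafted players comes from the notes above.

```python
import math

UNDRAFTED_ROUND = 8  # per the notes above, undrafted receivers count as round 8


def log_round(round_drafted=None):
    """Natural log of the draft round; None stands in for an undrafted player."""
    rnd = UNDRAFTED_ROUND if round_drafted is None else round_drafted
    return math.log(rnd)


# The same one-round step matters far more early in the draft than late.
gap_early = log_round(2) - log_round(1)  # log(2), about 0.69
gap_late = log_round(7) - log_round(6)   # log(7/6), about 0.15
```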

The other holdover, college production, is more interesting. Some people (myself included) would expect most of the information contained in college production to be reflected in players’ previous NFL production, thus making college stats superfluous to this model. However, adding these college stats moves the AUC (to be explained and discussed later) from 84.6 percent up to 87.0 percent. In hindsight that assumption was short-sighted; by the same logic, wouldn’t we expect previous production to reflect all the information held in draft capital as well?

Regardless, there’s a somewhat notable distinction in the college production variables used, between this model and the last model. We have switched from maximum yardage market share to average market share, and added an interaction with college games played to add context. The interaction term is intuitive here, as a higher market share is more impressive if a player sustains it over a longer period. We’ve also added the player’s maximum Dominator rating, which serves a similar role to maximum yardage market share in the previous model — measuring a player’s college peak. Finally, with the interaction between average market share and college games played, we eschewed college breakout seasons, our previous proxy for productive consistency.

It was interesting to me when creating the first model that there was almost no significance to any situational factors for a given rookie. But that changes here, as we’ve added target competition (percentage of returning receiving yardage — it happens to perform better than incoming yardage), quarterback stability (percentage of returning pass attempts), and the WR’s familiarity in the offense (whether they’re returning to the offense). We also include the interaction between returning receiving yardage and returning/not returning. Its interpretation is fairly complex (you don’t have to worry about it too much when considering the overall prediction), but you can view it generally as “little returning production is good news for a returning receiver, and bad news for a non-returner.”
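
If the interaction term feels abstract, this Python sketch may help. The three coefficients are invented purely so that the signs reproduce the pattern described above; they are not the fitted values from the model, and the function names are hypothetical. The point is only that an interaction lets the slope on returning receiving yardage differ (here, flip sign) depending on whether the receiver himself is returning.

```python
# Hypothetical coefficients on the log-odds scale, NOT the fitted values.
B_RET_YDS = 1.0        # slope on share of returning receiving yards
B_RETURNING = 1.5      # shift for a receiver returning to the same offense
B_INTERACTION = -2.5   # extra slope applied only when the receiver returns


def situational_log_odds(ret_yds_share, is_returning):
    """Contribution of these three terms to the linear predictor."""
    r = 1.0 if is_returning else 0.0
    return (B_RET_YDS * ret_yds_share
            + B_RETURNING * r
            + B_INTERACTION * ret_yds_share * r)


def slope_wrt_returning_yards(is_returning):
    """Effective slope on returning yardage for returners vs. non-returners."""
    return B_RET_YDS + (B_INTERACTION if is_returning else 0.0)


# Returner: less returning production raises the log-odds (negative slope).
returner_low = situational_log_odds(0.2, True)
returner_high = situational_log_odds(0.8, True)
# Non-returner: less returning production lowers the log-odds (positive slope).
newcomer_low = situational_log_odds(0.2, False)
newcomer_high = situational_log_odds(0.8, False)
```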

NFL experience (where 2 means the player is in his second year) is overshadowed by the other variables, but it isn’t insignificant. With a negative coefficient value (meaning the prediction decreases with more and more experience), the variable serves to weed out long-time veterans who’ve never panned out.
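
Here is a quick Python sketch of what a negative coefficient does once the "little calculation" from logistic regression (the inverse logit) is applied. The intercept and the experience coefficient below are made-up numbers for illustration only, not values from the model.

```python
import math


def inv_logit(z):
    """Turn a log-odds value into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


BASE_LOG_ODDS = -1.0   # hypothetical: everything else in the model held fixed
B_EXPERIENCE = -0.3    # hypothetical negative coefficient on years of experience

# Predicted probability for a 2nd-, 4th-, and 8th-year player, all else equal.
probs_by_year = {
    yrs: inv_logit(BASE_LOG_ODDS + B_EXPERIENCE * yrs) for yrs in (2, 4, 8)
}
```

Each extra year subtracts the same amount from the log-odds, so the predicted chance keeps shrinking for long-time veterans without ever reaching exactly zero.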

Lastly, yeah, you should definitely also account for what a guy did the year before. The z-value for previous points scored is incredibly high.

This model performs noticeably better than our rookie breakout model. That is reassuring, as we have much more useful information to work with after a player has played at least one season. To visualize this performance, I’ll use an ROC curve again. Here’s my explanation for them, from part 1:

ROC curves plot a model’s sensitivity (how often it predicts a success for an observation that is a success) against 1-specificity (how often it predicts a success for an observation that is a failure) at various confidence thresholds (for a threshold at .25, a predicted probability of .2 is considered a predicted failure and a predicted probability of .3 a predicted success). We want sensitivity to be as high as possible and 1-specificity to be as low as possible, so we want our ROC curves to be as close to straight lines from (0,0) to (0,1), then from (0,1) to (1,1), as possible.
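
As a rough illustration of how the points on an ROC curve come about, here is a Python sketch on six made-up predictions. The data and the function name are hypothetical; each threshold yields one (1-specificity, sensitivity) point.

```python
def roc_point(probs, outcomes, threshold):
    """Return (1 - specificity, sensitivity) at one threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, outcomes):
        predicted_success = p >= threshold
        if predicted_success and y == 1:
            tp += 1
        elif predicted_success and y == 0:
            fp += 1
        elif not predicted_success and y == 1:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)          # share of real successes caught
    false_positive_rate = fp / (fp + tn)  # this is 1 - specificity
    return false_positive_rate, sensitivity


# Toy predictions (made up) and whether each player actually broke out.
toy_probs = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_outcomes = [1, 1, 0, 1, 0, 0]

# Sweep from a threshold nothing clears down to one everything clears.
curve = [roc_point(toy_probs, toy_outcomes, t) for t in (1.01, 0.5, 0.25, 0.0)]
```

Plotting those points in order traces the curve from (0,0) to (1,1).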

Let’s look at this model’s ROC curve, in blue, compared to that of a hypothetical, useless model, in black:


We can quantify the model’s performance by looking at the area underneath the curve (the aforementioned AUC), which tells us how frequently the model will correctly predict a higher breakout probability for a player who ends up breaking out than for a player who doesn’t end up breaking out. The AUC for this model is 0.87, meaning that it makes that correct prediction 87 percent of the time (up from 84 percent in the last model). That could be worse.
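
That pairwise reading of the AUC can be computed directly. Below is a small Python sketch, on made-up data with a hypothetical function name, that compares every player who broke out against every player who did not and counts how often the breakout player got the higher prediction (ties count as half).

```python
def auc_by_pairs(probs, outcomes):
    """Share of (breakout, non-breakout) pairs ranked in the right order."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(pos) * len(neg))


# Toy predictions (made up): three breakouts, three non-breakouts.
toy_probs = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
toy_outcomes = [1, 1, 0, 1, 0, 0]
toy_auc = auc_by_pairs(toy_probs, toy_outcomes)  # 8 of 9 pairs ordered correctly
```

This brute-force version is fine for a sketch; it gets slow on large data sets, where a rank-based calculation is the usual choice.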

Now, for the fun part, let’s look at past results. Here’s the model’s 25 favorite breakout picks since 2000:


A lot of expected names here. The only player picked after day 2 was Josh Gordon, whose chances rose to near 50-50 after a strong rookie showing. Promisingly, the top four names on the list indeed broke out that season, and equally promisingly (given that those predictions were only 70-76 percent apiece), the fifth did not. Another point of intrigue: Look how few players have gotten strong chances from 2014 to 2018. Maybe there’s something to that, I think aloud.

Given the smallish prediction sizes here (76.3 percent or less), this model also serves as a good reminder that breakout seasons from these types of players are never all that likely. Even when all the signs are positive into the early stages of the season, things can go wrong — observed most recently in Cooper Kupp’s case last year. (Now healthy, he’s been making up for lost time.)

Meanwhile, here are the model’s 25 least favorite breakout chances:


Late round picks, college quarterbacks, veterans, and more. The lowest predicted chances for a player who went on to break out belong to one humorously named Mike Furrey, who rose above a 0.3 percent chance to record 1,086 yards (nearly half of his career total) in 2006. Here is Furrey alongside the 19 other players who most overcame the odds:


At first, you may be surprised to see so many players with such low probabilities end up breaking out, but bear in mind that these 20 successes emerge from roughly 1,000 players with predicted chances of 6.8 percent or less. In fact, the expected number of breakouts for that set of 1,000 is 20.375, so the results are actually pretty strongly in line with expectations.
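
The expected number of breakouts in a group is just the sum of the predicted probabilities. The Python sketch below shows that on four made-up probabilities, then uses the article's own figures (20 observed, 20.375 expected, every probability at 6.8 percent or less) to bound how far apart the two are in standard deviations. The spread calculation assumes players' outcomes are independent of each other, which is my simplification and not something the model guarantees.

```python
import math


def expected_and_sd(probs):
    """Expected count and standard deviation, assuming independent outcomes."""
    expected = sum(probs)
    variance = sum(p * (1.0 - p) for p in probs)
    return expected, math.sqrt(variance)


# Toy group of four long shots (made-up probabilities).
toy_expected, toy_sd = expected_and_sd([0.05, 0.02, 0.01, 0.003])

# Figures quoted in the article for the roughly 1,000 low-probability players.
EXPECTED = 20.375
OBSERVED = 20
MAX_PROB = 0.068

# Each p is at most MAX_PROB, so p * (1 - p) >= p * (1 - MAX_PROB); summing
# gives variance >= EXPECTED * (1 - MAX_PROB), a lower bound on the SD.
sd_lower_bound = math.sqrt(EXPECTED * (1.0 - MAX_PROB))
gap_in_sds_at_most = abs(OBSERVED - EXPECTED) / sd_lower_bound
```

Under that independence assumption, the observed total sits less than a tenth of a standard deviation from the expected one.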

The only notable trend I see among those players is that many were between their second and fourth years in the same offense that they were in the year before. Sometimes, it’s good enough for a player to slowly work up the depth chart and stake his claim when the time is right, no matter how anonymous they may have been before. (Looking at you, Tyrell Williams.) Otherwise, the other lesson is to give college QBs a solid bump in these models if you hadn’t already.

In time, I’ll get my hands on data to fit these predictions (retroactively) to the 2019 season and (proactively) on the 2020 season.
