Last fall, I used my first exposure to generalized linear models to answer a few questions regarding NFL receiving production. Some of those analyses never quite made it to this blog, but the first that did was an attempt to predict which rookie wide receivers would eventually break out. Beyond understanding the theory underlying the logistic model I put into place, I didn’t know much about the optimal predictive modelling process. In the time since, I’ve picked up a number of methods to improve that process and thought of some additional ways to enhance these models. Today, I’ll detail everything I’ve done to build on this model and the results of those changes.
Improvements – Data
Not only did I update the receiving data I had on hand, I also cleaned it up and extended it. First and perhaps most significantly, I acquired college receiving data from a new source; as I went along the first time, I had noticed several inaccuracies in my original source. The data now comes directly from the Sports Reference API.
Second, while I was acquiring data and formulating predictive features on my own, I added a new phenomenon to look into: college receiving competition. When calculating breakout predictions for the 2020 receiver class, I couldn’t help but notice how strongly the model disliked Henry Ruggs and Jerry Jeudy. Really, it just had never seen two receivers with such strong target competition before. Each receiver’s numbers were dragged down by the other’s (and Devonta Smith’s, and Jaylen Waddle’s, etc.) presence. In turn, I recorded the number of games each receiver logged playing alongside Day 1, 2, and 3 NFL Draft picks. Controlling for that competition could very well give the model important context regarding college production.
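To make the idea concrete, here’s a minimal sketch (in Python with pandas, using entirely hypothetical column names and toy rows, not my actual data) of how a competition feature like this could be counted from game logs:

```python
import pandas as pd

# Hypothetical game-log rows: one row per (player, game, teammate) pairing,
# with the teammate's eventual draft day (1, 2, or 3; NaN if undrafted).
games = pd.DataFrame({
    "player":       ["Ruggs", "Ruggs", "Ruggs", "Miller"],
    "game_id":      [1, 1, 2, 1],
    "teammate_day": [1, 2, 1, None],
})

def competition_games(df, day):
    """Count each player's distinct games alongside a Day-`day` draft pick."""
    mask = df["teammate_day"] == day
    return (df[mask].groupby("player")["game_id"].nunique()
            .reindex(df["player"].unique(), fill_value=0))

day1 = competition_games(games, 1)  # games played next to a Day 1 pick
```

A feature like `day1` could then be joined back onto the prospect table, one column per draft day.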
Finally, I modified the dependent variable itself. Before, I was predicting simply whether a player would ever break out. (A breakout season is defined as one in which a player finishes in the top 24 of receiving fantasy scoring.) However, after a certain period of time, a player needs to have shown life for both dynasty fantasy purposes (you’ll cut a player who hasn’t given you any value after a couple of years) and real-life NFL purposes (the end of the four-year rookie contract). Thus, instead of asking whether a player would ever break out, we’re now asking whether a player breaks out within his first four years in the league. Players like DeVante Parker are recategorized from hits to misses.
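A rough sketch of that relabeling, again with made-up column names and toy data rather than the real season table, could look like this:

```python
import pandas as pd

# Hypothetical season-level data: one row per player-season, with the
# player's NFL season number and a flag for a top-24 fantasy finish.
seasons = pd.DataFrame({
    "player":     ["A", "A", "B", "B", "B"],
    "nfl_season": [1, 5, 2, 3, 4],
    "top24":      [False, True, False, True, False],
})

# Old label: did the player EVER finish top-24?
ever_broke_out = seasons.groupby("player")["top24"].any()

# New label: did he do it within his first four seasons (the rookie deal)?
within4 = seasons["nfl_season"] <= 4
broke_out_by_y4 = (seasons[within4].groupby("player")["top24"].any()
                   .reindex(ever_broke_out.index, fill_value=False))
```

Player A here is the DeVante Parker case: a hit under the old label (a breakout in season 5) but a miss under the new one.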
Improvements – Methodology
Model selection and testing is hard, even when you know what you’re doing. When you don’t, it’s even harder, and it will inevitably involve some mistakes. Last time around, I didn’t particularly know what I was doing. Now, having learned about lasso regression, stepwise regression, and other forms of variable selection, I was able to quickly, efficiently, and programmatically select strong combinations of predictor variables to test out.
I’ve also learned the importance of splitting data into training, testing, and validation sets, so my model selection process is now much more sound. In the end, I used lasso regression, stepwise regression, and best subsets regression to identify new candidate models to compare against my previous model. I also manually and iteratively tweaked these models according to some ideas I thought might be beneficial.
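As an illustration of that workflow rather than my actual pipeline, here’s how a lasso-style first pass at variable selection with a held-out testing set might look in Python with scikit-learn, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the prospect table: five features, only one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Hold out a testing set; all model choices are made on the training data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# L1-penalized (lasso-style) logistic regression zeroes out weak predictors,
# giving a quick, programmatic first pass at variable selection.
scaler = StandardScaler().fit(X_tr)
model = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5)
model.fit(scaler.transform(X_tr), y_tr)

selected = np.flatnonzero(model.coef_[0])  # indices of surviving features
```

The surviving features can then seed the manual, iterative tweaking described above.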
The final model
After so much time spent on the improvements above, I came to a somewhat dissatisfying conclusion: the best model is the one I’ve already been using. This is a logistic regression model using only the round a receiver was drafted (log-transformed), whether the drafting team’s offensive coordinator is returning, the player’s maximum receiving yardage market share in college, the player’s number of breakout seasons in college (market share at or above 20 percent), his number of college games played, and the interaction between draft round and the offensive coordinator variable.
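As a sketch only, with hypothetical column names and simulated data standing in for the real prospect table, a model of this shape could be specified via statsmodels’ formula interface, where the `*` operator expands into both main effects and the draft-round-by-coordinator interaction:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated prospect table mirroring the features named above (column
# names are my own invention, not the model's real variable names).
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "draft_round":       rng.integers(1, 8, n),       # NFL draft round, 1-7
    "oc_returning":      rng.integers(0, 2, n),       # drafting team's OC returns?
    "max_ms_yards":      rng.uniform(0.05, 0.55, n),  # peak college yardage market share
    "college_breakouts": rng.integers(0, 4, n),       # seasons at >= 20% market share
    "college_games":     rng.integers(10, 50, n),
})
df["broke_out"] = (rng.uniform(size=n) <
                   0.6 * df["max_ms_yards"] / df["draft_round"] ** 0.5).astype(int)

# Log-transformed draft round, the OC flag, their interaction, and the
# three college-production terms.
fit = smf.logit(
    "broke_out ~ np.log(draft_round) * oc_returning"
    " + max_ms_yards + college_breakouts + college_games",
    data=df,
).fit(disp=0)
```

The fitted object exposes one coefficient per term (plus the intercept), including the interaction term the model relies on.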
I was quite surprised, and a bit disappointed, that my new college competition variable didn’t come into play. It was used in the two next-best models, but not in the winner. (In each, it was useful not only on its own but also as an interaction with market share.) However, looking at the data available to train the model, this makes more sense. Before 2020, only three receivers had been drafted in the first round after facing as much Day 1 college competition as Henry Ruggs, before even factoring in Devonta Smith’s or Jaylen Waddle’s presence. Further, DK Metcalf and AJ Brown are two additional, highly useful data points to train on in the future, but they were not yet eligible to be trained on either. (Each was actually a Day 2 receiver, but they are the classic case for including college competition in the model.) In time, once the model is able to train on players from these superweapon college offenses, I’m confident that the best model will include college competition.
The next most surprising, and continuing, absence was any landing spot context, outside of the offensive coordinator variable. Passing volume, returning quarterback metrics, and target competition measures all were not significantly useful. One of the two aforementioned finalist models involved incoming target competition, but it was ultimately left out. Unlike college competition, I don’t expect any of these variables to muscle into future models, as they have plenty of relevant data to train on already.
By our performance measures (area under the ROC curve on the training and testing data) and player-by-player results, I’m satisfied with the model’s performance. The model gives an AUC of 76.8 percent on players included in the training set, and the measure falls only to 70.5 percent on the testing set. In other words, when predicting the chances of prospects the model has never seen before, it assigns a higher probability to a player who would break out by year 4 than to one who wouldn’t more than 70 percent of the time. (Note: I repeated this process a few times with different seeds for the training/testing splits, giving AUC values between 65 and 80 percent across the separate splits.)
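That seed-repetition check can be sketched like so, again with synthetic data in place of the real prospect table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the prospect features and the breakout label.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Refit and score across several train/test seeds; a single split's AUC
# can be noisy, so the spread matters as much as any one number.
aucs = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

Reporting the range of `aucs` across seeds, as in the note above, guards against cherry-picking one lucky split.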
Now let’s look at the top-predicted members of the testing set:
Calvin Johnson slotting in first is a very promising sign. Other studs, like Julio Jones, Calvin Ridley, and DeAndre Hopkins also make the top 25. By the fourth column, we see that overall, the model’s top prospects tend to have pretty reliably successful outcomes. Toward the end of the list, we start to see the model’s eccentricities in Marquess Wilson, Zach Pascal, Rashard Higgins, and Hank Baskett (who somehow hauled in 908 of New Mexico’s 1429 passing yards in 2004). Each had pretty shockingly strong college production. Hopefully, including each in the training data for rookies in the coming years helps us better take that into consideration.
Now, let’s see the worst predictions for the same data set:
Reassuringly, we see few notable names. Adam Humphries’ 816-yard 2018 season barely snuck him into “breakout” status, but by the spirit of the measure, this isn’t a huge failure. Beyond him, we see Terry McLaurin, whose surprising 2019 and just-plain-impressive 2020 have proven him the exception to the rule that college production is a defining factor in NFL receiving success. Kenny Stills nearly broke out in his second year as well. In the coming years, it will be interesting to see how close Hunter Renfrow can come to a breakout season. (Currently, he’s at least 150 yards in a season away from such a milestone.)
As for our predictions when using the entire data set to train the model going forward, here are our top predictions:
As in our first run through the exercise, Demaryius Thomas finds himself atop the pack. At a macro level, I found the model’s general pessimism interesting: Thomas’s predicted breakout probability was 94.7% last time around, and the model’s 20th-favorite receiver was given a probability of 75.4% compared to just 55.8% for Calvin Ridley here. But the model’s pessimism is well-founded; no receiver’s chances of hitting this success mark are particularly high, as less than a quarter of all receivers in the sample broke out by year 4, and less than half of first-round picks broke out by year 4. Those proportions were five percent higher in each case for the previous success metric.
And now, for our worst training predictions:
After having gone over Humphries already, Wes Welker is our only other false negative. Earlier, I said that Terry McLaurin is an exception to the college production rule. Well, Welker is an exception to every rule.
Finally, let’s look at the model’s top predictions for the classes of 2019 and 2020 (neither of which were used to train the full model):
I’m really counting on that college competition variable to come into play sooner than later — I can’t have a model going around ranking Scotty Miller over Henry Ruggs and DK Metcalf for too long.
Another interesting note is how strongly the model likes the 2020 class. At a glance, we can see that the list is dominated by 2020 players. But look beyond 2019, and the 2020 class’s strength persists: sixteen 2020 prospects are given 20 percent chances or better; the next-highest count of such players for a given class since the data set begins in 2003 is 13, while nine classes had just nine or fewer such favored prospects. Brandon Aiyuk is an uninspiring crown jewel, as the model identifies him, but the class more than makes up for its lack of standout top prospects with its depth.