Previously, I laid out (and further explained) the founding statistics of my study. This time around, I am going to start really looking into the data itself. So, what data are we looking at, exactly?
Early on in the process, I thought watching three-to-four games per running back would suffice for a sample size… that belief was misguided. I soon realized that I’d need a much larger collection of games to really find meaning behind individual runners’ stats, so after getting a solid baseline of games for about 15 runners, I began to go through each game from Week 1 through Week 15 (to keep weird, end-of-season games from messing things up) for an individual player, one at a time. I started with Le’Veon Bell, moved on to Isaiah Crowell, and after a couple of weeks, I have amassed complete data for seven backs: Bell, Crowell, Ezekiel Elliott, Devonta Freeman, Jay Ajayi, Melvin Gordon, and Todd Gurley. Those players’ games make up the majority of the approximately 135 total games sampled.
With this collection we can compare these players against one another, look at how they stack up against the entire sample, and examine the general trends for the entire sample. Today, I’m going to start with the last of those three. To keep things simple from here on, Yards Per Carry will represent rushing production. It’s not perfect, but it’s good enough for this exercise.
What affects rushing production? How much so?
Generated Yards Per Carry
Since Running Back and Offensive Line-Generated Yards derive from Yards Per Carry, they will obviously be highly related. Logically, the combination of the two Generated Yards stats yields a perfect R-squared coefficient of 1 with Yards Per Carry. Individually, the back’s production matters much more, with a strong coefficient of .7 compared to a meager .23 from the offensive line. Obviously, if we can figure out RB-GY/C and OL-GY/C for a set of running and blocking units, that’s optimal. But there are a lot of different factors that go into these measures, so even predicting these numbers is troublesome. Thus, I’ll be trying to create approximate representations of these stats with simpler numbers. (Also, this is a good early indication that back play does indeed matter more to rushing efficiency.)
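For anyone who wants to check this kind of relationship themselves, here is a minimal sketch of the calculation, assuming "R-squared coefficient" means the squared Pearson correlation between two columns. The helper name `r_squared` and every number below are made up for illustration; none of it is data from my sample.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Hypothetical per-back values, NOT real data from the study.
ypc = [4.5, 4.0, 4.6, 3.5]              # Yards Per Carry
rb_gy_per_carry = [2.9, 2.1, 3.4, 1.8]  # RB-Generated Yards Per Carry
# By definition, OL-GY/C is whatever part of Y/C the back did not generate.
ol_gy_per_carry = [y - rb for y, rb in zip(ypc, rb_gy_per_carry)]

combined = [rb + ol for rb, ol in zip(rb_gy_per_carry, ol_gy_per_carry)]

print(r_squared(combined, ypc))         # the two parts together rebuild Y/C
print(r_squared(rb_gy_per_carry, ypc))  # the back's share on its own
print(r_squared(ol_gy_per_carry, ypc))  # the line's share on its own
```

The first value comes out at 1 (up to floating-point rounding) no matter what numbers go in, because the two parts add up to Y/C by construction. The individual values depend entirely on the data.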
Hole Rate, Success Rate, and Generated Yards per Hole Hit
On their own, the Hole Rate (representing an entire blocking unit’s ability to provide opportunities) and Success Rate (representing a back’s ability to get what blockers provide) statistics have almost nothing to do with Yards Per Carry, providing R-squared coefficients of .02 apiece. However, they become quite important when we create our new approximations of RB-GY and OL-GY.
Meanwhile, RB-GY/Hit (representing a back’s creativity and explosiveness) does a fair job estimating Y/C on its own (R-squared coefficient: .42). No matter how you slice it, creativity is vital for rushing production. However, we can make this stat more powerful by combining it with Hole Rate and Success Rate.
With standardized scores for each of the three stats, we combine them to create a new sort of RB-GY/C stat (let’s call it, “Adjusted RB Yards (Per Carry)”):
It’s not quite as predictively powerful for Y/C as RB-GY, but it is much easier to understand and forecast than the catch-all. Instead of one stat where we try to evaluate a multitude of running back traits, we’ve separated Adjusted RB Yards (Adj. RB Y) into two more specific stats (that can be evaluated even further in SR’s case). If we properly weight each stat in the equation, we can match the predictive power of RB-GY/C too. We’ll get into weighting later. Without it, we’ve already got an R-squared coefficient of .53, which isn’t bad.
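As a rough sketch of what combining standardized scores can look like, the snippet below converts each stat to a z-score and adds the three together for each back. The plain unweighted sum is an assumption on my part for illustration, the function names are hypothetical, and the inputs are invented rather than taken from the sample.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def adjusted_rb_yards(hole_rate, success_rate, rb_gy_per_hit):
    """Unweighted composite: sum of the three standardized stats per back."""
    columns = [
        z_scores(hole_rate),
        z_scores(success_rate),
        z_scores(rb_gy_per_hit),
    ]
    return [sum(parts) for parts in zip(*columns)]

# Made-up numbers for three hypothetical backs, for illustration only.
hole_rate = [0.38, 0.42, 0.45]
success_rate = [0.70, 0.80, 0.75]
rb_gy_per_hit = [3.1, 4.6, 3.9]

adj = adjusted_rb_yards(hole_rate, success_rate, rb_gy_per_hit)
print(adj)
```

Standardizing first puts a rate like Hole Rate and a yardage figure like RB-GY/Hit on the same scale, so neither dominates the sum simply because of its units.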
The final step in reconstructing Yards Per Carry is to add OL-GY back into the equation. The purpose OL-GY/C serves varies a bit between a weighted and unweighted combination of reconstructed RB-Generated Yards Per Carry. Without weight, OL-GY/C is simply itself, recreating the Y/C equation:
(OL-GY + RB-GY) / Carries = Yards / Carries
However, when we weight the equation, Adj. RB Y turns into more of a measure that describes what could become of a rush. It takes the likelihood that a hole will open and that the hole will be reached, and combines it with the runner’s contribution to a correctly-hit hole in order to finally return an expected gain after the back has gotten to a critical point. OL-Generated Yards Per Carry now provides a baseline, the production that can happen if nothing else is going on in the play (no hole opened), something goes wrong (the hole is missed), or the back just doesn’t create. If you’re a math person, you can look at this equation as a sort of slope-intercept formula: as a baseline, OL-GY/C is the intercept while Adjusted RB Yards acts as the slope. Of course, play-to-play and team-to-team, each stat will change. Thus, it’s not a perfect interpretation, but the general baseline-upside (and downside) view of OL-GY/C and Adj. RB Y is the best way to interpret a weighted formula.
The bonus we get from this weighted formula is that we now have a strong predictor of Y/C that can capture all the nuances of the rushing game. Let’s look at how this horribly-named “Supercomposite” breaks down:
This full breakdown now lets us work both from big to small/top to bottom and small to big/bottom to top; we could start with the larger numbers and use those to evaluate a back’s traits (as was my original intention when I thought them up for scouting) or start with evaluated traits to project or examine performance (the object of this study). When we finally add our specific weights, how well can we perform the latter?
Weighting the Supercomposite
What exactly went into my weighting process? In short, theorizing and plenty of tweaking.
I began by envisioning the scenarios that would cause each stat to matter. This flow chart can simplify the thought process:
As you can see, the two offensive line measures will matter every (rushing) play. As discussed above, OL-GY provides a basic collection of yards that the back should be able to record. With Hole Rate, we see if there’s greater potential to the play (note: OL-GY doesn’t always stop counting once the RB has hit a hole; if it’s huge, the yards continue to go to the OL). If there isn’t a hole, we see if the back can do anything, which relies entirely on RB-Generated Yards. Otherwise, in the cases that there is a hole, Success Rate will see if the back can make use of what’s provided. If he doesn’t, the play is over. If he does, RB-GY/Hit again is put to the test and becomes increasingly important, as this situation is where running backs mostly differentiate themselves, where small skill discrepancies make the difference between five- and twenty-yard gains.
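The branching described above can be sketched as a small function that lists which stats come into play on a given rush. This is only my reading of the flow in code form; the function name `stats_in_play` and the string labels are hypothetical.

```python
def stats_in_play(hole_opened: bool, hole_hit: bool = False) -> list[str]:
    """Return the stats that matter on one rushing play."""
    # The two offensive line measures matter on every rushing play.
    stats = ["OL-GY", "Hole Rate"]
    if not hole_opened:
        # No hole: whatever happens next depends on the back creating.
        stats.append("RB-GY")
        return stats
    # A hole opened: Success Rate decides whether the back uses it.
    stats.append("Success Rate")
    if hole_hit:
        # Hole hit: the back's creativity and explosiveness take over.
        stats.append("RB-GY/Hit")
    # If the hole is missed, the play is over and nothing more is added.
    return stats

print(stats_in_play(hole_opened=False))
print(stats_in_play(hole_opened=True, hole_hit=False))
print(stats_in_play(hole_opened=True, hole_hit=True))
```

Laying the branches out this way makes it easy to see that Success Rate only appears on plays where a hole opens, which is the reasoning behind giving it a bit less weight than Hole Rate.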
The most logical step here is to make Success Rate a bit less important than Hole Rate, because Hole Rate matters on more plays, and there’s no increased importance to SR like there is for RB-Generated Yards. The next natural reaction to the flow chart is to make Hole Rate and OL-GY worth the same, but I figured OL-GY would be a little bit more important… after lots of weight tweaking, it appears that I was wrong: The Supercomposite correlates best with Yards Per Carry when OL-GY/C and HR are worth the same amount (much of this has to do with overlap between the stats; there’s an R-squared coefficient of .44 between them). The hard part became how I’d weight RB-Generated Yards.
On paper, it falls between the blocking stats and SR (the likelihood of not getting a hole is greater than that of not hitting a hole) in terms of the proportion of plays it matters. However, the importance gets magnified once the player has found daylight. Additionally, RB-GY Per Carry, the original main driver in Y/C correlation, is highly related to RB-GY Per Hole Hit. Thus, after many more weight tests, RB-GY/Hit is the most important stat in the Supercomposite.
Here are the tentative, but running, results for weighting:
- RB-GY/Hole Hit: 50%
- OL-GY/Carry: 17.5%
- Hole Rate: 17.5%
- Success Rate: 15%
- R-squared coefficient, each back’s Supercomposite with each back’s Y/C: .96
- R-squared coefficient, full-sample backs’ Supercomposite with Y/C: .98
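To make the weights concrete, here is a minimal sketch of applying them to one back. It assumes the weights are applied to standardized (z-score) versions of each stat and summed; the dictionary keys, the function name `supercomposite`, and the example z-scores are all hypothetical.

```python
# Weights from the list above, keyed by hypothetical stat names.
WEIGHTS = {
    "rb_gy_per_hit": 0.50,
    "ol_gy_per_carry": 0.175,
    "hole_rate": 0.175,
    "success_rate": 0.15,
}

def supercomposite(z):
    """Weighted sum of a back's standardized stats (assumed z-scores)."""
    return sum(WEIGHTS[name] * z[name] for name in WEIGHTS)

# Invented z-scores for one hypothetical back, for illustration only.
example_back = {
    "rb_gy_per_hit": 1.0,
    "ol_gy_per_carry": -0.5,
    "hole_rate": 0.0,
    "success_rate": 0.5,
}

print(supercomposite(example_back))
```

In this made-up case, a back who is one standard deviation above average in RB-GY/Hit still grades out well above average overall despite below-average blocking, which is what a 50% weight implies.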
The most striking result from creating the Supercomposite was how much the running back’s creativity and explosiveness matter: RB-GY/Hit comprises an entire half of the greater composite! When we look at Y/C leaders from 2016, this starts to make sense. The top of the list is led by the likes of LeSean McCoy, Le’Veon Bell, and Ezekiel Elliott, all of whom are both creative and deadly in the open field. Jamaal Charles, a perennial Y/C leader when healthy, is one of the most explosive backs in recent history.
Jordan Howard is a strange appearance near the top of that list at a glance, but my stats reinforce his appearance: He had the fifth-highest RB-GY/Hole Hit of the backs I’ve sampled. His strategy of “run through all these smaller dudes” at the second level certainly worked, leading him to such high marks on the way to a great rookie year. I’d like to dive further into his games to see just how well his unconventional approach worked over the entire season.
Also notable is how close Success Rate is to Hole Rate in weight. The average Hole Rate hovers in the 40% range, so it’d logically follow that SR should be worth about 40% as much. However, SR gains importance because the differences between backs are more substantial than those between blocking units.
My next reaction has to be to those R-squared coefficients. I tested all the weighting on my entire sample of backs, and was worried that it’d get killed when I just used it for my full-sample backs, but the correlation actually increased. I’m not too worried about over-fitting here, either, as these weights include pretty round derivations:
- Half weight to RB-GY/Hit, half to everything else
- 70% remaining weight to blocking, 30% to Success Rate (I’m contemplating switching to 67% and 33%, two thirds to one, to further distance the Supercomposite from over-fitting.)
- Blocking weight split evenly between OL-GY/C and Hole Rate
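The three rules above are enough to rebuild the final weights with simple arithmetic. The sketch below does that; the function name `derive_weights` and its parameter names are hypothetical, and it also lets you try the two-thirds alternative mentioned above.

```python
def derive_weights(rb_share=0.5, blocking_share=0.7):
    """Rebuild the four weights from the three splitting rules."""
    remaining = 1.0 - rb_share              # everything that is not RB-GY/Hit
    blocking = remaining * blocking_share   # blocking's part of the remainder
    return {
        "RB-GY/Hit": rb_share,
        "OL-GY/C": blocking / 2,            # blocking split evenly
        "Hole Rate": blocking / 2,
        "Success Rate": remaining * (1.0 - blocking_share),
    }

current = derive_weights()                          # 70% / 30% split
alternative = derive_weights(blocking_share=2 / 3)  # two-thirds / one-third

print(current)
print(alternative)
```

One thing the arithmetic makes visible: under the two-thirds split, OL-GY/C, Hole Rate, and Success Rate would each carry the same one-sixth weight.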
I’m obviously quite pleased that such simple weights could provide that high a correlation, but I don’t have a huge head about it. When you think about it, the dumbed-down versions of these stats basically recreate Y/C (recall the deconstruction of Adjusted RB Yards and how it parallels RB-GY Per Carry). That’s exactly what I set out to do! The Supercomposite has separated the Y/C equation into one that includes variables much easier to digest and understand in singular running back (and offensive line) traits, and then broken down just how important each of those traits is.
I do feel like both coefficients being north of .95 is unsustainable once I begin to add even more games to the sample, if only because it’s really hard to keep such a high level of correlation in general. However, I don’t think the coefficients should dip much lower after standing tall through so many sampled games. The statistic was never meant to be perfect, either.
The greater point is that we now have a basis to say how much each factor of the running game matters. We can use evaluations of each trait and the Supercomposite’s best-fit line to actually forecast Yards Per Carry. We can see what it takes for a strong running back to overcome poor blocking (the subject of my first article), as well as the reverse. With my full-sample backs, we can even assess both their own and their blocking units’ abilities to contribute to production.
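For a sense of how a best-fit line turns a Supercomposite score into a Y/C forecast, here is a minimal sketch using ordinary least squares on one variable. The function names are hypothetical and the three data points are invented so the arithmetic is easy to follow; they are not results from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast_ypc(score, slope, intercept):
    """Project Yards Per Carry from a Supercomposite score."""
    return intercept + slope * score

# Invented Supercomposite scores and Y/C values, for illustration only.
composite_scores = [-1.0, 0.0, 1.0]
yards_per_carry = [3.5, 4.2, 4.9]

slope, intercept = fit_line(composite_scores, yards_per_carry)
print(forecast_ypc(0.5, slope, intercept))
```

With real data, the fitted slope and intercept would come from every back in the sample, and the forecast is only as good as the trait evaluations that feed the score.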
All in all, there’s a lot to digest here. It took me probably ten hours just to fully form my ideas behind this data before I wrote this article (which included plenty of additional digestion and adjustment as well). In brief, this part of my study should serve as a basis to say how much each factor of rushing production really contributes to the equation (via the Supercomposite and its weights) and a greater jumping-off point for the entirety of the series. I outlined all the directions that this series can go in the future, and this article was a necessary step in opening those doors.