Cross-posted from my blog:
Sabermetrics (noun): the analysis of baseball through objective evidence, especially baseball statistics.
Sabermetrics, a collection of formulas and math-rooted ideas that allows interested parties to obtain an improved understanding of baseball, is a topic about which I've blogged sporadically. The Hardball Times, Fire Joe Morgan, Baseball Prospectus, and Baseball Think Factory, among a bunch of other web sites, are excellent sources for sabermetric perspective.
Here, I discussed predictive baseball statistics. Cliffnotes: a team's adjusted Pythagorean record has better predictive value than its actual record, a hitter's predicted OPS has better predictive value than his actual OPS, and a pitcher's FIP has better predictive value than his ERA. These statements are facts because Pythagorean record, PrOPS, and FIP are less influenced by luck than won-loss record, OPS, and ERA. The idea is to isolate that of which players are in control and to remove that which is more or less random, because doing so yields information that's an improvement over traditional statistics with regard to predicting future performance.
For whatever reason, such thinking has yet to become popular in non-baseball circles. Sure, there's this (basketball's version of sabermetrics) and this (football's), but neither is as advanced or as accepted as the baseball version (to my knowledge, work in the sabermetric vein has yet to be performed in hockey, soccer, etc.). This is probably the case because baseball is an inherently better match with math and statistics than other sports are, but who knows.
Anyway, I caught myself wondering, the other day, about sabermetrics and tennis. A Google search of "sabermetrics tennis" does return this blog as its third result, but while semi-interesting, it's fluffy and doesn't offer much math.
I decided to do a little thinking on my own. The goal? To obtain an improved understanding of tennis (sound familiar?). Specifically, my aim was to manipulate tennis statistics in a way that caused them to have improved predictive value.
I mentioned Pythagorean record above. I asked myself, "Would it make sense to attempt to apply this concept to tennis?" At this point, I'm not completely sure of the answer to this question.
That said, I created a procedure to derive something resembling the tennis equivalent. Here are the steps I followed:
1. Find a player's profile at atptennis.com.
2. Open a new tab/window for the player's "YTD [year to date] match facts" link in the upper right portion of the screen.
3. Multiply the player's "Service Games Won %" number by his "Service Games Played" number.
4. Multiply the player's "Return Games Won %" number by his "Return Games Played" number.
5. Round the results of step three and step four to the nearest integers and sum these numbers.
6. Divide the result of step five by the sum of the player's "Service Games Played" number and his "Return Games Played" number.
7. Multiply the result of step six by the player's total number of matches (the sum of the player's wins [W] and the player's losses [L], as seen in his profile page).
8. Round the result of step seven to the nearest tenth in order to obtain a number to which I'll refer as Adjusted wins (Wa).
9. Multiply Wa by a multiplier (yikes, awkward use of language) of 1.309 (see below) in order to obtain a total for Pythagorean wins (Wp).
10. Subtract Wp from the player's total number of matches in order to obtain a total for Pythagorean losses (Lp; Pythagorean Record = Wp-Lp).
3. (.88)(546) = 480.48
4. (.29)(533) = 154.57
5. 480 + 155 = 635
6. 635/(546 + 533) = .5885078777
7. (.5885078777)(37 + 8) = 26.48285449
8. 26.5 = Wa
9. (26.5)(1.309) = 34.7
10. (37 + 8) - 34.7 = 10.3; Pythagorean record = 34.7-10.3
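For anyone who'd rather not do this by hand, steps three through ten can be sketched in a few lines of Python. This is a rough sketch, not a polished tool: the function name and parameter names are my own inventions, the percentages are assumed to be entered as fractions (.88 rather than 88), and rounding Wp to the nearest tenth in step nine is an assumption based on the worked example above.

```python
def pythagorean_record(serve_won_pct, serve_games, return_won_pct, return_games,
                       wins, losses, multiplier=1.309):
    """Return (Wa, Wp, Lp) by following steps 3-10 of the procedure.

    Percentages are fractions (e.g. .88). The default multiplier of 1.309
    is the top 10 average as of June 23, 2008; it would change with a
    different sample.
    """
    # Steps 3-5: estimate games won on serve and on return, round each to
    # the nearest integer, and sum. (Python's round() sends exact .5 ties
    # to the nearest even integer; hand rounding may differ on those.)
    games_won = round(serve_won_pct * serve_games) + round(return_won_pct * return_games)

    # Step 6: share of all games played that the player won.
    games_won_share = games_won / (serve_games + return_games)

    # Steps 7-8: scale by matches played and round to the nearest tenth.
    matches = wins + losses
    adjusted_wins = round(games_won_share * matches, 1)

    # Step 9: apply the multiplier (rounded to a tenth, as in the example).
    pyth_wins = round(adjusted_wins * multiplier, 1)

    # Step 10: whatever is left of the match total is Pythagorean losses.
    pyth_losses = round(matches - pyth_wins, 1)
    return adjusted_wins, pyth_wins, pyth_losses


if __name__ == "__main__":
    # The numbers from the worked example above.
    print(pythagorean_record(0.88, 546, 0.29, 533, 37, 8))
```

Fed the numbers from the worked example, this gives a Wa of 26.5 and a Pythagorean record of 34.7-10.3, matching the arithmetic above.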
Here are the results for the rest of the players currently ranked in the top 10 in the world as of Monday, June 23, 2008:
Observant viewers of the chart will understand my methodology. I divided each player's W-L percentage (1) by his Wa-La percentage (2) in order to obtain a ratio (column eight). I added up the 10 players' ratios and divided by 10 in order to derive a mean, which turned out to be 1.309. All that this means is that so far in 2008, the average top 10 player has a W-L percentage of 1.309 times his Wa-La percentage.
Taking that into account, in theory, top 10 players whose ratio of W-L percentage to Wa-La percentage is greater than 1.309 have been "lucky" (Federer, Nadal, Djokovic, Davydenko, Ferrer, Roddick, and Nalbandian fall into this category). Meanwhile, top 10 players whose ratio is less than 1.309 have been "unlucky" (Blake, Wawrinka, and Gasquet so far). NOTE THAT MY METHODOLOGY COMPLETELY IGNORES THE BELIEF THAT CERTAIN PLAYERS ARE MORE "CLUTCH" THAN OTHER PLAYERS!
In column nine (the rightmost column), I took each player's result from column eight and subtracted the column eight average (1.309, the multiplier I used to derive Pythagorean record). The result was each player's "Luck Score," the difference between his "luck" and the "luck" of the average top 10 player thus far in 2008.
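The column eight and column nine arithmetic can be sketched the same way. Again, this is only a sketch: the function names are made up for illustration, and the sample numbers are the ones from the worked example earlier in the post (37 wins, 8 losses, Wa of 26.5), not a reproduction of the chart.

```python
def luck_ratio(wins, losses, adjusted_wins):
    """Column eight: W-L percentage divided by Wa-La percentage."""
    matches = wins + losses
    actual_pct = wins / matches              # W-L percentage
    adjusted_pct = adjusted_wins / matches   # Wa-La percentage (La = matches - Wa)
    return actual_pct / adjusted_pct


def mean_ratio(ratios):
    """The multiplier: the plain average of the players' column eight values."""
    return sum(ratios) / len(ratios)


def luck_score(ratio, average_ratio=1.309):
    """Column nine: a player's ratio minus the group average.

    The default of 1.309 is the top 10 average quoted in this post.
    """
    return ratio - average_ratio


if __name__ == "__main__":
    ratio = luck_ratio(37, 8, 26.5)
    print(round(ratio, 3), round(luck_score(ratio), 3))
```

For the worked-example player, this prints a ratio of about 1.396 and a "Luck Score" of about +.087. Note that because both percentages share the same denominator (total matches), the ratio boils down to W divided by Wa.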
- At the risk of stating the obvious, MY METHODOLOGY IS EXTREMELY ROUGH. My hope is not for this stuff to be put into practice, but for it to inspire a portion of those who understand it to put their brains in motion (and to attempt to improve it).
- Richard Gasquet's column eight value is a nasty outlier and is skewing the results of a minuscule sample. The absolute value of Gasquet's "Luck Score" (.346) is not only the biggest of the scores in question, it's almost three times as big as the next-biggest (Nadal's, .116). There's perhaps merit for redoing the analysis without Gasquet, who has the reputation of being mentally weak and of having a tendency to "choke" away matches (the removal of Gasquet would surely knock Novak Djokovic [+.021] and David Ferrer [+.020] into negative "Luck Score" territory). Additionally, there is surely merit to performing this analysis again A) at the end of the year, when the numbers I used to derive Wa-La are larger and B) for the top 100 players rather than the top 10. Note that all three of these ideas would change the multiplier of 1.309.
- The order from "luckiest" to least "lucky" was Nadal (ranking: 2), Roddick (6), Nalbandian (7), Federer (1), Davydenko (4), Djokovic (3), Ferrer (5), Blake (8), Wawrinka (9), and Gasquet (10). Thus, while a quick glance at which numbers are green and which are red might indicate that there's a direct relationship between ranking points and "luck," "luck" actually seems to be somewhat "random" within this small sample. Two of the top three "luckiest" players (Roddick and Nalbandian) are out of the top five in the rankings. Furthermore, as I stated above, if not for Gasquet's ridiculously low "Luck Score," both Djokovic (3) and Ferrer (5) would have luck scores in the red.
- The content of the above bullet noted, there's no doubt that to some extent, my methodology is an oversimplification. In measuring "luck," I have made the assumption that every player in the top 10 has the same chance to win a close match as every other player in the top 10; I strongly doubt that this is the case (the best players have the most reliable money shots). I should note that baseball's adjusted Pythagorean record is also an oversimplification. I'm comfortable, however, making the statement that my tennis methodology is a more flagrant example of it.
- Certain players (like Andy Roddick and Pete Sampras) have mediocre break games but tend to win lots of 7-5 sets and lots of tiebreaks. Players like these will generally have Pythagorean records that are much worse than their actual W-L records and as a result will have inflated "Luck Scores" (this reminds me of knuckleball pitchers appearing "lucky" because they tend to allow very low batting averages on balls in play [see: the blog entry in which I discussed predictive baseball statistics, which is linked at the beginning of the article]). Given this, it would be reasonable to consider adjusting Pythagorean record further through the use of players' career tiebreak records. Of course, tennis statistics aren't even close to as widely maintained as baseball statistics, and I don't know where to find this information (the only page of tennis statistics of which I'm aware is this one).