ATP Tennis Cluster Analysis
Using cluster analysis to segment tennis playing styles
In recent years, almost every sport has been part of the analytics revolution. Analytics was thrust into the public eye by ‘Moneyball’, the story of Oakland A’s GM Billy Beane, who pioneered a mathematical approach to scouting, and has since spread far and wide: to basketball with the three-point revolution, to football with events like the “Big Data Bowl,” run by Michael Lopez, and even to the Premier League, where Liverpool, a league leader in analytics, took home the league championship.¹ Baseball was a natural starting point for the movement because it is characterized by 1 vs 1 matchups, a pitcher against a batter, which makes it easier to quantify individual value and to decouple the positive and negative effects of one’s teammates. One would think that similar innovations have taken place in tennis, another game characterized by 1 vs 1 interactions. However, tennis has lagged far behind other sports in analytics. The only advanced analytic measure tennis fans have been exposed to recently is IBM Watson’s ‘keys to the match’, which highlights the most important statistic for each player to secure a victory and likely comes from a tree-based approach. The progress of tennis analytics in the public domain can be credited entirely to Jeff Sackmann, the Bill James of tennis analytics.
Sackmann has been tirelessly collecting match statistics, charting matches using a custom coding procedure, and releasing data sets on GitHub and his site, Tennis Abstract, for years.² Furthermore, when Novak Djokovic added Craig O’Shannessy, a strategy coach who preaches the value of data in the sport, to his team, he gave a symbolic boost to those who have pursued tennis analytics passion projects in the past. As in other games that involve a 1 on 1 matchup, Elo ratings have been used to estimate the relative strength of players.³ However, as many who have played tennis competitively know, tennis is a game uniquely influenced by matchups. Having played tennis in college as a six-foot-seven big server, I loathed coming up against a smaller ‘grinder’ who would stand feet behind the baseline and put as many balls in play as possible without going for winners. I hypothesized that with Sackmann’s dataset and K-Means cluster analysis I would be able to find patterns in the different styles of play that characterize tennis, and ultimately determine which clusters had advantages over their counterparts.
Sackmann’s basic ‘box score’ dataset provides a single row for each match played, dating as far back as 1968. I chose to analyze files from 2011 onward, a relatively arbitrary starting point, but one that keeps the analysis relevant, as the game has shifted significantly in the last 20 years. The stats provide a basic look into each match with metrics such as aces, first serves in, and double faults. The data set does not provide any rally metrics such as rally length, whether the point was won on a winner, forced error, or unforced error, or whether it was won at the net. However, basic stats such as first serve percentage, percentage of service points won, and percentage of return points won give insight into the playing style and relative strengths of each player.
After loading the data into Python, I dropped the rows that had null values for any relevant statistic. Next, I created two rows for each match. The first row comprised the winner’s stats along with the unique ID that Sackmann provides for the winning player, and the second row followed the same process for the loser. This step was necessary for two reasons. First, the stats are organized by winner and loser (i.e., w_ace is the column for the winner’s aces and l_ace is for the loser’s aces), so in order to derive stats for each player, I had to map both sets of columns to a common set of per-player stats for the match regardless of outcome. Second, I had to sort by date and player ID in order to compute running totals, which were then used to calculate statistics such as first serve percentage after each match. To give you a feel for the data, below is a screenshot of Djokovic’s career serving statistics following his last available match, against Dominic Thiem in the ATP Tour Finals.
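For readers who want to follow along, the reshaping and running-total steps described above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the code used for the analysis. Only w_ace and l_ace are named in the text; the other column names (tourney_date, winner_id, loser_id, w_svpt, w_1stIn and their l_ counterparts) are assumptions about the file layout, and the helper names are made up for this example.

```python
import pandas as pd

# Toy stand-in for one yearly match file. Column names other than
# w_ace / l_ace are assumptions about the real layout.
matches = pd.DataFrame({
    "tourney_date": [20110103, 20110110, 20110117],
    "winner_id": [1, 2, 1],
    "loser_id": [2, 3, 3],
    "w_svpt": [60, 70, 50], "w_1stIn": [40, 35, 30], "w_ace": [5, 2, 8],
    "l_svpt": [65, 80, 55], "l_1stIn": [39, None, 33], "l_ace": [1, 0, 3],
})

STATS = ["svpt", "1stIn", "ace"]  # stat suffixes shared by the w_ and l_ columns


def to_player_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per player per match with outcome-neutral column names."""
    needed = [f"{side}_{stat}" for side in ("w", "l") for stat in STATS]
    df = df.dropna(subset=needed)  # drop matches missing any relevant stat
    frames = []
    for side, id_col, won in (("w", "winner_id", 1), ("l", "loser_id", 0)):
        cols = ["tourney_date", id_col] + [f"{side}_{stat}" for stat in STATS]
        part = df[cols].copy()
        part.columns = ["tourney_date", "player_id"] + STATS
        part["won"] = won
        frames.append(part)
    return pd.concat(frames, ignore_index=True)


def add_running_stats(rows: pd.DataFrame) -> pd.DataFrame:
    """Add career-to-date rates computed from per-player running totals."""
    # Matches within one tournament may share a date, so real data would
    # likely need an extra tiebreaker column in this sort.
    rows = rows.sort_values(["player_id", "tourney_date"]).reset_index(drop=True)
    totals = rows.groupby("player_id")[STATS].cumsum()
    rows["first_serve_pct"] = totals["1stIn"] / totals["svpt"]
    rows["ace_pct"] = totals["ace"] / totals["svpt"]
    return rows


player_rows = add_running_stats(to_player_rows(matches))
print(player_rows[["player_id", "tourney_date", "first_serve_pct", "ace_pct"]])
```

The last row for each player_id then holds that player's career-to-date rates, which is the kind of per-player summary a clustering step can consume.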

I computed the same statistics for each player’s return games, as well as general stats such as percentage of all points won and points per minute, which I thought would likely be correlated with rally length (which, as mentioned above, was not in the data). Surprisingly, even players like Novak Djokovic win only 55% of their points, which indicates a relatively small spread between the best players and average players, who by definition win 50% of their points. This means that a 1% improvement could be worth hundreds of thousands of dollars for many players.
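As a rough illustration of how these general stats can be derived from serve-side box score numbers alone, here is a small sketch. The definitions are my assumptions rather than anything confirmed by the data set: return points won are taken as the opponent's service points minus the points the opponent won on serve, and points per minute is total points divided by match duration. The ServeLine and match_rates names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServeLine:
    """Serve-side box score numbers for one player in one match."""
    svpt: int        # service points played
    first_won: int   # points won behind the first serve
    second_won: int  # points won behind the second serve


def match_rates(player: ServeLine, opponent: ServeLine, minutes: float) -> dict:
    """Derive serve, return, and overall rates for `player` in a single match."""
    serve_won = player.first_won + player.second_won
    # Every service point the opponent failed to win is a return point won.
    return_won = opponent.svpt - (opponent.first_won + opponent.second_won)
    total_points = player.svpt + opponent.svpt
    return {
        "serve_points_won_pct": serve_won / player.svpt,
        "return_points_won_pct": return_won / opponent.svpt,
        "total_points_won_pct": (serve_won + return_won) / total_points,
        "points_per_minute": total_points / minutes,
    }


rates = match_rates(ServeLine(60, 30, 12), ServeLine(70, 28, 14), minutes=100)
print(rates)
```

Because the two players' shares of total points in any match sum to 1, the average across all players is pinned at 50%, which is why a figure like 55% sits so close to the middle.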
Next, I used scikit-learn’s preprocessing module to standardize the data and then fed it to scikit-learn’s mini-batch K-Means clustering. I tried different numbers of clusters, ranging from 2 to 10, while looking for the ‘elbow’ in the graph (which seemed to be at 4 clusters). The ‘elbow’ method is a very subjective measure of the optimal number of clusters but was sufficient for my analysis (if you need a tune-up on clustering, check here).
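A minimal sketch of this step might look like the following. It assumes StandardScaler for the standardization and MiniBatchKMeans for the clustering, and it uses inertia (within-cluster sum of squares) as the quantity plotted for the elbow; the post does not spell out any of these choices. The feature matrix here is synthetic, four blobs standing in for the per-player stats.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-player feature matrix: four blobs in three
# dimensions. The real features would be the serve and return stats.
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5], [5, 0, 5]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in centers])

# Put every feature on the same scale so no single stat dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate cluster count and record its inertia.
inertias = {}
for k in range(2, 11):
    model = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3, batch_size=256)
    model.fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against k and looking for the point where the curve flattens is the usual way to read off the elbow; as noted above, that reading is subjective.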
The magic of clustering paid off yet again, and I was able to find four distinct playing styles within the data. The first cluster (cluster 0) was characterized by the highest ace percentage, the tallest players, and the highest probability of winning a point on first serve. They won around 50% of their points and played the largest share of their matches on hard and grass courts, which is to be expected. The next cluster (cluster 1) seemed to group mediocre players together. They won the lowest percentage of points, spread their matches somewhat evenly across all surfaces, and won the smallest share of their matches (38%). While this was the largest group, its players averaged the fewest matches in the data set, which likely means they were moving back and forth between the Challenger Tour and the pro tour, struggling to ever make it big. If I increased the number of clusters, I would guess that this group would be broken down at a more granular level. Next (cluster 2) we have the ‘all-courters’, which in the tennis world is essentially another way of saying the best individual players. Players like Federer, Nadal, and Djokovic are likely in this group, as the cluster collectively wins 53% of its points, gives up the fewest break point opportunities per game, and wins the most points on second serve. Finally, we have the clay court grinders (cluster 3). They play 37% of their matches on clay, which is the highest share by 5%, and have the lowest first serve percentage and the fewest points won on first serve, but they make up for it by generating the most break points per game on return. Here is a detailed chart of the various summary statistics for each cluster.
Next, I examined each cohort’s winning percentage against every other cohort, both in total and on each surface.
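One way to build such a cohort-versus-cohort table is sketched below. It assumes each match has already been tagged with the cluster label of the winner and of the loser, plus a surface string; the head_to_head function, the tuple layout, and the sample results are all made up for illustration.

```python
from collections import Counter


def head_to_head(results, surface=None):
    """Win rate of cluster a against cluster b, keyed by (a, b).

    results: iterable of (winner_cluster, loser_cluster, surface) tuples.
    surface: if given, only matches on that surface are counted.
    """
    wins = Counter()
    for winner, loser, surf in results:
        if surface is None or surf == surface:
            wins[(winner, loser)] += 1

    clusters = sorted({c for pair in wins for c in pair})
    table = {}
    for a in clusters:
        for b in clusters:
            if a == b:
                continue  # same-cluster matchups are 50% by construction
            played = wins[(a, b)] + wins[(b, a)]
            if played:
                table[(a, b)] = wins[(a, b)] / played
    return table


# Invented results, purely to exercise the function.
results = [
    (0, 3, "Hard"), (0, 3, "Grass"), (3, 0, "Clay"),
    (2, 0, "Hard"), (2, 3, "Clay"), (0, 3, "Hard"),
]
overall = head_to_head(results)
clay_only = head_to_head(results, surface="Clay")
print(overall)
print(clay_only)
```

Pairs that never met on a given surface are simply left out of the table rather than reported as 0%.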
Cluster 2, the ‘all-courters’, fared best against all others, and my fear of playing grinders was irrational, as the big servers actually won 56% of their matches against their cluster 3 counterparts. Also interesting was that those in cluster 0, the ‘big-hitters’, had a better chance of upsetting cluster 2 than the other cohorts did. Intuitively, this makes sense, as we have seen players like John Isner or Kevin Anderson get ‘hot’ for a tournament in which their power seems to be too much for anybody. On the other side of the spectrum, grinders, who usually cannot overpower opponents and have to rely on tactics, are far more consistent in their results. Below are graphs for hard, clay, and grass courts showing each group’s relative advantage over its total win % (court-specific win % - total win %), which further illuminate the big servers’ expertise on hard and grass courts and the grinders’ proficiency on clay.
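The relative advantage in those graphs is just a difference of two win rates, which can be sketched as below. The record layout (cluster, surface, won) and the sample numbers are invented for illustration and do not come from the actual results.

```python
from collections import defaultdict


def surface_advantage(records):
    """Map (cluster, surface) to surface win rate minus overall win rate.

    records: iterable of (cluster, surface, won) tuples, with won as 0 or 1.
    """
    overall = defaultdict(lambda: [0, 0])     # cluster -> [wins, matches]
    by_surface = defaultdict(lambda: [0, 0])  # (cluster, surface) -> [wins, matches]
    for cluster, surface, won in records:
        for tally in (overall[cluster], by_surface[(cluster, surface)]):
            tally[0] += won
            tally[1] += 1
    return {
        (cluster, surface): wins / played - overall[cluster][0] / overall[cluster][1]
        for (cluster, surface), (wins, played) in by_surface.items()
    }


# Invented sample: cluster 0 goes 3-1 on hard courts and 1-3 on clay.
records = (
    [(0, "Hard", 1)] * 3 + [(0, "Hard", 0)]
    + [(0, "Clay", 1)] + [(0, "Clay", 0)] * 3
)
advantage = surface_advantage(records)
print(advantage)
```

A positive value means the cluster does better on that surface than it does overall, and a negative value means it does worse.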

While this serves as a starting point, Jeff Sackmann has also published point-by-point data from the Grand Slams dating back to 2011, which will give further insight into rally metrics. These can hopefully further separate the clusters, distinguishing more offensive, net-minded players from those content to rally for 10+ shots from the baseline. Look for that in Part 2 in the upcoming weeks.
- Schoenfeld, Bruce. How Data (and some Breathtaking Soccer) Brought Liverpool to the Cusp of Glory. https://www.nytimes.com/2019/05/22/magazine/soccer-data-liverpool.html
- Sackmann, Jeff. GitHub home page. https://github.com/JeffSackmann
- Tennis Abstract Elo Ratings. http://tennisabstract.com/reports/atp_elo_ratings.html
- Sackmann, Jeff. Measuring the Impact of Break Points. http://www.tennisabstract.com/blog/2019/01/04/measuring-the-impact-of-break-points/