Football data analysis originated in the 1950s. The first person to apply statistical analysis to football matches was Charles Reep, often called the “first football data analyst.” Born in 1904, Reep graduated from Plymouth High School and trained as an accountant. In 1928 he sat the entrance exam for the newly established Royal Air Force Accounting Branch, ranked first, and joined the RAF, where he served until retiring in 1955. He was a passionate Plymouth Argyle fan, and during his RAF service he attended many lectures on football tactics.
In March 1950, while watching a home game of Swindon Town, Reep became disappointed with the slow pace of the game and the marginalized wingers. After witnessing the home team’s unproductive attacking performance in the first half, he decided to take notes during the second half.
From then on, he regularly attended football matches to take notes. From his analysis he inferred that a team at that time scored two goals per game on average, but that a slight improvement in the conversion rate could raise this to three. His analysis caught the attention of Brentford manager Jack Gibbons, and from February 1951 until the end of the season Reep worked for Brentford as a part-time consultant. There were 14 league games left, and Brentford was at risk of relegation. After Reep’s arrival, Brentford’s average goals per game increased from 1.5 to 3, and the team earned 20 of the 28 points still available, moving from 16th place to 9th in the league.
Later, Reep shared his analysis in the “News Chronicle.” He concluded that most goals came from moves of three or fewer passes, and he emphasized the importance of moving the ball quickly toward the opponent’s goal: the more directly the ball was played forward, the fewer passes were needed and the more goals were scored. This theory later became known as “long-ball tactics.” Reep also argued that winning possession near the opponent’s goal led to more goals than complex passing moves starting in one’s own half. The ball should therefore be moved forward quickly, and if possession was lost, the team should try to regain it quickly. This resembles what we now know as “high pressing.”
One of the matches Reep meticulously recorded and analyzed was the “Match of the Century” between England and Hungary at Wembley in 1953. At that time, Hungary had already won the gold medal in the football competition at the 1952 Helsinki Olympics and maintained an undefeated streak of 31 games, aiming for the 1954 World Cup championship. In the end, Hungary beat England 6-3.
In line with Reep’s earlier analysis, four of Hungary’s goals originated from winning possession near England’s goal. Despite his disappointment with the result, he felt that his analysis had been validated. What he did not know was that the Hungarian Football Association had itself been analyzing football for over two decades, with a particular focus on optimizing shot positions. The Hungarian coach at that time, Gustav Sebes, kept detailed notes on formations and tactics, while the Hungarian Football Association maintained files of match and training data to help the team reach its optimal state.
Following the painful defeat, the English Football Association also began actively seeking new insights into football and started recording match data extensively. By the late 1950s, the English FA would often deploy dozens of data analysts at a single match, recording game data with pencils and notebooks.
Reep’s work at Brentford caught the attention of Wolverhampton Wanderers, one of the strongest English teams of the time, who invited him to join as a data analyst. On April 24, 1954, Wolves won the league championship by defeating Tottenham Hotspur 2-0, and Reep’s work was once again validated. At the team’s celebration banquet, Wolves manager Stan Cullis specifically thanked their analyst, Charles Reep. Wolves went on to win the league championship again in 1958 and 1959.
Afterward, Charles Reep was employed by several other teams, including Sheffield United, Coventry City, Plymouth Argyle, Stoke City, Chesterfield, and Cambridge United. He gradually faded from the mainstream of English football until the 1980s. He had stepped back because, after more than 20 years of development, his theory had begun to face scrutiny. Jonathan Wilson, the author of “Inverting the Pyramid,” pointed out that in the games Reep studied, 91.5% of attacking moves consisted of three or fewer passes. If such moves were as effective as any other, about 91.5% of goals should have come from them. In reality, fewer than 80% of goals did. Wilson accused Reep of building a flawed football philosophy on misinterpreted data, and his argument fueled criticism of “long-ball tactics.” Despite the later controversies surrounding Reep’s work, his contribution to modern football data analysis remains significant: his methods laid the foundation for subsequent football statistics and data analysis.
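Wilson’s objection can be made concrete with a short calculation. The sketch below uses only the two percentages quoted above and treats “fewer than 80%” as exactly 80%, which is an assumption that favors Reep. The function name is illustrative, not taken from Wilson or Reep.

```python
# Shares quoted in the text. 0.80 is an upper bound ("fewer than 80%"),
# so the real gap is at least as large as the one computed here.
SHARE_OF_MOVES_SHORT = 0.915   # moves with three or fewer passes
SHARE_OF_GOALS_SHORT = 0.80    # goals that came from those moves


def relative_efficiency(share_of_moves: float, share_of_goals: float) -> float:
    """Goals per move for one category, relative to the overall average.

    1.0 means the category scores as often as the average move;
    below 1.0 means it scores less often, above 1.0 more often.
    """
    return share_of_goals / share_of_moves


short = relative_efficiency(SHARE_OF_MOVES_SHORT, SHARE_OF_GOALS_SHORT)
long_ = relative_efficiency(1 - SHARE_OF_MOVES_SHORT, 1 - SHARE_OF_GOALS_SHORT)

print(f"short moves: {short:.2f}x the average scoring rate")
print(f"long moves:  {long_:.2f}x the average scoring rate")
print(f"long vs short: {long_ / short:.2f}x")
```

With these inputs, short moves score less often than the average move and longer moves score considerably more often, which is the opposite of the conclusion Reep drew from the same data.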
The landmark in the modern application of football match data was the founding of Opta in 1996. Opta was the world’s first company with football event data as its core business. Its emergence popularized the use of football match data, drove the technology behind it, and made Opta the authority on how match data is defined. Opta started as a small company recording and analyzing data for the English Premier League, and in the 1997-98 season it became the league’s official data provider. After years of development and mergers, Opta has grown into the world’s largest sports data collection and service provider. Its business has expanded from recording and analyzing sports data to a range of services built around sports data analysis as its core strength.
Opta now provides services to sports media, broadcasting companies, technology firms, lottery organizations, clubs, and league providers. It covers statistics and information for over 30 sports across approximately 70 countries.
In its early days, Opta relied entirely on manual statistics. On-site data collectors recorded information with pen and paper. This method could capture only a limited amount of data in time, restricted to key information such as shots, corner kicks, goals, substitutions, and cards. The accuracy was mediocre, and the records could give only approximate times and the individuals involved in key events.
With the development of information technology, in 2001 Opta took the lead in phasing out pen-and-paper statistics and began moving to a computer-based collection system that was still primarily operated by people. The initial system laid a grid representation of the football field as a semi-transparent overlay on the live broadcast video. The work of data collectors was no longer as simple as recording key events, as it had been in the pen-and-paper era.
Passing routes became the core of their work. Data collectors used the mouse to record each pass, dragging from its starting point to its endpoint and marking who received it.
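A pass captured this way carries a start point, an end point, the passer, and the receiver. The sketch below shows one way such a record could be represented and how consecutive passes could be grouped into moves. The `PassEvent` type, its field names, and the grouping rule are illustrative assumptions, not Opta’s actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassEvent:
    """One recorded pass; coordinates are grid cells on the pitch overlay."""
    team: str
    passer: str
    receiver: str
    start: tuple[int, int]   # (column, row) where the drag began
    end: tuple[int, int]     # (column, row) where the drag ended


def split_into_moves(passes: list[PassEvent]) -> list[list[PassEvent]]:
    """Group consecutive passes into moves.

    A move continues while the next pass is made by the same team and
    by the player who received the previous pass (assumed rule).
    """
    moves: list[list[PassEvent]] = []
    for p in passes:
        last = moves[-1][-1] if moves else None
        if last is not None and last.team == p.team and last.receiver == p.passer:
            moves[-1].append(p)
        else:
            moves.append([p])
    return moves


# Hypothetical log: two home passes in a row, then an away pass.
log = [
    PassEvent("home", "A", "B", (2, 5), (6, 4)),
    PassEvent("home", "B", "C", (6, 4), (11, 7)),
    PassEvent("away", "X", "Y", (12, 7), (9, 3)),
]

moves = split_into_moves(log)
for move in moves:
    # Print the team and the chain of players the ball went through.
    print(move[0].team, [p.passer for p in move] + [move[-1].receiver])
```

Counting the length of each move in such a log is the kind of passes-per-move statistic that earlier had to be tallied by hand.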
However, the 2D grid could not be matched precisely to the picture of the field in a broadcast, which is filmed from a third-person perspective. When data collectors recorded an event, the mouse often landed some distance from the position of the ball in the live image. The size of that gap depended entirely on the collector’s experience and subjective judgment, which led to significant errors in the recorded data. This was not the only source of controversy during this period. Judgments on technical actions such as dribbles, tackles, and errors also depended heavily on how the data collection team understood them.
Even so, during this period core data such as passing routes was no longer recorded as isolated entries. Each pass could be linked to the events before and after it, giving data analysts, coaches, and data enthusiasts far more room for analysis.
After 2010, optical recognition technology and player tracking systems began to be widely used in football match data collection (although most companies in this business had been founded before 2005). SportVU and Prozone were the first two companies to master the core technology in this field. Both are now part of the Stats Perform group, along with Opta.
Both SportVU and Prozone can collect ball and player position data in real time during a match through optical recognition technology and player tracking systems, without manual intervention. Such a system can generate millions of data points and record 2,000 to 3,000 events in a single football match. Work that was originally done by hand has largely been taken over by computer vision (CV), a form of artificial intelligence (AI) technology. However, despite the widespread use of optical recognition and wireless positioning technologies in professional matches and team training, manual recording remains an irreplaceable part of match data collection.
Advanced systems can automatically identify basic information such as who is on the field, where each player is, and how fast each is moving, but they still struggle with players’ technical actions, key events, and qualitative judgments, such as shots, passes, offside goals, fouls, and errors. Data collection therefore still needs human assistance. Even with future technological advances, if certain events are hard to define in computer terms, the final judgment will still rest with people.
Currently, for the majority of professional football matches, Opta is the sole data collector and provider. Various organizations interpret and process the data collected by Opta and present it in different visualizations. Each of these data interpretation service providers has its own understanding of the data, and those differences ultimately show up as variations in the figures each one publishes. Such companies include WhoScored, Wyscout, Transfermarkt, SofaScore, Squawka, Soccerway, and more.
For instance, WhoScored is a professional football data website that offers both free data and rich data visualization services. It provides users with public data and analysis for most top football competitions, covering basic data displays, indicator analysis, technical event analysis, activity heat maps, and even post-match replays of critical moments provided by dedicated analysts.
Transfermarkt, by contrast, is a website primarily focused on player transfer fees, player valuations, transfer-related information, and football rumors. On the website, users can find the latest transfer news, record transfers, contract extensions, player valuations, and other key information.
There are also data interpretation and visualization service providers, such as Wyscout, whose business is built around paid services. Wyscout offers the largest football video and data database, including information on over 550,000 players and more than 200 leagues and tournaments. Wyscout has built numerous custom data models and applies the information they produce extensively in its platform, reports, and APIs. It offers in abundance the advanced events that typical data interpretation service providers find hard to supply, such as third assists, goals conceded, interceptions, passes into the penalty area, and covering teammates. These types of data significantly enrich the dimensions along which players can be evaluated.
Wyscout’s data collectors also segment each match into over 2000 labeled video clips, each tagged with the key players and technical events it contains. Currently, Wyscout’s database stores information from over 4 million matches, covering competitions from the top five European leagues (Premier League, Bundesliga, La Liga, Ligue 1, Serie A) to the most important youth tournaments worldwide. When coaches or analyst teams want to study opposing players or unearth potential talents, they can go beyond performance data and use the clips to quickly watch players’ actual match performances, including highlights and mistakes. This significantly improves the efficiency of scouting and research.
In recent years, FIFA has been actively promoting the application of digital technology in football, with two well-known examples being the Video Assistant Referee (VAR) and Semi-Automated Offside Technology (SAOT). VAR is sometimes mistaken for a single technology, but it is in fact a comprehensive technical solution that includes human involvement. The VAR team consists of four Video Assistant Referees (wearing green, dispatched by FIFA) and four Replay Operators (wearing black, dispatched by the company Hawk-Eye). All Video Assistant Referees are top-level FIFA video match officials. Replay Operators select and provide the best camera angles in response to the referees’ requests.
Semi-Automated Offside Technology (SAOT) uses 12 specialized tracking cameras installed above the field to track the ball and 29 data points on each player, calculating their precise positions on the field 50 times per second. The 29 data points include all body parts relevant to offside decisions.
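To give a sense of what the tracking data described above makes possible, the sketch below estimates the data rate and runs a simplified offside-position check. It is a one-dimensional illustration under stated assumptions (22 players on the pitch, only the coordinate along the length of the field, a team attacking toward larger x); the function name and sample values are hypothetical and this is not how the real system is implemented.

```python
POINTS_PER_PLAYER = 29     # from the text
SAMPLES_PER_SECOND = 50    # from the text
PLAYERS_ON_PITCH = 22      # assumption: both teams at full strength

# Player data points the system has to locate every second.
points_per_second = PLAYERS_ON_PITCH * POINTS_PER_PLAYER * SAMPLES_PER_SECOND


def is_in_offside_position(
    attacker_points_x: list[float],
    defenders_points_x: list[list[float]],
    ball_x: float,
    halfway_x: float = 0.0,
) -> bool:
    """Simplified offside-position check for a team attacking toward larger x.

    Each list holds the x coordinates (metres) of one player's tracked
    body points that count for offside. Needs at least two defenders.
    """
    attacker_front = max(attacker_points_x)
    # Each defender is placed by the body point nearest their own goal line.
    deepest_first = sorted((max(p) for p in defenders_points_x), reverse=True)
    second_last_opponent = deepest_first[1]
    # Being level with the ball or the second-last opponent is not offside.
    return (
        attacker_front > halfway_x
        and attacker_front > ball_x
        and attacker_front > second_last_opponent
    )


# Hypothetical positions at the moment the ball is played.
keeper = [51.8, 52.0, 52.3]
centre_back = [30.2, 30.4, 30.6]
full_back = [24.9, 25.3]
defenders = [keeper, centre_back, full_back]

striker_ahead = is_in_offside_position([30.1, 30.5, 30.9], defenders, ball_x=18.0)
striker_level = is_in_offside_position([29.8, 30.2, 30.6], defenders, ball_x=18.0)
print(points_per_second, striker_ahead, striker_level)
```

The check only reports an offside position at one instant. Deciding which instant to evaluate, and whether the player is actually involved in play, is a separate question.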
Another crucial element in offside detection is the Inertial Measurement Unit (IMU), a sensor located at the center of the match ball. It sends data about the ball to the video operation room 500 times per second, allowing the moment the ball is played to be detected precisely.
With the support of these technologies and artificial intelligence, the data collected in football matches has become increasingly rich and detailed. However, football matches are highly complex systems, and data collection and analysis still have a long way to go. Many experts in the field of football data emphasize the need for ongoing research.
For example, Sarah Rudd, a former data scientist at Microsoft and a data analyst at Arsenal for nearly a decade, expressed envy for the vast telemetry data generated in motorsports. Such data can assist teams in making improvements and enhancing performance. In an interview, she mentioned, “We often watch F1 races, and it would be fantastic if football teams could have that kind of data.” She also stated, “There are still many things in football that need to be measured or are currently being measured, but we are not yet sure how to analyze them.”