Consistency in Data Science

I used Kaggle competitions to measure internal validity in data science. Validity is an issue because it’s easy to get good predictions with a long feature list. Researchers know about this problem; managers don’t. But managers have the data and researchers don’t. Fortunately, managers sometimes release it into the wild, as in those Kaggle competitions. So let’s see whether predictions based on this data remain consistent.

Kaggle has a handy rule for detecting overfitting:

Kaggle competitions are decided by your model’s performance on a test data set. Kaggle has the answers for this data set, but withholds them to compare with your predictions. Your Public score is what you receive back upon each submission (that score is calculated using a statistical evaluation metric, which is always described on the Evaluation page). BUT: Your Public Score is being determined from only a fraction of the test data set — usually between 25-33%. This is the Public Leaderboard, and it shows some relative performance during the competition.

When the competition ends, we take your selected submissions (see below) and score your predictions against the REMAINING FRACTION of the test set, or the private portion. You never receive ongoing feedback about your score on this portion, so it is the Private leaderboard. Final competition results are based on the Private leaderboard, and the Winner is the person(s) at the top of the Private Leaderboard.

Teams can’t win by submitting a lucky model that did well on the public set, for example by building a million models with different parameters and then choosing the best fit. Instead, consistent solutions must perform well on both public and private sets. This is the validity that makes the model useful.
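To see why the private set catches lucky models, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the "models" are coin flips with no predictive power, the labels are random, and the set sizes and the function name `best_of_random_models` are made up. It is not how Kaggle scores submissions, only a toy version of the public/private split.

```python
import random

def accuracy(preds, truth):
    """Share of predictions that match the true labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def best_of_random_models(n_models=1000, n_public=100, n_private=300, seed=0):
    """Pick the random 'model' with the best public score; return both of its scores."""
    rng = random.Random(seed)
    public_truth = [rng.randint(0, 1) for _ in range(n_public)]
    private_truth = [rng.randint(0, 1) for _ in range(n_private)]
    best_public, best_private = -1.0, 0.0
    for _ in range(n_models):
        # A "model" here is pure coin flipping: it has no predictive power at all.
        public_preds = [rng.randint(0, 1) for _ in range(n_public)]
        private_preds = [rng.randint(0, 1) for _ in range(n_private)]
        pub = accuracy(public_preds, public_truth)
        if pub > best_public:
            best_public = pub
            best_private = accuracy(private_preds, private_truth)
    return best_public, best_private

public_score, private_score = best_of_random_models()
print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

The model selected on the public set looks clearly better than chance there, while its private score should fall back to roughly 50%.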

I scraped both public and private leaderboards from 165 competitions. Correlation between the public and private scores for popular competitions:


Perfectly consistent solutions would have similar scores on both public (horizontal axis) and private (vertical axis) leaderboards. We would see a straight line. It’s not very straight in some competitions. Points away from the diagonal indicate that solutions don’t digest the new data well and their predictive power declines.
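The "straight line" can be summarized with a correlation coefficient. The sketch below computes a plain Pearson correlation between public and private scores. The five score pairs are hypothetical, not taken from the scraped leaderboards, and the function name `pearson` is only illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical public and private scores for five teams in one competition.
public_scores = [0.91, 0.90, 0.89, 0.88, 0.85]
private_scores = [0.84, 0.89, 0.80, 0.87, 0.84]

r = pearson(public_scores, private_scores)
print(f"public vs private correlation: {r:.2f}")
```

A value close to 1 would correspond to the straight line; values near zero mean the public score says little about the private one.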

Correlation for places:


This plot illustrates individual skill. When a data scientist gets a high score by luck, he won’t retain the position on the private leaderboard. Otherwise, he retains the position, and if others do well too, they form a straight line. Again, we don’t see the straight line in some cases.
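For places rather than scores, a rank correlation is a natural summary. This is a sketch of Spearman's coefficient under the assumption that there are no ties. The places in the example are hypothetical, and the helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation using the no-ties shortcut formula."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical public and private places for six teams.
public_places = [1, 2, 3, 4, 5, 6]
private_places = [40, 3, 25, 2, 5, 7]

rho = spearman(public_places, private_places)
print(f"rank correlation: {rho:.2f}")
```

Teams that hold their position push the coefficient toward 1; lucky public leaders who drop far down pull it toward zero or below.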

What are those “some cases”? One is “restaurant revenue prediction”: predicting revenues for restaurants given geography and demographics. That’s a typical business problem in the sense that the company has few observations and many determinants. Data analysis can’t help here until the company gets the data on thousands of other restaurants. McDonald’s or Starbucks can get more; smaller chains can’t.

“The Analytics Edge” competition, an MIT course homework on predicting successful blog posts, also suffers from too many factors affecting the outcome.

Sometimes limitations exist by design. Kaggle is running a stock price prediction competition now, but the suggested data can’t do the job. Algorithmic trading relies on handpicked cases with unique data models, and the competition offers just the opposite.

How the same data scientists perform across different competitions:


Yes, we should find more straight lines, but they are not here. Instead, there are dense spots around the bottom left corners. Those are teams that broke into the top 100 on many occasions. They sort of did well without domain knowledge. However, where domain experts could be identified, they did very well, as in this competition sponsored by an Internet search company.

Many problems remain unfriendly to quants, so solutions may be valid but not powerful. This can be fixed with more information, but other approaches often take over. For example, insiders remain the best investors in the restaurant business. A person runs a local restaurant for ten years. He knows the competitors, prices, costs, margins, clients. Of course, he is a better investor than the chain owner, even if the chain owner has a formal model. Markets work well here and centralized analysis doesn’t.

Kaggle Challenges and the Value of Data Science

The impact of data on business outcomes is covered with buzzwords. The people in the loop say real things sometimes (examples here), but there’s a twist. Vendors pick only the best cases that sell their stuff, and their clients conceal successes to leave competitors guessing.

Let’s turn to Kaggle for balanced statistics. The Kaggle competitions put participants in the same conditions, which allows for easy comparison. The website maintains public and private leaderboards for each competition, based on test data. I use the set of public leaderboards available here.

Businesses hire many data scientists now. The first interesting question to put to the data is: should I select talent carefully or hire people fast? Here’s a test: let’s look at the winning margins at the top of leaderboards. If they’re large, then the skill premium may be large as well, so it’s worth looking for better candidates and paying them more. This is the answer in one chart:


Each line represents a competition. The y-scale shows the final score of a participant as a fraction of the winner’s score. The score is a statistical metric reflecting the quality of, typically, a prediction of interest, such as revenues, votes, or purchases. In some cases a higher score is better; in others, it’s the opposite. Lines move in the respective directions.
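The normalization behind the y-scale can be sketched in a few lines. The scores below are hypothetical, and the `higher_is_better` flag is an assumption about how the metric direction would be recorded per competition.

```python
def relative_to_winner(scores, higher_is_better=True):
    """Express each final score as a fraction of the winner's score."""
    winner = max(scores) if higher_is_better else min(scores)
    # Assumes the winner's score is non-zero; a zero error would need special handling.
    return [score / winner for score in scores]

# Hypothetical accuracy-like scores (higher is better).
print(relative_to_winner([0.8, 0.6, 0.4]))

# Hypothetical error-like scores (lower is better): values rise above 1.0.
print(relative_to_winner([2.0, 2.5, 4.0], higher_is_better=False))
```

With a higher-is-better metric the fractions fall below 1 as you move down the leaderboard; with an error metric they rise above 1, which is why the lines in the chart move in both directions.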

A single leaderboard from that chart may look like this (insurance-related competition):


This case is slightly unusual because it has distinctive leaders with large leads. Still, those who try (the red dots) eventually succeed. The problem is, very few do try:


In 4,000 cases, a team submitted only a single solution in a competition. Really serious attacks on the problem start with 10+ submissions, which few teams make.
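Counts like these come from tallying submissions per team and competition. The sketch below assumes a hypothetical log with one `(competition, team)` pair per submission; the names and the 10-submission threshold parameter are illustrative, and the actual leaderboard data may be structured differently.

```python
from collections import Counter

# Hypothetical submission log: one (competition, team) pair per submission.
submission_log = (
    [("comp-a", "team-1")] * 12
    + [("comp-a", "team-2")]
    + [("comp-b", "team-1")] * 3
    + [("comp-b", "team-3")]
)

def submission_profile(log, serious_threshold=10):
    """Count entries with one submission and entries at or above the threshold."""
    per_entry = Counter(log)  # submissions per (competition, team)
    single = sum(1 for n in per_entry.values() if n == 1)
    serious = sum(1 for n in per_entry.values() if n >= serious_threshold)
    return single, serious, len(per_entry)

single, serious, entries = submission_profile(submission_log)
print(f"{single} of {entries} entries made one submission, {serious} made 10+")
```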

Despite this, many participants end close to the winner:


To look at individual performance from a different perspective, I compare how the same users finished in different competitions:


These five races involved 500+ users each, and some users overlap. The overlap shows the Kaggle core: the people who compete regularly and finish high (bottom-left corners of each subplot). Elsewhere, the relationships are weak.

This modest evidence suggests that people matter less and commitment more.

Does time matter? I take the mean scores by the number of days remaining until the last submission:


This data comes from attempts to predict lemons at car auctions. A higher score is better here, and you can see that additional submissions don’t improve the quality of an average submission. The leaders do improve slowly, however. Data scientists quickly pick the low-hanging fruit in the available data and then fight for small improvements at a large cost in time. For one example, read this detailed journal by Kevin Markham.
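The grouping behind such a plot can be sketched as follows. This assumes each submission is a `(date, score)` pair and that "days remaining" is counted back from the latest submission date in the data. The three submissions are hypothetical, and the replication files may do this differently.

```python
from collections import defaultdict
from datetime import date

def mean_score_by_days_remaining(submissions):
    """Average submission score for each number of days before the last submission."""
    last_day = max(day for day, _ in submissions)
    buckets = defaultdict(list)
    for day, score in submissions:
        buckets[(last_day - day).days].append(score)
    # Order from the earliest submissions (most days remaining) to the last day.
    return {
        days: sum(scores) / len(scores)
        for days, scores in sorted(buckets.items(), reverse=True)
    }

# Hypothetical submissions: (submission date, score).
submissions = [
    (date(2012, 1, 1), 0.25),
    (date(2012, 1, 1), 0.5),
    (date(2012, 1, 3), 0.5),
]
means = mean_score_by_days_remaining(submissions)
print(means)
```

Replacing the mean with the maximum per day would trace the leaders instead of the average submission.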

A typical disclaimer would mention various limitations of these plots for decision making or of Kaggle competitions for real cases. Yes, while hiring, you need to know more than this. I would emphasize a different thing. Managers like intuitive decisions and confirm them with favorable evidence, including statistical insights. But having numbers this way isn’t the same as thinking that starts from numbers. Most businesses can get almost nothing from data scientists before their managers start thinking from numbers, not to numbers. And this transition from intuition to balanced evidence yields more than improving a single prediction by the few percentage points mentioned here.

Data and replication files on GitHub

Athletes vs. data scientists

Competitions among athletes have quite a long history. Armchair sports don’t. Chess, which comes to mind first, became an important sport, but only in the 20th century.

An even younger example is data-related competitions. Kaggle, CrowdANALYTIX, and HackerRank are major platforms in this case.

But do data scientists compete as furiously as athletes? Well, in some cases, yes. Here’s one example:

(see appendix for how the datasets were constructed)
The Merck and Census competitions have about the same number of participants and comparable rewards (but winners for the Census competition were restricted to US citizens only). It may seem surprising that their results look so different. I’ll get back to this in the next post on data competitions.
Technically, all the competitions look alike. The lower bound is zero (minutes, seconds, errors), though only the baseline comparison makes sense. Over time, the baseline for sports declined:
(Winning time for 100m. Source.)
A two-second (-18%) improvement in 112 years.
Competitions in a single dataset look like this (more is better):
(Restricted sample taken from
In general, the quality of predictions increases substantially over the first few weeks. Then marginal returns on effort decrease. That’s interesting because participants make hundreds of submissions to beat numbers three places beyond the decimal point. That’s a lot of work for a normally modest monetary reward. And, well, the monetary reward makes no sense at all. A prize of $25–50K goes to winners who compete with 250 other teams. These are thousands of hours of data analysis, basically unpaid. This unpaid work doesn’t sound attractive even to sponsors (hosts), which are very careful about paying for crowdsourcing. So, yes, it’s a sport, not work.
Athletics has no overfitting, but that’s an issue in data competitions. For example, comparison between public and private rankings for one of the competitions:

Username            Public rank   Private rank
Jared Huling        1             283
Yevgeniy            2             7
Attila Balogh       3             231
Abhishek            4             6
Issam Laradji       5             9
Ankush Shah         6             11
Grothendieck        7             50
Thakur Raj Anand    8             247
Manuel Días         9             316
Juventino           10            27


The public rank is computed from predictions on the public dataset. The private rank is based on a different sample unavailable before the finals. The difference is largely attributed to overfitting noisy data (and submitting best-performing random results).
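The size of the reshuffle is easy to quantify from the table. The sketch below copies the ten rows from the table above and computes how many places each public top-10 team lost on the private leaderboard. The variable and function names are illustrative.

```python
# Public and private ranks copied from the table above: (username, public, private).
leaderboard = [
    ("Jared Huling", 1, 283),
    ("Yevgeniy", 2, 7),
    ("Attila Balogh", 3, 231),
    ("Abhishek", 4, 6),
    ("Issam Laradji", 5, 9),
    ("Ankush Shah", 6, 11),
    ("Grothendieck", 7, 50),
    ("Thakur Raj Anand", 8, 247),
    ("Manuel Días", 9, 316),
    ("Juventino", 10, 27),
]

def rank_drops(rows):
    """Places lost between the public and the private leaderboard."""
    return {name: private - public for name, public, private in rows}

drops = rank_drops(leaderboard)
stayed_in_top10 = [name for name, _, private in leaderboard if private <= 10]
worst = max(drops, key=drops.get)

print(f"stayed in the private top 10: {stayed_in_top10}")
print(f"largest drop: {worst}, {drops[worst]} places")
```

Only three of the ten public leaders remain in the private top 10, and four of them fall by more than 200 places.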
In data competitions, your training is not equal to your performance. That’s valid for sports as well. Athletes break world records during training sessions and then finish far away from the top in real competitions.
This has a purely statistical explanation, apart from psychology. In official events, the sample is smaller: mostly a single trial. Several trials are allowed only in non-simultaneous sports, like the high jump. The sample is many times larger during training. And you’re more likely to find an extreme result in a larger sample.
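A small simulation makes the sample-size argument concrete. It assumes, purely for illustration, that an athlete's 100m times are normally distributed around 10.0 seconds with a standard deviation of 0.1 seconds, and that a training period contains 200 runs against a single run in the official race. None of these numbers come from real data.

```python
import random

def average_best_time(n_trials, n_repeats=2000, mean=10.0, sd=0.1, seed=1):
    """Average of the best (lowest) time achieved across n_trials runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        total += min(rng.gauss(mean, sd) for _ in range(n_trials))
    return total / n_repeats

single_race = average_best_time(n_trials=1)    # one official attempt
training = average_best_time(n_trials=200)     # best of many training runs

print(f"typical race time:       {single_race:.2f} s")
print(f"typical best in training: {training:.2f} s")
```

The same athlete, with unchanged ability, posts a noticeably better personal best in training simply because there are more draws to pick the extreme from.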
Anyway, though these competitions look like fun and games, they’re also simple models for understanding complex processes. Measuring performance has value for human lives. For instance, hiring in most firms is a single-trial interview. And HR folks use simple heuristic rules for candidate assessment. When candidates are nervous, they fail their trial.
Some firms, like major IT companies, do more interviews. Not because they want to help candidates, but because they have more stakeholders whose opinion matters. But this policy increases the number of trials, so these companies hire smarter.
We don’t have many counterfactuals for HR failures, but we can see how inefficient single trials are compared to multiple trials in sports.
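The cost of single trials can be illustrated with a toy model. All of it is assumed: two candidates whose true skill differs by half a standard deviation of interview noise, interview scores equal to skill plus normal noise, and the firm hiring whoever has the higher total score. The function name and parameters are hypothetical.

```python
import random

def share_of_correct_hires(n_interviews, n_sims=5000, skill_gap=0.5,
                           noise_sd=1.0, seed=2):
    """Share of simulations in which the stronger candidate gets the higher total."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_sims):
        # Each interview score is true skill plus independent noise.
        strong = sum(skill_gap + rng.gauss(0, noise_sd) for _ in range(n_interviews))
        weak = sum(rng.gauss(0, noise_sd) for _ in range(n_interviews))
        correct += strong > weak
    return correct / n_sims

one_interview = share_of_correct_hires(1)
five_interviews = share_of_correct_hires(5)
print(f"1 interview:  stronger candidate hired in {one_interview:.0%} of cases")
print(f"5 interviews: stronger candidate hired in {five_interviews:.0%} of cases")
```

Under these assumptions a single interview picks the stronger candidate only somewhat more often than a coin flip, and adding trials raises that share.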

Appendix: The data for the first graph

This graph was constructed in the following way.

First, I took the data for major competitions:

  • Athletics, 100m, men. 2012 Olympic Games in London. Link.
  • Biathlon, 10km, men. 2014 Olympic Games in Sochi. Link.
  • Private leaderboard. Census competition on Kaggle. Link.
  • Private leaderboard. Merck competition on Kaggle. Link.
Naturally, ranking criteria are different: minutes for biathlon, seconds for athletics, weighted mean absolute error for Census, and R^2 for Merck. All but Merck rank lower values first (less is better). I converted the Merck metric to the same direction by taking ( 1 − R^2 ). That is, I ranked players in the Merck competition by the variance left unexplained by the models.
Then, in each competition, I took the first-place result as 100% and expressed the other results as a percentage of it. After subtracting 100, I had the graph.
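The two conversion steps can be sketched as below. The R^2 values and sprint times are hypothetical placeholders, not the actual leaderboard or Olympic results, and the function names are illustrative.

```python
def unexplained_variance(r_squared_values):
    """Turn R^2 (higher is better) into 1 - R^2 (lower is better)."""
    return [1.0 - r2 for r2 in r_squared_values]

def percent_behind_winner(results):
    """Take the best (lowest) result as 100% and return each gap in percent."""
    best = min(results)
    return [100.0 * result / best - 100.0 for result in results]

# Hypothetical sprint times in seconds (lower is better).
print(percent_behind_winner([10.0, 11.0, 12.5]))

# Hypothetical R^2 scores, converted before the same normalization.
print(percent_behind_winner(unexplained_variance([0.5, 0.45, 0.4])))
```

After this step every competition is on the same scale: the winner sits at zero and everyone else is measured by the percentage gap to the winner.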