I used Kaggle competitions to measure internal validity in data science. Validity is an issue because it’s easy to get good predictions with a long feature list. Researchers know about this problem; managers don’t. But managers have the data and researchers don’t. Fortunately, managers sometimes release it into the wild, as they do for Kaggle competitions. So let’s see whether predictions based on this data remain consistent.
Kaggle has a handy rule for detecting overfitting:
Kaggle competitions are decided by your model’s performance on a test data set. Kaggle has the answers for this data set, but withholds them to compare with your predictions. Your Public score is what you receive back upon each submission (that score is calculated using a statistical evaluation metric, which is always described on the Evaluation page). BUT: Your Public Score is being determined from only a fraction of the test data set — usually between 25-33%. This is the Public Leaderboard, and it shows some relative performance during the competition.
When the competition ends, we take your selected submissions (see below) and score your predictions against the REMAINING FRACTION of the test set, or the private portion. You never receive ongoing feedback about your score on this portion, so it is the Private leaderboard. Final competition results are based on the Private leaderboard, and the Winner is the person(s) at the top of the Private Leaderboard.
Teams can’t win by submitting a lucky model that did well on the public set, for example by building a million models with different parameters and picking the one that fits best. Instead, consistent solutions must perform well on both the public and private sets. This is the validity that makes the model useful.
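To see why the private set blocks the lucky-model strategy, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the function name `simulate_lucky_pick`, the 30% public share, and the use of pure coin-flip "models" on a binary target. It is not how Kaggle scores submissions, just a toy version of the public/private split described in the rule above.

```python
import random


def simulate_lucky_pick(n_models=1000, n_test=1000, public_share=0.3, seed=0):
    """Pick the best of many random models by public accuracy.

    Returns (public_accuracy, private_accuracy) of the selected model.
    All models are coin flips, so none of them has real predictive power.
    """
    rng = random.Random(seed)
    n_public = int(n_test * public_share)
    n_private = n_test - n_public
    # Hypothetical binary ground truth for the whole test set.
    truth = [rng.random() < 0.5 for _ in range(n_test)]

    best_public, best_private = -1.0, 0.0
    for _ in range(n_models):
        preds = [rng.random() < 0.5 for _ in range(n_test)]
        hits = [p == t for p, t in zip(preds, truth)]
        public = sum(hits[:n_public]) / n_public
        private = sum(hits[n_public:]) / n_private
        # Selection uses the public score only, as a participant would.
        if public > best_public:
            best_public, best_private = public, private
    return best_public, best_private


if __name__ == "__main__":
    public, private = simulate_lucky_pick()
    print(f"public accuracy of the pick:  {public:.3f}")
    print(f"private accuracy of the pick: {private:.3f}")
```

The selected model looks clearly better than chance on the public part only because it was chosen for that. On the private part it should fall back to roughly 50%, which is the drop the private leaderboard is designed to expose.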
I scraped both the public and private leaderboards from 165 competitions. Here is the correlation between the public and private scores for popular competitions:
Perfectly consistent solutions would have similar scores on both the public (horizontal axis) and private (vertical axis) leaderboards, so we would see a straight line. In some competitions it’s not very straight. Points that move away from the diagonal show that solutions don’t digest the new data well and that their predictive power is declining.
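The "straight line" idea can be put into a single number with a Pearson correlation between public and private scores. The sketch below is only an illustration: the helper `pearson` is written by hand, and the score lists are made-up toy numbers for five hypothetical teams, not values from the scraped leaderboards.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores, for illustration only.
public_scores = [0.91, 0.88, 0.85, 0.80, 0.74]
# A competition where solutions hold up on new data.
consistent_private = [0.90, 0.87, 0.85, 0.79, 0.75]
# A competition where the public leaders were mostly lucky.
overfit_private = [0.70, 0.82, 0.66, 0.79, 0.75]

if __name__ == "__main__":
    print("consistent:", round(pearson(public_scores, consistent_private), 3))
    print("overfit:   ", round(pearson(public_scores, overfit_private), 3))
```

A value close to 1 corresponds to points lying along the diagonal. A value near zero or below corresponds to the scattered clouds seen in the less consistent competitions.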
Correlation for places:
This plot is illustrative of individual skill. When a data scientist gets a high score by luck, he won’t retain the position on the private leaderboard. Otherwise, he retains the position, and if others do well too, they form a straight line. Again, we don’t see the straight line in some cases.
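For places rather than scores, a rank correlation is the natural summary. Below is a minimal sketch of Spearman's coefficient computed directly from leaderboard places. It assumes every team holds a unique place on each leaderboard (no ties), which real leaderboards do not always guarantee. The function name and the example places are hypothetical.

```python
def spearman_from_places(public_places, private_places):
    """Spearman rank correlation for two lists of places.

    Uses the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    which is valid only when places are untied in both lists.
    """
    n = len(public_places)
    d_squared = sum((a - b) ** 2 for a, b in zip(public_places, private_places))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up places for five hypothetical teams, listed in the same team order.
public_places = [1, 2, 3, 4, 5]
private_places = [2, 1, 3, 5, 4]

if __name__ == "__main__":
    print(spearman_from_places(public_places, private_places))
```

Teams that keep their places push the coefficient toward 1. A reshuffled leaderboard pushes it toward 0, and a fully reversed one gives -1.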
What are those “some cases”? One is “restaurant revenue prediction”: predicting revenues for restaurants given geography and demographics. That’s a typical business problem in the sense that the company has few observations and many determinants. Data analysis can’t help here until the company gets data on thousands of other restaurants. McDonald’s or Starbucks can get more; smaller chains can’t.
“The Analytics Edge” competition, a homework assignment from an MIT course on predicting successful blog posts, also suffers from too many factors affecting the outcome.
Sometimes limitations exist by design. Kaggle is running a stock price prediction competition now, but the suggested data can’t do the job. Algorithmic trading relies on handpicked cases with unique data models, and the competition offers just the opposite.
How the same data scientists perform across different competitions:
Yes, we would expect to find more straight lines, but they are not here. Instead, there are dense spots around the bottom left corners. Those are teams that broke into the top 100 on many occasions. They sort of did well without domain knowledge. However, where experts could be identified, they did very well, as in this competition sponsored by an Internet search company.
Many problems remain unfriendly to quants, so solutions may be valid but not powerful. This can be fixed with more information, but other approaches often take over. For example, insiders remain the best investors in the restaurant business. A person runs a local restaurant for ten years. He knows the competitors, prices, costs, margins, clients. Of course, he is a better investor than the chain owner, even if the chain owner has a formal model. Markets work well here and centralized analysis doesn’t.