Machine Learning for Economists: An Introduction

A crash course for economists who would like to learn machine learning.

Why should economists bother at all? Machine learning (ML) generally outperforms econometrics in prediction. That is why ML is becoming more popular in operations, where econometrics’ advantage in tractability is less valuable. So it’s worth knowing both and choosing the approach that suits your goals best.

An Introduction

These articles have been written by economists for economists. Other readers may not appreciate constant references to economic analysis and should start from the next section.

  1. Athey, Susan, and Guido Imbens. “NBER Lectures on Machine Learning,” 2015. A shortcut from econometrics to machine learning. Key principles and algorithms. Comparative performance of ML.
  2. Varian, “Big Data: New Tricks for Econometrics.” Some ML algorithms and new sources of data.
  3. Einav and Levin, “The Data Revolution and Economic Analysis.” Mostly about new data.

Applications

Practical applications get little publicity, especially when they are successful. But these materials do give an impression of what the field is about.

Government

  1. Bloomberg and Flowers, “NYC Analytics.” The NYC Mayor’s Office of Data Analytics describes its data management system and the resulting improvements in operations.
  2. UK Government, Tax Agent Segmentation.
  3. Data.gov, Applications. Some are ML-based.
  4. StackExchange, Applications.

Governments use ML sparingly. Developers emphasize open data more than algorithms.

Business

  1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things.
  2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to those solutions on the leaderboard.

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

Emerging applications

Various prediction and classification problems. For ML research, see the last section.

  1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers.
  2. CMU ML Department, Student projects. More advanced problems, compared to CS229.

Algorithms

A tree of ML algorithms:

[Figure: a tree of machine learning algorithms. Source.]

Econometricians may check the math behind the algorithms and find it familiar. Mathematical background:

  1. Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning. Standard reference. More formal approach. [free copy]
  2. James et al., An Introduction to Statistical Learning. Another standard reference, sharing two authors with The Elements. More practical approach, with coding. [free copy]
  3. Kaggle, Metrics. ML problems are all about minimizing prediction errors, and this page collects the common definitions of those errors (a few are sketched in code after this list).
  4. (optional) Mitchell, Machine Learning. Close to Hastie, Tibshirani, and Friedman.
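
To make these metrics concrete, here is a minimal sketch of three common ones, computed by hand with NumPy on made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # actual outcomes (toy data)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # model predictions

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error
mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error

# Log loss for a binary classifier: it punishes confident wrong predictions.
p = np.clip(np.array([0.9, 0.4, 0.7]), 1e-15, 1 - 1e-15)  # predicted P(y=1)
y = np.array([1, 0, 1])                                   # true labels
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"RMSE {rmse:.3f}, MAE {mae:.3f}, log loss {log_loss:.3f}")
```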

For what makes ML different from econometrics, see chapters “Model Assessment and Selection” and “Model Inference and Averaging” in The Elements.

Handy cheat sheets by KDnuggets, Microsoft, and Emanuel Ferm. Also this guideline:

[Figure: guideline screenshot. Source.]

Software and Hardware

Stata does not support many ML algorithms. Its counterpart in the ML community is R. R is a language, so you’ll need more tools to make it work:

  1. RStudio. A standard coding environment. Similar to Stata.
  2. CRAN packages for ML.
  3. James et al., An Introduction to Statistical Learning. This text introduces readers to R. Again, it is available for free.

Python is the closest alternative to R. Packages “scikit-learn” and “statsmodels” do ML in Python.
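
A minimal sketch of what this looks like in practice, assuming scikit-learn is installed: an ML workhorse (random forest) and an econometrics-style model (logistic regression) fitted on the same built-in dataset and compared on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A built-in dataset stands in for your own. The split mimics ML practice:
# fit on one part of the sample, judge accuracy on the other.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "random forest": RandomForestClassifier(random_state=0),
    "logit": make_pipeline(StandardScaler(), LogisticRegression()),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # out-of-sample accuracy
```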

If your datasets and computations get heavier, you can run code on virtual servers from Google and Amazon. They offer ML-ready instances that execute code faster, and it takes only a few minutes to set one up.

Summary

I limited this survey to economic applications. Other applications of ML include computer vision, speech recognition, and artificial intelligence.

The advantage of ML approaches (like neural networks and random forests) over econometric ones (linear and logistic regressions) is substantial in these non-economic applications.

Economic systems often have linear properties, so ML is less impressive here. Nonetheless, it does predict things better, and more practical solutions get built the ML way.

Research in Machine Learning

  1. arXiv, Machine Learning. Drafts of important papers appear here first. Then they get published in journals.
  2. CS journals. Applied ML research also appears in engineering journals.
  3. CS departments. For example: CMU ML Department, PhD dissertations.

Consistency in Data Science

I took Kaggle competitions to measure internal validity in data science. Validity is an issue because it’s easy to get good predictions with a long feature list. Researchers know about this problem; managers don’t. But managers have the data, and researchers don’t. Fortunately, managers sometimes release it into the wild, as they did for those Kaggle competitions. So let’s see whether predictions based on this data remain consistent.

Kaggle has a handy rule for detecting overfitting:

Kaggle competitions are decided by your model’s performance on a test data set. Kaggle has the answers for this data set, but withholds them to compare with your predictions. Your Public score is what you receive back upon each submission (that score is calculated using a statistical evaluation metric, which is always described on the Evaluation page). BUT: Your Public Score is being determined from only a fraction of the test data set — usually between 25-33%. This is the Public Leaderboard, and it shows some relative performance during the competition.

When the competition ends, we take your selected submissions (see below) and score your predictions against the REMAINING FRACTION of the test set, or the private portion. You never receive ongoing feedback about your score on this portion, so it is the Private leaderboard. Final competition results are based on the Private leaderboard, and the Winner is the person(s) at the top of the Private Leaderboard.

Teams can’t win by submitting a lucky model that did well on the public set, say, by making a million models with different parameters and then choosing the best fit. Instead, consistent solutions must perform well on both the public and private sets. This is the validity that makes a model useful.
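
A toy simulation (my own, not Kaggle’s code) shows why the split works. Give every model a true skill plus independent noise on each set, pick the best public performer, and watch it regress on the private set:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 100_000  # "a million models", scaled down a bit

skill = rng.normal(0.0, 1.0, n_models)            # true model quality
public = skill + rng.normal(0.0, 1.0, n_models)   # noisy public score
private = skill + rng.normal(0.0, 1.0, n_models)  # noisy private score

lucky = np.argmax(public)  # the model that tops the public leaderboard
print("private rank of the public winner:",
      1 + np.sum(private > private[lucky]))  # far from first place
```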

I scraped both public and private leaderboards from 165 competitions. Here is the correlation between public and private scores for popular competitions:

[Figure: public vs. private scores for the most popular competitions.]

Perfectly consistent solutions would have similar scores on the public (horizontal axis) and private (vertical axis) leaderboards: we would see a straight line. The line is not very straight in some competitions. Points moving away from the diagonal mean that solutions don’t digest the new data well and their predictive power declines.
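
The per-competition statistic behind these plots is simple. A sketch, assuming each scraped leaderboard is a CSV with hypothetical columns “public” and “private”:

```python
import pandas as pd

lb = pd.read_csv("leaderboard.csv")  # placeholder path for one competition

print(lb["public"].corr(lb["private"]))                     # score correlation
print(lb["public"].corr(lb["private"], method="spearman"))  # rank correlation
```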

Correlation for places:

[Figure: public vs. private places for the most popular competitions.]

This plot illustrates individual skill. When a data scientist gets a high score by luck, he won’t retain his position on the private leaderboard. Otherwise, he retains the position, and if others do as well, they form a straight line. Again, we don’t see a straight line in some cases.

What are those “some cases”? One is “restaurant revenue prediction”: predicting revenues for restaurants given geography and demographics. That’s a typical business problem in the sense that the company has few observations and many determinants. Data analysis can’t help here until the company gets data on thousands of other restaurants. McDonald’s or Starbucks can get more; smaller chains can’t.

“The Analytics Edge” competition, an MIT course homework on predicting successful blog posts, also suffers from too many factors affecting the outcome.

Sometimes limitations exist by design. Kaggle is running a stock price prediction competition now, but the suggested data can’t do the job. Algorithmic trading relies on handpicked cases with unique data models, and the competition offers just the opposite.

How the same data scientists perform across different competitions:

[Figure: places of the same data scientists across different competitions.]

Yes, we should find more straight lines, but they are not here. Instead, there are dense spots around the bottom-left corners. Those are teams that broke into the top 100 on many occasions. They did reasonably well without domain knowledge. Where experts can be identified, however, they did very well, as in this competition sponsored by an Internet search company.

Many problems remain unfriendly to quants, so solutions may be valid but not powerful. More information can fix this, but other approaches often take over. For example, insiders remain the best investors in the restaurant business. A person who has run a local restaurant for ten years knows the competitors, prices, costs, margins, and clients. Of course, he is a better investor than a chain owner, even if the chain owner has a formal model. Markets work well here, and centralized analysis doesn’t.

How Big Data Informs Economics

In A Fistful of Dollars, Clint Eastwood challenges Gian Maria Volonté with the words, “When a man with a .45 meets a man with a rifle, you said, the man with a pistol’s a dead man. Let’s see if that’s true. Go ahead, load up and shoot.”

Those are the right words to challenge big data, which recently reappeared in economics debates (Noah Smith, Chris House via Mark Thoma). Big data is a rifle, but not necessarily a winning one. Economists need special reasons to abandon small datasets and start messing with more numbers.

Unlike business, which only recently discovered the sexiest job of the future, economists have been doing analytics for the last 150 years. They have dealt with “big data” for half of that period (I count from 1940, when the CPS started). So, how can the new big data be useful to them?

Let’s find out what big data offers. First of all, more information, of course. Notable cases include predicting the present with Google and Joshua Blumenstock’s use of mobile phones in development economics. Less notable cases encounter the same problem: a decline in the quality of data. Compare the long surveys that development economists collect when they run experiments with what Facebook dares to ask its most loyal users. Despite Facebook having 1.5 bn observations, economists end up with much better evidence. That’s not about depth alone. Social scientists ask clearer questions, find representative respondents, and take nonresponse seriously. If you do a responsible job, you have to construct smaller but better samples like this.

Second, big data comes with its own tools, which, like econometrics, are deeply rooted in statistics but ignorant about causation:

[Figure: big data tools.]

The slogan is: to predict and to classify. But economics does care about cause and effect relations. Data scientists dispense with these relations because the professional penalty for misidentification is lower than in economics. And, honestly, at this stage, they have more important problems to solve. For example, much time still goes into capacity building and data wrangling.

Hal Varian shows a few compelling technical examples in his 2014 paper. One example comes from Kaggle’s Titanic competition:

[Figure: a regression tree from Varian (2014), “Big Data: New Tricks for Econometrics.”]

The task requires predicting whether a person survived the wreck. The chart shows that children had better chances of surviving than elderly passengers, while for everyone else age didn’t matter. A regression tree captures this nonlinearity in age; a logit regression does not. Hence, the big data tool does better than the economics tool.

But an economist who remembers to “always plot the data” is ready for this. As with other big data tools, it’s useful to know the trees, but something similar is already available on the econometrics workbench.
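
Varian’s point is easy to reproduce. A sketch on simulated data (not the actual Titanic file), where survival is high for children and flat for everyone else:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
age = rng.uniform(1, 70, 2000)
survived = rng.binomial(1, np.where(age < 9, 0.7, 0.4))  # children survive more
X = age.reshape(-1, 1)

tree = DecisionTreeClassifier(max_depth=2).fit(X, survived)
logit = LogisticRegression().fit(X, survived)

grid = np.array([[5.0], [30.0], [60.0]])  # a child, an adult, an older adult
print("tree :", tree.predict_proba(grid)[:, 1].round(2))   # jumps at the split
print("logit:", logit.predict_proba(grid)[:, 1].round(2))  # smooth, misses it
```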

There’s nothing ideological in these comments on big data. More data potentially available for research is better than less data. And data scientists do things economists can’t. The objection is the following. Economists mostly deal with problems of two types. Type one: figuring out how a few big variables, like inflation and unemployment, interact with each other. Type two: making practical policy recommendations for people who typically read nothing beyond executive summaries. While big data can inform top-notch economics research, these two problems are easier to solve with simple models and small data. So a pistol turns out to be better than a rifle.

Kaggle Challenges and the Value of Data Science

The impact of data on business outcomes is covered with buzzwords. People in the loop sometimes say real things (examples here), but there’s a twist: vendors pick only the best cases that sell their stuff, and their clients conceal successes to leave competitors guessing.

Let’s turn to Kaggle for balanced statistics. Kaggle competitions put participants in the same conditions, which allows for easy comparison. The website maintains public and private leaderboards for each competition, based on test data. I use the set of public leaderboards available here.

Businesses are hiring many data scientists now, and the first interesting question for the data is: should I select talent carefully or hire people fast? Here’s a test: look at the winning margins at the top of the leaderboards. If they’re large, the skill premium may be large as well, so it’s worth looking for better candidates and paying them more. Here is the answer in one chart:

[Figure: final scores relative to the winner’s, one line per competition.]

Each line represents a competition. The y-axis shows a participant’s final score as a fraction of the winner’s score. The score is a statistical metric reflecting the quality of a prediction of interest, such as revenues, votes, or purchases. In some cases a higher score is better; in others, the opposite. The lines move in the respective directions.

A single leaderboard from that chart may look like this (insurance-related competition):

[Figure: a single leaderboard from an insurance-related competition.]

This case is slightly unusual because it has distinctive leaders with large handicaps. Still, those who try—the red dots—eventually succeed. The problem is, very few do try:

[Figure: submissions per team.]

In 4,000 cases, a team submitted only a single solution to a competition. Really serious attacks on a problem start at 10+ submissions, and few teams make that many.

Despite this, many participants end close to the winner:

[Figure: density of final scores.]

Looking at individual performance from a different perspective, I compare how the same users finished in different competitions:

[Figure: places of the same users across five competitions.]

These five races involved 500+ users each, and some users overlap. The overlap reveals the Kaggle core: people who compete regularly and finish high (the bottom-left corners of each subplot). Elsewhere, the relationships are weak.

This modest evidence suggests that talent matters less and commitment matters more.

Does time matter? I take mean scores by the number of days remaining until the last submission:

[Figure: mean scores by days remaining until the last submission.]

This data comes from attempts to predict lemons at car auctions. A higher score is better here, and you can see that additional submissions don’t improve the quality of the average submission. The leaders do improve, though slowly. Data scientists quickly find the low-hanging fruit in the available data and then fight for small improvements at a large cost in time. For one example, read this detailed journal by Kevin Markham.
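
The aggregation behind the plot is a one-liner. A sketch, assuming a submissions table with hypothetical columns “days_left” and “score”:

```python
import pandas as pd

subs = pd.read_csv("submissions.csv")  # placeholder path

mean_by_day = subs.groupby("days_left")["score"].mean()  # average submission
best_by_day = subs.groupby("days_left")["score"].max()   # the leaders' frontier
print(pd.DataFrame({"mean": mean_by_day, "best": best_by_day}))
```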

A typical disclaimer would mention the limitations of these plots for decision making, or of Kaggle competitions as stand-ins for real cases. Yes, when hiring, you need to know more than this. But I would emphasize a different thing. Managers like intuitive decisions and confirm them with favorable evidence, including statistical insights. Having numbers this way isn’t the same as thinking that starts from numbers. Most businesses can get almost nothing from data scientists until their managers start thinking from numbers, not to numbers. And this transition from intuition to balanced evidence yields more than the few-percentage-point improvement in a single prediction mentioned here.

Data and replication files on GitHub

Athletes vs. data scientists

Competitions among athletes have quite a long history. Armchair sports don’t: chess, which comes to mind first, became an important sport only in the 20th century.

An even younger example is data-related competitions. Kaggle, CrowdANALYTIX, and HackerRank are major platforms in this case.

But do data scientists compete as furiously as athletes? Well, in some cases, yes. Here’s one example:

[Figure: results of four competitions; see the appendix for how the datasets were constructed.]
The Merck and Census competitions have about the same number of participants and comparable rewards (though winners of the Census competition were restricted to US citizens). It may seem surprising that their results look so different. I’ll get back to this in the next post on data competitions.

Technically, all the competitions look alike. The lower bound is zero (minutes, seconds, errors), though only the comparison with the baseline makes sense. Over time, the baseline for sports declined:

[Figure: winning time for the 100m. Source.]

A two-second (-18%) improvement in 112 years.

Competitions on a single dataset look like this (more is better):

[Figure: restricted sample taken from chmullig.com.]

In general, the quality of predictions increases substantially over the first few weeks. Then marginal returns to effort decrease. That’s interesting, because participants make hundreds of submissions to beat numbers three places beyond the decimal point. That’s a lot of work for a normally modest monetary reward. And, well, the monetary reward makes no sense at all: a prize of $25–50K goes to winners who compete with 250 other teams. These are thousands of hours of data analysis, basically unpaid. This unpaid work doesn’t sound attractive even to sponsors (hosts), which are very careful about paying for crowdsourcing. So, yes, it’s a sport, not work.
Athletics has no overfitting, but it is an issue in data competitions. For example, here is a comparison between the public and private rankings for one competition:

| Username         | Public rank | Private rank |
|------------------|-------------|--------------|
| Jared Huling     | 1           | 283          |
| Yevgeniy         | 2           | 7            |
| Attila Balogh    | 3           | 231          |
| Abhishek         | 4           | 6            |
| Issam Laradji    | 5           | 9            |
| Ankush Shah      | 6           | 11           |
| Grothendieck     | 7           | 50           |
| Thakur Raj Anand | 8           | 247          |
| Manuel Días      | 9           | 316          |
| Juventino        | 10          | 27           |

(Source)

The public rank is computed from predictions on the public dataset. The private rank is based on a different sample, unavailable before the finals. The difference is largely attributed to overfitting noisy data (and to submitting the best-performing random results).

In data competitions, your training is not equal to your performance. That’s true of sports as well: athletes break world records during training sessions and then finish far from the top in real competitions.

This has a perfectly statistical explanation, apart from psychology. In official events, the sample is smaller: a single trial, mostly. Several trials are allowed only in non-simultaneous sports, like the high jump. During training, the sample is many times larger, and you’re more likely to find an extreme result in a larger sample.
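
A quick simulation checks this claim: the expected maximum grows with the sample size, so a block of training attempts is far more likely to contain an extreme result than a single official trial.

```python
import numpy as np

rng = np.random.default_rng(0)
results = rng.normal(0.0, 1.0, (10_000, 100))  # 100 training attempts each

print("average single trial:", results[:, 0].mean().round(2))       # ~0.0
print("average best of 100: ", results.max(axis=1).mean().round(2)) # ~2.5
```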
Anyway, though these contests look like fun and games, they’re also simple models for understanding complex processes. Measuring performance this way has value for real life. For instance, hiring in most firms is a single-trial interview, and HR folks use simple heuristic rules for candidate assessment. When candidates are nervous, they fail their one trial.

Some firms, like major IT companies, do more interviews. Not because they want to help candidates, but because they have more stakeholders whose opinions matter. Still, this policy increases the number of trials, so these companies hire smarter.

We don’t have many counterfactuals for HR failures, but we can see how inefficient single trials are compared to multiple trials in sports.

Appendix: The data for the first graph

This graph was constructed in the following way.

First, I took the data for major competitions:

  • Athletics, 100m, men. 2012 Olympic Games in London. Link.
  • Biathlon, 10km, men. 2014 Olympic Games in Sochi. Link.
  • Private leaderboard. Census competition on Kaggle. Link.
  • Private leaderboard. Merck competition on Kaggle. Link.
Naturally, the ranking criteria differ: minutes for biathlon, seconds for athletics, weighted mean absolute error for Census, and R^2 for Merck. All but Merck use descending ranking, where less is better. I converted the Merck metric to descending ranking by taking (1 − R^2). That is, I ranked players in the Merck competition by the variance left unexplained by their models.
Then, in each competition, I took the first place’s result as 100% and expressed the other results as percentages of it. After subtracting 100, I had the graph.
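
A sketch of this normalization on made-up numbers ("merck_r2" stands in for the R^2 values from the Merck leaderboard):

```python
import numpy as np

merck_r2 = np.array([0.52, 0.49, 0.45, 0.30])  # hypothetical R^2 values
descending = 1 - merck_r2                      # variance left unexplained

relative = 100 * descending / descending[0]    # winner = 100%
handicap = relative - 100                      # 0 for the winner
print(handicap.round(2))                       # [ 0.    6.25 14.58 45.83]
```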