Machine Learning for Economists: An Introduction

A crash course for economists who would like to learn machine learning.

Why should economists bother at all? Machine learning (ML) generally outperforms econometrics in predictions. And that is why ML is becoming more popular in operations, where econometrics’ advantage in tractability is less valuable. So it’s worth knowing the both, and choose the approach that suits your goals best.

An Introduction

These articles have been written by economists for economists. Other readers may not appreciate constant references to economic analysis and should start from the next section.

  1. Athey, Susan, and Guido Imbens. “NBER Lectures on Machine Learning,” 2015. A shortcut from econometrics to machine learning. Key principles and algorithms. Comparative performance of ML.
  2. Varian, “Big Data: New Tricks for Econometrics.” Some ML algorithms and new sources of data.
  3. Einav and Levin, “The Data Revolution and Economic Analysis.” Mostly about new data.

Applications

Practical applications get little publicity, especially if they are successful. But these materials do give an impression what the field is about.

Government

  1. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data Analysis describes their data management system and improvements in operations.
  2. UK Government, Tax Agent Segmentation.
  3. Data.gov, Applications. Some are ML-based.
  4. StackExchange, Applications.

Governments use ML sparingly. Developers emphasize open data more than algorithms.

Business

  1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things.
  2. Kaggle, Competitions. (Make sure you chose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard.

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

Emerging applications

Various prediction and classification problems. For ML research, see the last section.

  1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers.
  2. CMU ML Department, Student projects. More advanced problems, compared to CS229.

Algorithms

A tree of ML algorithms:

machine_learning_alogrithms
Source

Econometricians may check the math behind the algorithms and find it familiar. Mathematical background:

  1. Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning. Standard reference. More formal approach. [free copy]
  2. James et al., An Introduction to Statistical Learning. Another standard reference by the same authors. More practical approach with coding. [free copy]
  3. Kaggle, Metrics. ML problems are all about minimizing prediction errors. These are various definitions of errors.
  4. (optional) Mitchell, Machine Learning. Close to Hastie, Tibshirani, and Friedman.

For what makes ML different from econometrics, see chapters “Model Assessment and Selection” and “Model Inference and Averaging” in The Elements.

Handy cheat sheets by KDnuggets, Microsoft, and Emanuel Ferm. Also this guideline:

screenshot
Source

Software and Hardware

Stata does not support many ML algorithms. Its counterpart in the ML community is R. R is a language, so you’ll need more tools to make it work:

  1. RStudio. A standard coding environment. Similar to Stata.
  2. CRAN packages for ML.
  3. James et al., An Introduction to Statistical Learning. This text introduces readers to R. Again, it is available for free.

Python is the closest alternative to R. Packages “scikit-learn” and “statsmodels” do ML in Python.

If your datasets and computations get heavier, you can run code on virtual servers by Google and Amazon. They have ML-ready instances that execute code faster. It takes a few minutes to set up one.

Summary

I limited this survey to economic applications. Other applications of ML include computer vision, speech recognition, and artificial intelligence.

The advantage of ML approaches (like neural networks and random forest) over econometrics (linear and logistic regressions) is substantial in these non-economic applications.

Economic systems often have linear properties, so ML is less impressive here. Nonetheless, it does predict things better, and more of practical solutions get done in the ML way.

Research in Machine Learning

  1. arXiv, Machine Learning. Drafts of important papers appear here first. Then they got published in journals.
  2. CS journals. Applied ML research also appear in engineering journals.
  3. CS departments. For example: CMU ML Department, PhD dissertations.

Kaggle Challenges and the Value of Data Science

The impact of data on business outcomes is covered with buzzwords. The people in the loop say real things sometimes (examples here), but there’s a twist. Vendors picks only the best cases that sell their stuff, and their clients conceal successes to leave competitors guessing.

Let’s turn to Kaggle for balanced statistics. The Kaggle competitions put participants in the same conditions, which allow for easy comparison. The website maintains public and private leaderboards for each competition, based on test data. I use the set of public leaderboards available here.

Businesses hire many data scientists now. And the first interesting question to the data is: should I select talents carefully or hire people fast? Here’s a test: let’s look at the winning margins on the top of leaderboards. If they’re large, then the skill premium may be large as well, so it’s worth looking for better candidates and pay them more. This is the answer in one chart:

kaggle_winners_handicap

Each line represents a competition. The y-scale shows the final score of a participant as a fraction of the winner’s score. The score is a statistical metrics reflecting the quality of a (typically) prediction of interest, such as revenues, votes, or purchases. In some cases, the higher score is better, in the others, it’s the opposite. Lines are moving in the respective directions.

A single leaderboard from that chart may look like this (insurance-related competition):

kaggle_ranking

This case is slightly unusual because it has distinctive leaders with large handicaps. Still, those who try—the red dots—eventually succeed. The problem is, very few do try:

kaggle_submissions_per_team

In 4,000 cases, a team submitted only a single solution in a competition. Really serious attacks on the problem start with 10+ submissions, which few teams make.

Despite this, many participants end close to the winner:

kaggle_scores_kde

Looking from a different perspective on individual performance, I compare how the same users completed different competitions:

kaggle_place_matrix

These five races involved 500+ users each, and some users overlap. The overlapping shows the Kaggle core: the people who compete regularly and finish high (left-bottom corners of each subplot). Elsewhere, the relationships are weak.

These modest evidences suggest that people matter less and commitment more.

Does time matter? I take the means by the days remaining until the last submission:

kaggle_progress_5

This data belongs to the attempts to predict lemons at car auctions. The higher score is better here, and you see that additional submissions don’t improve the quality of an average submission. The leaders do improve slowly, however. Data scientists find low-hanging fruits in available data quickly and then fight for small improvements with much time investments. For one example, read this detailed journal by Kevin Markham.

A typical disclaimer would mention various limitations of these plots for decision making or of Kaggle competitions for real cases. Yes, while hiring, you need to know more than this. I would emphasize a different thing. Managers like intuitive decisions and confirm them with favorable evidences, including statistical insights. But having numbers this way isn’t the same as thinking that starts from numbers. Most businesses can get almost nothing from data scientists before their managers start thinking from numbers, not to numbers. And this transition from intuition to balanced evidences yields more than improving a single prediction by a few percentage points mentioned here.

Data and replication files on GitHub