Challenges for Online Lending Marketplaces

In 2016, online lending marketplaces (OLMs) took a hit. Lending Club disclosed falsification of loan data and misconduct by its CEO. Graduate Leverage’s principal was sentenced to nine years for stealing $16mn of investors’ money. Another marketplace cut one fourth of its staff. On Deck Capital reduced its growth projections.

Poor performance by the leaders is not an exception. Lending marketplaces in their current form share several common problems.

Tougher regulations

Marketplaces’ misbehavior and increasing volumes of credit attract more scrutiny from federal and state regulators:

  • Interstate operations may require honoring state anti-usury laws that cap interest rates. Ongoing litigation will determine whether loan resellers, such as Lending Club (LC), should be excluded from the federal anti-usury law exemptions.
  • The US Treasury, FDIC, and CFPB have started requesting more information from lending marketplaces and their partners.
  • The Equal Credit Opportunity Act protects borrowers from discrimination. Banks may avoid discrimination lawsuits by pooling discriminated groups together with other borrowers at higher interest rates. Some online marketplaces simply post loan applications, so disadvantaged borrowers are not rejected directly but may simply be unable to find buyers. In recent years, OLMs moved toward more centralized models where loans are packaged and sold by the platform. This can make the platform responsible for unbalanced portfolios of loans. It also applies to selective marketing practices. SoFi, for example, accepts borrowers only from top-tier universities.
  • The risk of being classified as an investment company or a broker-dealer, with more compliance requirements and disclosures.

These regulations increase operating costs and the net interest spread. Which brings us to the other drivers of costs.

Higher costs

Some OLMs present themselves as low-cost, efficient lenders (see Figures 1 and 2 in the appendix). This is doubtful:

  • Being lenders themselves, Santander and JP Morgan bought more than $2bn of Lending Club loans. It means that LC’s operating expense of 2-3% goes on top of the “traditional lender” operating expense of 5-7% (Figure 1). Institutional investors generate two thirds of LC’s sales.
  • Instead of removing intermediaries, as claimed, OLMs add themselves and issuing banks to the chain between borrowers and lenders. See Figure 3: “Investors” buying private instruments (certificates and loans) on the right are professional investors: banks, investment funds, and funds of funds. Each takes a commission and creates costs for its clients, so you can extend this scheme to more intermediaries before the payments reach the end saver. Public notes sold directly to retail investors generate only one third of sales.
  • No more NOLs. Many OLMs operate with losses that allow them to accumulate tax deductions for the future. Although this is a general feature of startups, carryforwards in lending contribute more to profitability because of thin margins and slow growth.


  • Institutional investors are vital to marketplaces. OLMs depend on them for credit lines, sales, and funding. Institutional investors enter slowly and quit fast, as happened with LC in May. Such exits create holes in income statements because marketplaces can’t cut fixed costs in proportion to the decline in originated loans.
  • Banks and investment funds may also quit in response to regulatory pressure on marketplaces. Marketplaces enjoy loose regulations, but recent trends are killing this advantage.
  • The buy side uses leverage — another contradiction to the initial idea of the non-levered, non-fractional marketplace.
  • Retail investors don’t do due diligence. OLMs offer retail investors diversified portfolios of personal loans. One investor lends to dozens of borrowers. If he invests $10,000 in 100 prime loans, he earns $500 a year, or $5 per borrower. Due diligence becomes uneconomical with these numbers. So retail investors can’t add personal expertise to the screening process. The platform sets rates itself and sells loans in bulk.
  • The interest rate advantage is overstated. See the footnotes of Figure 2: the borrowing rate of 20.7% comes from LC’s customer survey. That is a credit-card-level rate. Interpreting it is impossible without LC telling the public how it conducted the survey. Secondly, the ROI of 0.06% is a straw man. LC is not an FDIC-insured depository institution. LC sells illiquid unsecured loans. The appropriate asset for comparison would be corporate bonds, which yield 5% for comparable risk. Corporate bonds have a performance history and can be sold without paying the 1% penalty set by LC’s secondary market.
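
The due-diligence arithmetic above can be checked in a few lines. This is a minimal sketch: the 5% yield is implied by the $500-on-$10,000 example in the text, while the hourly value of the investor's time is a hypothetical number chosen only for illustration.

```python
# Per-loan economics of a diversified retail portfolio.
portfolio = 10_000      # dollars invested (from the text)
n_loans = 100           # number of prime loans (from the text)
annual_yield = 0.05     # implied by $500 a year on $10,000

income_total = portfolio * annual_yield
income_per_loan = income_total / n_loans

# Hypothetical assumption: the investor values an hour of his time at $50.
hourly_value = 50.0
minutes_per_loan = income_per_loan / hourly_value * 60

print(f"Income per year: ${income_total:,.0f}")
print(f"Income per loan: ${income_per_loan:,.2f}")
print(f"Break-even review time per loan: {minutes_per_loan:.0f} minutes a year")
```

Under these assumptions the investor can afford only a few minutes a year per borrower, which is the sense in which screening by retail investors does not pay.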


  • The Fed rate hike. OLMs emerged in the late 2000s when the interest rate had been cut to zero and many borrowers could refinance their loans on better terms. As the Fed raises the rate, refinancing and borrowing in general lose their charm.
  • Fewer prime borrowers. Online marketplaces enjoyed an early influx of tech-savvy, well-off borrowers. Now the borrower base deteriorates and the default rate increases.
  • New data won’t help find better borrowers. Marketplace underwriters use credit scores from TransUnion, Experian, and Equifax. That’s what banks use. Some OLMs also employ unconventional data: borrower’s university, degree, social media activity. That’s barely an advantage. (1) Alternative data is often correlated with credit scores, which makes it redundant for screening; (2) Banks have access to the same new sources of data, and big banks have much more than that; (3) Some indicators imply discrimination: if you lend to the students of WASPy universities only, expect lawsuits; (4) Fraud based on falsified alternative data, though this may be the most innocent problem. In general, alternative data has a neutral impact on competitive advantage in finance: everyone can have it.

No international growth

American startups get high valuations because they can expand internationally. Lending is a highly regulated industry, and each country has peculiar rules, up to bans on foreign-owned banks. Online lenders currently struggle with state-specific regulations, and serving clients outside the United States is even more problematic.


Lending marketplaces comprise 8 out of 21 financial companies in the WSJ Billion Dollar Club of young private companies. Lending Club and On Deck are public and marked to market. These companies represent two general models of online lending: the marketplace and the single-lender credit line. They also target different borrowers (consumer vs small business). Consumer lending marketplaces are hit hardest by all the above problems. But small business lenders also promise to help underbanked business owners, as if banks had never tried to finance SMEs before. Anyway, the private equivalents of Lending Club and On Deck will see down rounds after this year’s events.

OLMs did not create a new sector with high margins. They’re ending up paying regulatory and marketing costs in a commodity market. Still, online lenders have (had?) hi-tech-ish multiples. Perhaps the better comparables would be credit unions and regional banks.

Banks ended up being big after years of M&As. They actually addressed the mentioned problems by growing (and by being lucky enough to survive). Online lending marketplaces don’t have a particular edge against traditional banks. So their organic growth follows the industry’s average. Exit through acquisition? Goldman Sachs recently opened its own personal lending platform instead of buying an existing player. It seems independent marketplaces need banks more than banks need them. Which makes acquisitions also unlikely.


Figure 1: Lending Club Cost Advantage. Source: LC investor presentation.
Figure 2: Lending Club Rates Advantage. Source: LC investor presentation.
Figure 3: Lending Club Business Model. Source: LC 2015 10k.

Disclaimer: Not investment advice. For information purposes only. No affiliation with or material interest in the companies mentioned. The future tense is not a promise.

Better Models for Education

A cautionary tale

Four years ago leading universities jumped on the bandwagon of massive open online courses. They haven’t gotten much more attention since then:

Google Trends

This is international data. In the US, interest in MOOCs declined, even though respectable institutions kept offering new courses on various topics. Is it a marketing failure, which the best universities would be proud of, or bad educational technology?

Let’s see. A typical MOOC consists of

  • lecture slides and exercises
  • a talking head that reads the slides
  • a discussion board, barely alive
  • an optional certificate

Although many professors have good presentation skills, this technology is no different from a textbook. In fact, ten years before MOOCs, MIT offered a much better solution: OpenCourseWare, a guide to studying like an MIT student. It wasn’t tied to particular enrollment dates, pace, or lecturer. Instead, it showed what a diligent student should complete in one semester.

MOOCs became popular after Sebastian Thrun and Peter Norvig released their open AI course. More than 100,000 students enrolled, and universities decided to supply more courses. But the AI course was backed by exciting new technologies like self-driving cars and text recognition, while a standard university course covered boring rudiments available in any textbook.

The quality of online courses didn’t improve over time. Each professor appreciated his own brand and didn’t collaborate with colleagues from other universities. So each one had his own course, that is, slides and exercises. For example, a large MOOC provider offers 609 “data science” courses. Students enroll in just a dozen of them, when the lecturer already has a very good reputation. Like Andrew Ng and his machine learning course based on Stanford’s CS229 and available online since 1999.

The history of MOOCs shows how a lot of smart people keep making things that don’t work. Interestingly, it has to do with their core competencies and not online education itself.

Because someone else did better.

Y Combinator: Engaging educators

University professors have little motivation to work with students. Richard Feynman described teaching as “something [to do] so that when I don’t have any ideas and I’m not getting anywhere I can say to myself, ‘At least I’m living; at least I’m doing something; I’m making some contribution’—it’s just psychological.” So when it comes to research vs teaching, many professors choose research.

Anyway, most universities teach future workers, not researchers or educators. Normally, you expect workers teaching workers. Workers raised by professors are like Tarzan raised by gorillas. An innocent problem in a primary school, but the difference in interests increases as education progresses.

How to align the interests of educators and students? By involving the educator in the student’s real passion. That’s what startup accelerators do.

Y Combinator, the most prestigious of accelerators, invests in early-stage startups and puts their founders through a 3-month training program. The 5% stake that Y Combinator acquires for $120K ensures that the mentor’s wellbeing depends on the performance of his students.

Mentorship and apprenticeship are old business practices, of course. Startup accelerators add a social component by bringing many founders to one place. They also escape the research lab hierarchy, in which a senior faculty member secures funding and employs graduate students as cheap labor.

The MIT Media Lab is perhaps the most famous academic lab that operates like a startup accelerator. Professors join the companies founded by their graduates. That’s not a general practice in other universities, in which offering a stake for better mentoring sounds like an insult.

Khan Academy: Engaging students

Engaging students is the second most important task of an educator after engaging himself. This task takes time, so schools and colleges prefer to get rid of the least motivated troublemakers instead. Many leave college because they see better options. How can educators decrease attrition?

Khan Academy was a one-man project done by a hedge fund analyst in his spare time. The founder taught math on YouTube years before universities started publishing videos of their own classes.

But arguably the best part of Khan Academy appeared later, when students started solving exercises online and getting immediate feedback. This had been done before, but Khan Academy polished the technology with data:

In brief, Khan Academy sets the sequence of exercises such that students are not discouraged by frequent failures. It’s part of Khan Academy’s gamification mechanism, which keeps learners motivated throughout K-12.

Stack Exchange: Asking and answering questions

Good educators teach the Socratic way, by asking leading questions. This technique does not scale well in a class with 100+ students. A good alternative is a Q&A website, like StackExchange or Quora.

StackExchange covers many academic subjects up to the graduate level. Its community encourages good questions and punishes ill-prepared ones. Over time, a motivated person learns how to do preliminary research and ask the right questions.

Answering these questions makes more sense than standardized tests or oral exams. Other advantages? Real problems, clear rewards, faster feedback.

Wikipedia: Accumulating knowledge

Wikipedia is fifteen years old, but the education system integrated only one half of it: students copy-paste Wikipedia content into their essays. It should be the other way around! Instead of assigning essays that no one reads, university professors could assign editing Wikipedia articles.

That’s a real contribution. Wikipedia editors check changes and reject the bad ones. It’s easy to track these edits. The Wikimedia Foundation is always looking for new editors and broader coverage. The content goes straight onto the front page of Google Search.

Despite all the advantages, I saw very few professors who practice this. That’s again about engaging educators, rather than students.

GitHub: Offering creative assignments

GitHub became a Wikipedia for code. Anyone can contribute to a project of interest. The list of open issues suggests possible contributions.

Like Wikipedia and StackExchange, GitHub addresses genuine problems, not synthetic exercises. Software engineers dominate, but any STEM project suits this platform.

Kaggle: Encouraging competition

Though the idea of 3,500 statisticians competing for $50,000 may seem irrational, Kaggle attracted thousands of math-savvy folks to practical problem solving. “Practical” is Kaggle’s key innovation. Competitive problem solving existed before in international olympiads and on websites like HackerRank. Kaggle made such competitions useful, massive, and scalable.

Some CS departments encourage students to take part in Kaggle competitions. Why here and not on Wikipedia or GitHub? Kaggle challenges look much more like standardized testing with a clear-cut ranking. No need to evaluate whether the student made a useful contribution or just cheated.

Code4Startup: Learning for doing

Learning by doing is an old, popular, and effective technique. But task assignment is a trap. Stupid tasks kill motivation, and the rest dies by itself.

The simplest way to improve motivation is to increase the reward. Startup success stories turned out to be a very effective one. More importantly, they are free.

Code4Startup turned this idea into a service. They offer courses showing users how to make a clone of a successful startup. Unlike MOOCs, these courses show how to turn coding and marketing skills into a useful product.

Code School and Treehouse take a similar approach.

An honorable mention goes to McDonald’s and Walmart. These companies employ and train people whom top universities would never admit (and whom other universities get rid of after admission). Those who complain about students paying them $50K a year should try teaching a person who works for the minimum wage.

A comment

The services I mentioned have nothing to do with the formal education system. Many of them are not even labeled as educational. But they do what colleges are supposed to do, and do it better.

Three more things. (1) These services never associated themselves with colleges. More importantly, none attempted to reform the formal educational system. That’d be an interesting waste of time, as it was for John Dewey and other reformers. (2) These services scale and depend less and less on the limited supply of really good professors. (3) These services specialize. They don’t teach everything; they make narrow tools to improve specific skills.

Comparing their popularity with that of top universities (MIT is much more popular outside the US; other terms are insensitive to geography):

Google Trends: The United States

Selected services (the two plots have different vertical scales and only trends are comparable; for more, check the links):

Google Trends

So if education is changing, it’s changing outside traditional institutions.

Can Learning Change Your Mind?

Adam Ozimek asks, “Can Economics Change Your Mind?”

In this skeptical view, economists and those who read economics are locked into ideologically motivated beliefs—liberals versus conservatives, for example—and just pick whatever empirical evidence supports those pre-conceived positions. I say this is wrong and solid empirical evidence, even of the complicated econometric sort, changes plenty of minds.

Just to make myself clear, only a human himself can change his mind, and economics can’t. And since the question is basically about learning, not economics, I reformulate the question accordingly: Can Learning Change Your Mind?

The rest turns out to be simple. If I want to change my mind big time, I take an issue I know nothing about and read some research. There will be surprises.

But if I happen to discover big surprises in the area of my competence, I become suspicious. Evidence doesn’t drop down like Newtonian apples. It flows like a river. Then learning is a flow, too. It’s a continuous process that brings no surprises if you learn constantly.

Where does continuity come from? First, from discounting new studies. New studies have standard limitations, even when they are factually and methodologically correct. The most frequent limitations concern long-term relationships, external validity, and general equilibrium effects. Second, from the nature of the economy itself. Research in economics often speaks in yes-no terms, while economic processes are continuous. For marketing purposes, researchers formulate questions and answers like “Does X cause Y?”, which is a yes-no question tested with regressions. But causation is not about p-values in handpicked models. Causation is also the degree of impact. And this degree jumps wildly even across different specifications of a single model. That means I need a lot of similar studies to change my mind about X and Y.
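
To illustrate how the estimated degree of impact can move across specifications, here is a minimal simulation sketch. All numbers (the true effects of 1.0, the 0.8 correlation between regressors, the sample size) are assumptions made up for illustration, not estimates from any study.

```python
import random

random.seed(0)
n = 5000
rho = 0.8  # assumed correlation between the two regressors

# Simulate y = 1.0*x + 1.0*z + noise, where z is correlated with x.
x, z, y = [], [], []
for _ in range(n):
    xi = random.gauss(0, 1)
    zi = rho * xi + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
    x.append(xi)
    z.append(zi)
    y.append(1.0 * xi + 1.0 * zi + random.gauss(0, 1))

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Specification 1: regress y on x only (z is omitted).
beta_short = cov(x, y) / cov(x, x)

# Specification 2: regress y on x and z (closed-form two-regressor OLS).
det = cov(x, x) * cov(z, z) - cov(x, z) ** 2
beta_long = (cov(x, y) * cov(z, z) - cov(z, y) * cov(x, z)) / det

print(f"Effect of x with z omitted:  {beta_short:.2f}")
print(f"Effect of x with z included: {beta_long:.2f}")
```

In this setup the first coefficient converges to 1 + 0.8 and the second to 1, so the same data give nearly a twofold difference in the "degree of impact" depending on one modeling choice.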

Removing one letter from Bertrand Russell, “One of the symptoms of approaching nervous breakdown is the belief that one work is terribly important.”

Going back to Adam’s initial (yes-no) question, I’d say yes, some economists “are locked into ideologically motivated beliefs,” and yes, some economists produce knowledge that other people can learn from. These two groups overlap, but it’s no obstacle to good learning.

PS: In his post, Adam Ozimek also asked readers to submit studies that changed their minds. Since I see mind-changing potential as a function of novelty, I’d recommend a simple source of mind-changing studies: visit RePEc’s top cited studies list and read carefully the papers you haven’t read yet. There will be surprises.

Software for Researchers: New Data and Applications

The tools mentioned here help manage reproducible research and handle new types of data. Why should you go after new data? New data provides new insights. For example, the recent Clark Medal winners used unconventional data in their major works. This data came large and unstructured, so Excel, Word, and email wouldn’t do the job.

I write for economists, but other social scientists can also find these recommendations useful. These tools have a steep learning curve and pay off over time. Some improve small-data analysis as well, but most gains come from new sources and real-time analysis.

Each section ends with a recommended reading list.

Standard Tools

LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage the references. LyX supports Zotero via another plugin.

Stata and Matlab do numerical computations. Both are paid and have good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.

Mathematica does symbolic computations. Sage is a free alternative.

  1. Frain, “Applied LaTeX for Economists, Social Scientists and Others.” Or a shorter intro to LaTeX by another author.
  2. UCLA, Stata Tutorial. This tutorial fits the economist’s goals. To make it shorter, study Stata’s very basic functionality and then google specific questions.
  3. Varian, “Mathematica for Economists.” Written 20 years ago. Mathematica has become more powerful since then. See their tutorials.

New Data Sources

The most general source is the Internet itself. Scraping info from websites sometimes requires permission (see the website’s terms of use and robots.txt).

Some websites have APIs, which send data in structured formats but limit the number of requests. Site owners may alter the limit by agreement. When the website has no API, Kimono and similar tools extract structured data from webpages. When they can’t, BeautifulSoup and similar parsers can.
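
A minimal sketch of both steps, checking robots.txt and parsing a page, using only the Python standard library. The html.parser module stands in for BeautifulSoup here so that the example runs without extra packages; the robots.txt rules, the user agent name, the URLs, and the page fragment are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="research-bot"):
    """Return True if the rules let this (hypothetical) agent fetch the URL."""
    return rules.can_fetch(agent, url)

class TableParser(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Hypothetical page fragment standing in for a downloaded webpage.
PAGE = ("<table><tr><th>Series</th><th>Value</th></tr>"
        "<tr><td>gdp</td><td>1.5</td></tr></table>")

parser = TableParser()
parser.feed(PAGE)

print(allowed("https://example.com/data/rates.html"))
print(allowed("https://example.com/private/users.html"))
print(parser.rows)
```

Passing robots.txt is not the same as having permission: the terms of use still apply, and robots.txt only states what the site asks crawlers to skip.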

Other sources include industrial software, custom data collection systems (like surveys on Amazon Mechanical Turk), and physical media. Text recognition systems require little manual labor, so digitizing analog sources is easy now.

Socrata, Quandl, and FRED2 maintain the most comprehensive collections of public datasets. But the universe is much bigger, and exotic data hides elsewhere.

  1. Varian, “Big Data.”
  2. Glaeser et al., “Big Data and Big Cities.”
  3. Athey and Imbens, “Big Data and Economics, Big Data and Economies.”
  4. National Academy of Sciences, Drawing Causal Inference from Big Data [videos]
  5. StackExchange, Open Data. A website for data requests.

One Programming Language

A general purpose programming language can manage data that comes in peculiar formats or requires cleaning.

Use Python by default. Its packages also replicate core functionality of Stata, Matlab, and Mathematica. Other packages handle GIS, NLP, visual, and audio data.

Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation. Use pip for package management.

Python is slow compared to other popular languages, but certain tweaks make it fast enough to avoid learning other languages, like Julia or Java. Generally, execution time is not an issue. Computation keeps getting cheaper (Moore’s Law), while the coder’s time gets more expensive.

Command line interfaces make massive operations on files easier. For Macs and other *nix systems, learn bash. For Windows, see cmd.exe.

  1. Kevin Sheppard, “Introduction to Python for Econometrics, Statistics and Data Analysis.”
  2. McKinney, Python for Data Analysis. [free demo code from the book]
  3. Sargent and Stachurski, “Quantitative Economics with Python.” The major project using Python and Julia in economics. Check their lectures, use cases, and open source library.
  4. Gentzkow and Shapiro, “What Drives Media Slant?” Natural language processing in media economics.
  5. Dell, “GIS Analysis for Applied Economists.” Use of Python for GIS data. Outdated in technical details, but demonstrates the approach.
  6. Dell, “Trafficking Networks and the Mexican Drug War.” Also see other works in economic geography by Dell.
  7. Repository awesome-python. Best practices.

Version Control and Repository

Version control tracks changes in files. It includes:

  • showing changes made in text files: for taking control over multiple revisions
  • reverting and accepting changes: for reviewing contributions by coauthors
  • support for multiple branches: for tracking versions for different seminars and data sources
  • synchronizing changes across computers: for collaboration and remote processing
  • forking: for other researchers to replicate and extend your work
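
As an illustration of the first item, showing changes between revisions of a text file, here is a minimal sketch using Python's standard difflib module. It is not Git itself, only the same idea in miniature; the file names and contents are made up.

```python
import difflib

# Two hypothetical revisions of the same paragraph of a paper.
old = [
    "We estimate the model on 2010 data.\n",
    "The effect is large.\n",
]
new = [
    "We estimate the model on 2010-2015 data.\n",
    "The effect is large.\n",
    "Results are robust to state fixed effects.\n",
]

# unified_diff yields header lines plus "+", "-" and context lines.
diff = list(difflib.unified_diff(old, new,
                                 fromfile="draft_v1.tex",
                                 tofile="draft_v2.tex"))
print("".join(diff))

# Count changed lines, skipping the "+++" and "---" file headers.
added = [l for l in diff if l.startswith("+") and not l.startswith("+++")]
removed = [l for l in diff if l.startswith("-") and not l.startswith("---")]
print(f"{len(added)} lines added, {len(removed)} lines removed")
```

A version control system stores such differences for every revision, which is what makes reverting, reviewing, and branching cheap.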

Version control by Git is a de facto standard. GitHub is the largest service that maintains Git repositories. It offers free storage for open projects and paid storage for private repositories.



A GitHub repository is a one-click solution for both code and data. No problems with university servers, relocated personal pages, or sending large files via email.

When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, or torrents.


Jupyter notebooks combine text, code, and output on the same page. See examples:

  1. QuantEcon’s notebooks.
  2. Repository of data-science-ipython-notebooks. Machine learning applications.

Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.

Remote Server

Remote servers store large datasets in memory. They do numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. These things save time.

If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay for storage and processor power, so exploratory analysis goes quickly.

A typical workflow with version control:

  1. Creating a Git repository
  2. Taking a small sample of data
  3. Coding and debugging research on a local computer
  4. Executing an instance on a remote server
  5. Syncing the code between two locations via Git
  6. Running the code on the full sample on the server
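
Step 2 of the workflow, taking a small sample, can be sketched as follows. Reservoir sampling is one possible approach, not something the workflow prescribes; it reads the file once and never holds the full dataset in memory. The file name, columns, and sample size are hypothetical.

```python
import random
import tempfile
from pathlib import Path

def sample_lines(path, k, seed=0):
    """Reservoir-sample k data lines from a text file, keeping the header."""
    rng = random.Random(seed)
    with open(path) as f:
        header = next(f)
        reservoir = []
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = line
    return header, reservoir

# Build a stand-in "full" dataset so the sketch runs on its own.
full_path = Path(tempfile.mkdtemp()) / "full_sample.csv"
with open(full_path, "w") as f:
    f.write("id,value\n")
    for i in range(10_000):
        f.write(f"{i},{i * 2}\n")

header, rows = sample_lines(full_path, k=100)
print(header.strip(), len(rows))
```

Fixing the seed keeps the small sample identical on the local computer and on the server, so results from debugging runs stay comparable.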

Some services allow writing code in a browser and running it right on their servers.

  1. EC2 AMI for scientific computing in Python and R. Read the last paragraph first.
  2. Amazon, Scientific Computing Using Spot Instances
  3. Google, Datalab

Real-time Applications

Real-time analysis requires optimization for performance. I exemplify with industrial applications:

  1. Jordan, On Computational Thinking, Inferential Thinking and Big Data. A general talk about getting better results faster.
  2. Google, Economics and Electronic Commerce research
  3. Microsoft, Economics and Computation research

The Map

A map for learning new data technologies by Swami Chandrasekaran:



Machine Learning for Economists: An Introduction

A crash course for economists who would like to learn machine learning.

Why should economists bother at all? Machine learning (ML) generally outperforms econometrics in prediction. That is why ML is becoming more popular in operations, where econometrics’ advantage in tractability is less valuable. So it’s worth knowing both and choosing the approach that suits your goals best.

An Introduction

These articles have been written by economists for economists. Other readers may not appreciate constant references to economic analysis and should start from the next section.

  1. Athey, Susan, and Guido Imbens. “NBER Lectures on Machine Learning,” 2015. A shortcut from econometrics to machine learning. Key principles and algorithms. Comparative performance of ML.
  2. Varian, “Big Data: New Tricks for Econometrics.” Some ML algorithms and new sources of data.
  3. Einav and Levin, “The Data Revolution and Economic Analysis.” Mostly about new data.


Practical applications get little publicity, especially if they are successful. But these materials do give an impression of what the field is about.


  1. Bloomberg and Flowers, “NYC Analytics.” NYC Mayor’s Office of Data Analysis describes their data management system and improvements in operations.
  2. UK Government, Tax Agent Segmentation.
  3. Applications. Some are ML-based.
  4. StackExchange, Applications.

Governments use ML sparingly. Developers emphasize open data more than algorithms.


  1. Kaggle, Data Science Use cases. An outline of business applications. Few companies have the data to implement these things.
  2. Kaggle, Competitions. (Make sure you choose “All Competitions” and then “Completed”.) Each competition has a leaderboard. When users publish their solutions on GitHub, you can find links to these solutions on the leaderboard.

Industrial solutions are more powerful and complex than these examples, but they are not publicly available. Data-driven companies post some details about this work in their blogs.

Emerging applications

Various prediction and classification problems. For ML research, see the last section.

  1. Stanford’s CS229 Course, Student projects. See “Recent years’ projects.” Hundreds of short papers.
  2. CMU ML Department, Student projects. More advanced problems, compared to CS229.


A tree of ML algorithms:


Econometricians may check the math behind the algorithms and find it familiar. Mathematical background:

  1. Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning. Standard reference. More formal approach. [free copy]
  2. James et al., An Introduction to Statistical Learning. Another standard reference by the same authors. More practical approach with coding. [free copy]
  3. Kaggle, Metrics. ML problems are all about minimizing prediction errors. These are various definitions of errors.
  4. (optional) Mitchell, Machine Learning. Close to Hastie, Tibshirani, and Friedman.
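
To make the idea of "various definitions of errors" concrete, here is a minimal sketch of three common ones: root mean squared error, mean absolute error, and binary log loss. Which metric applies depends on the problem; the sample values below are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss for predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 and 1
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical outcomes and predictions.
y = [3.0, 5.0, 2.0]
y_hat = [2.0, 5.0, 4.0]
labels = [1, 0]
probs = [0.5, 0.5]

print(f"RMSE: {rmse(y, y_hat):.3f}")
print(f"MAE: {mae(y, y_hat):.3f}")
print(f"Log loss: {log_loss(labels, probs):.3f}")
```

The same predictions rank differently under different metrics: RMSE is dominated by the single miss of 2, while MAE weighs it only twice as much as the miss of 1.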

For what makes ML different from econometrics, see chapters “Model Assessment and Selection” and “Model Inference and Averaging” in The Elements.
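
A minimal sketch of the model-assessment idea: models are judged on data they did not see. The data-generating process, sample sizes, and the choice of a 1-nearest-neighbor model as the flexible alternative are assumptions for illustration only.

```python
import random

random.seed(1)

def make_data(n):
    """Hypothetical linear process: y = 2x + standard normal noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_ols(xs, ys):
    """Simple linear regression with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    """Predict the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

results = {}
for name, fit in [("OLS", fit_ols), ("1-NN", fit_1nn)]:
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name}: in-sample MSE {results[name][0]:.2f}, "
          f"out-of-sample MSE {results[name][1]:.2f}")
```

The nearest-neighbor model reproduces the training data exactly, so in-sample fit says nothing about it; only the held-out error does. Because the process here is linear by assumption, the regression is expected to do well out of sample, which will not hold for every dataset.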

Handy cheat sheets by KDnuggets, Microsoft, and Emanuel Ferm. Also this guideline:


Software and Hardware

Stata does not support many ML algorithms. Its counterpart in the ML community is R. R is a language, so you’ll need more tools to make it work:

  1. RStudio. A standard coding environment. Similar to Stata.
  2. CRAN packages for ML.
  3. James et al., An Introduction to Statistical Learning. This text introduces readers to R. Again, it is available for free.

Python is the closest alternative to R. Packages “scikit-learn” and “statsmodels” do ML in Python.

If your datasets and computations get heavier, you can run code on virtual servers by Google and Amazon. They have ML-ready instances that execute code faster. It takes a few minutes to set up one.


I limited this survey to economic applications. Other applications of ML include computer vision, speech recognition, and artificial intelligence.

The advantage of ML approaches (like neural networks and random forest) over econometrics (linear and logistic regressions) is substantial in these non-economic applications.

Economic systems often have linear properties, so ML is less impressive here. Nonetheless, it does predict things better, and more practical solutions get built the ML way.

Research in Machine Learning

  1. arXiv, Machine Learning. Drafts of important papers appear here first. Then they get published in journals.
  2. CS journals. Applied ML research also appears in engineering journals.
  3. CS departments. For example: CMU ML Department, PhD dissertations.

Consistency in Data Science

I took Kaggle competitions to measure internal validity in data science. Validity is an issue because it’s easy to get good predictions with a long feature list. Researchers know about this problem, managers don’t. But managers have the data and researchers don’t. Fortunately, managers release it into the wild sometimes, like for those Kaggle competitions. So let’s see whether predictions based on this data remain consistent.

Kaggle has a handy rule for detecting overfitting:

Kaggle competitions are decided by your model’s performance on a test data set. Kaggle has the answers for this data set, but withholds them to compare with your predictions. Your Public score is what you receive back upon each submission (that score is calculated using a statistical evaluation metric, which is always described on the Evaluation page). BUT: Your Public Score is being determined from only a fraction of the test data set — usually between 25-33%. This is the Public Leaderboard, and it shows some relative performance during the competition.

When the competition ends, we take your selected submissions (see below) and score your predictions against the REMAINING FRACTION of the test set, or the private portion. You never receive ongoing feedback about your score on this portion, so it is the Private leaderboard. Final competition results are based on the Private leaderboard, and the Winner is the person(s) at the top of the Private Leaderboard.

Teams can’t win by submitting a lucky model that did well on the public set, for example by building a million models with different parameters and then choosing the best fit. Instead, consistent solutions must perform well on both the public and private sets. This is the validity that makes a model useful.
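To see why a lucky model doesn't survive the private set, here is a minimal simulation. It is only a sketch under made-up assumptions: every "model" guesses binary labels at random (so its true accuracy is 0.5), and the split sizes and the function name `simulate_selection` are hypothetical. It picks the model with the best public score and then looks at the same model's private score.

```python
import random


def simulate_selection(n_models=1000, n_public=300, n_private=700, seed=0):
    """Pick the best of many skill-less models using the public split only.

    Every model guesses labels at random, so any public score above 0.5
    is pure luck. Returns (public accuracy, private accuracy) of the model
    that looked best on the public split.
    """
    rng = random.Random(seed)
    labels = [rng.getrandbits(1) for _ in range(n_public + n_private)]
    best_public, best_private = -1.0, None
    for _ in range(n_models):
        preds = [rng.getrandbits(1) for _ in labels]
        hits = [int(p == y) for p, y in zip(preds, labels)]
        public = sum(hits[:n_public]) / n_public
        private = sum(hits[n_public:]) / n_private
        if public > best_public:
            best_public, best_private = public, private
    return best_public, best_private


public, private = simulate_selection()
print(f"best public accuracy:         {public:.3f}")
print(f"same model, private accuracy: {private:.3f}")
```

The selected model looks skilled on the public split only because it was selected there. On the private split it falls back toward the 0.5 that random guessing deserves.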

I scraped both the public and private leaderboards from 165 competitions. Correlation between the public and private scores for popular competitions:


Perfectly consistent solutions would have similar scores on both public (horizontal axis) and private (vertical axis) leaderboards. We would see a straight line. It’s not very straight in some competitions. Points moving away from the diagonals say that solutions don’t digest the new data well and their predictive power is declining.

Correlation for places:


This plot illustrates individual skill. When a data scientist gets a high score by luck, he won’t retain the position on the private leaderboard. Otherwise, he retains the position, and if others do well too, they form a straight line. Again, we don’t see the straight line in some cases.
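One way to put a number on "how straight is the line" for places is a rank correlation. The sketch below computes Spearman's coefficient with the standard library only. The two lists of places are invented for illustration and are not taken from the scraped leaderboards.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical places of six teams on the two leaderboards.
public_places = [1, 2, 3, 4, 5, 6]
private_places = [2, 1, 3, 6, 4, 5]

rho = spearman(public_places, private_places)
print(f"Spearman correlation of places: {rho:.3f}")
```

A value near 1 means teams keep their places when the data changes. Values well below 1 correspond to the scattered clouds in the plots.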

What are those “some cases”? One is “restaurant revenue prediction”: predicting revenues for restaurants given geography and demographics. That’s a typical business problem in the sense that the company has few observations and many determinants. Data analysis can’t help here until the company gets data on thousands of other restaurants. McDonald’s or Starbucks can get more; smaller chains can’t.

The “Analytics Edge” competition, which is homework for MIT’s course and asks students to predict successful blog posts, also suffers from too many factors affecting the outcome.

Sometimes limitations exist by design. Kaggle is running a stock price prediction competition now, but the suggested data can’t do the job. Algorithmic trading relies on handpicked cases with unique data models, and the competition offers just the opposite.

How the same data scientists perform across different competitions:


Yes, we should find more straight lines, but they are not here. Instead, there are dense spots around the bottom left corners. Those are teams that broke into the top 100 on many occasions. They sort of did well without domain knowledge. However, when detected, experts did very well, as in this competition sponsored by an Internet search company.

Many problems remain unfriendly to quants, so solutions may be valid but not powerful. More information can fix this, but other approaches often take over. For example, insiders remain the best investors in the restaurant business. A person runs a local restaurant for ten years. He knows the competitors, prices, costs, margins, clients. Of course, he is a better investor than the chain owner, even if the chain owner has a formal model. Markets work well here and centralized analysis doesn’t.

Research Is as Good as Its Reproducibility

Complex systems happen to have probabilistic, rather than deterministic, properties, and this fact made social sciences look deficient next to the real hard sciences (as if hard sciences predicted weather or earthquakes better than economics predicts financial crises).

What’s the difference? When today’s results differ from yesterday’s results, it’s not because authors get science wrong. In most cases, these authors just study slightly different contexts and may obtain seemingly contradictory results. Still, to benefit from generalization, it’s easier to take “slightly different” as “the same” and treat the result as a random variable.

In this case, “contradictions” get resolved surprisingly simply: by replicating the experiment and collecting more data. In the end, you have a distribution of the impact over studies, not simply of the impact within a single experiment.
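As a rough illustration of treating the result as a random variable, the sketch below pools several replications with a standard inverse-variance (fixed-effect) average and also reports the spread of estimates across studies. This is a textbook technique, not necessarily what the studies cited here use, and the effect sizes and standard errors are invented.

```python
import math
import statistics


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted mean of study estimates and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical effect sizes (in standard deviations) from four replications,
# each with the same hypothetical standard error.
estimates = [0.30, -0.05, 0.12, 0.20]
std_errors = [0.10, 0.10, 0.10, 0.10]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
spread = statistics.stdev(estimates)

print(f"pooled effect: {pooled:.4f} (standard error {pooled_se:.4f})")
print(f"spread of estimates across studies: {spread:.4f}")
```

In this made-up example, no single study tells the story: one estimate is negative, one is twice the pooled value, and the spread across studies is larger than the pooled standard error.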

Schoenfeld and Ioannidis show the dispersion of results in cancer research (“Is everything we eat associated with cancer?”, 2012):


Each point indicates a single study that estimates how much a given ingredient may contribute to getting cancer. The bad news: onion is more useful than bacon. The good news: we can say that a single estimate is never enough. A single study is not systematic, even after a peer review.

The recent attempt to reproduce 100 major studies in psychology confirms the divergence: “A large portion of replications produced weaker evidence for the original findings.” In this case, they also found a bias in reporting.

Economics also has reported effects varying across papers. By Eva Vivalt (2014):


This chart reports how conditional cash transfers affect different outcomes, measured in standard deviations. Cash transfers exemplify the rule: The impact is often absent, otherwise it varies (sometimes for the worse). For more, check this:

  • AidGrade: Programs by outcomes. A curated collection of popular public programs with their impact compared across programs and outcomes.
  • Social Science Registry. Registering a randomized trial in advance reduces the positive effect bias in publications and saves data-mining efforts by economists when nothing interesting comes out of the economist’s Plan A.

The dispersion of the impact is not a unique feature of randomized trials. Different estimates from similar papers appear elsewhere in economics. It’s most evident in literature surveys, especially those with nice summary tables: Xu, “The Role Of Law In Economic Growth”; Olken and Pande, “Corruption in Developing Countries”; DellaVigna and Gentzkow, “Persuasion.”

The problem, of course, is that evidence is only as good as its reproducibility. And reproducibility requires data on demand. But how many authors can claim that their results can be replicated? A useful classification by Levitt and List (2009):


Naturally-occurring data occurs naturally, so we cannot replicate it at will. A lot of highly cited papers rely on the naturally occurring data from the right-hand side methods. That’s, in fact, the secret. When an author finds a nice natural experiment that escapes accusations of endogeneity, his paper becomes an authority on the subject. (Either because natural experiments happen rarely and competing papers aren’t appearing, or because the identification looks so elegant that the readers fall in love with the paper.) But this experiment is only one point on the scale. It doesn’t become reliable just because we don’t know where the other points would be.

The work based on controlled data gets less attention, but this work gives a systematic account of causal relationships. Moreover, these papers cover treatments of a practical sort: well-defined actions that NGOs and governments can implement. This seamless connection lifts a big burden, since taping “naturally occurring” evidence to policies adds another layer of distrust between policy makers and researchers. For example, try to connect this list of references in labor economics to government policies.

Though to many researchers “practical” is an obscene word (and I don’t emphasize this quality), reproducible results are a scientific issue. What do reproducible results need? More cooperation, simpler inquiries, and less reliance on chance. More on this is coming.

Working Hours and Productivity in the United States

When East Asian countries grew at record rates, some articles attributed this to factor accumulation (e.g., Krugman 1994). Indeed, Japan and South Korea reinvested a lot of their output and also benefited from the growing working-age population. The data showed that factor accumulation actually went along with productivity growth, so these economies did have “genuine” improvements in the end.

Now, twenty years later, the same can be said about the United States. But this time, instead of capital, labor input drives economic growth. In 1950, the countries that would be called G7 looked this way (all data from PWT8, OECD):


US workers had relatively short working hours and much more equipment than their colleagues in other countries. In 2010, the picture looks different:


Hours declined rapidly in all countries but the United States. To feel the difference:


With the typical disclaimer about comparing hours across economies, I’d rather emphasize the dynamics of changes, instead of comparing countries directly. The growth paths for regional leaders:


These lines just smooth annual observations over 1950–2011. I also added GDP per worker under the markers.

Overall, while German firms have cut hours by 40% since 1950, US firms have cut them by only 10%. Working hours stopped declining in the US around 1980 (perhaps to offset stagnating real incomes). Regardless of which counterfactual you like more (the US trend before 1980 or Germany’s), it implies a substantial difference in output, fueled by labor input, just as capital input helped East Asian economies decades ago.
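A back-of-the-envelope version of the German counterfactual is sketched below. It rests on a strong assumption: output per hour and employment stay exactly as they are, so output scales one-for-one with hours. Shorter hours could well change hourly productivity, so read the number as an upper bound on the pure labor-input effect, not as an estimate.

```python
def counterfactual_output_ratio(actual_cut, counterfactual_cut):
    """Counterfactual output relative to actual output.

    Assumes output per hour and employment are unchanged, so output is
    proportional to hours worked. Cuts are fractions of the 1950 hours.
    """
    return (1 - counterfactual_cut) / (1 - actual_cut)


# The US cut hours by about 10%; the counterfactual is Germany's 40% cut.
ratio = counterfactual_output_ratio(actual_cut=0.10, counterfactual_cut=0.40)
print(f"counterfactual output / actual output: {ratio:.3f}")
```

Under these assumptions, a US that followed Germany's path would work 0.6 / 0.9 of its actual hours, that is, about two thirds, and would produce about two thirds of its actual output per worker.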

Software as an Institution

The rules of the game, known to economists as institutions and to managers as corporate culture, usually entail inoperable ideas. That is, any country or business has some rules, but these rules coincide neither with optimal rules nor with leadership vision. Maybe with an exception of the top decile of performers or something like this.

This inoperability isn’t surprising since the rules have obscure formulations. Douglass North and his devotees did the most to narrow down what “good institutions” are, but along with North’s bird’s-eye view, you also need an ant’s-eye view of how changes happen.

An insider perspective has been there all along, of course. Organizational psychology and operations management have systematized many informalities happening in firms. In general, we do know something about what managers should and shouldn’t do. Still, many findings aren’t as robust as we’d like them to be. There’s also a communication problem between researchers and practitioners, meaning neither of the two cares what the other is doing.

These three problems (formulation, coverage, and communication of effective rules) have an unexpected solution in software. How come? Software defines the rules.

Perhaps Excel doesn’t create such an impression, but social networks illustrate this case best. After the 90s, software engineers and designers became more involved in the social aspects of their products. Twitter made public communications shorter and arguably more efficient. In contrast to the anonymous communities of the early 2000s, Facebook insisted on real identities and a secure environment. Instagram and Pinterest focused users on sharing images. All major social networks introduced upvotes and shares for content ranking.

Governance in online communities can explain the success of StackExchange and Quora in the Q&A space, where Google and Amazon failed. Like Wikipedia, these services combined successful incentive mechanisms with community-led monitoring. This monitoring helped deal with the low-quality content that would have dominated if these services had simply grown the user base, as previous contenders tried.

Wikipedia has 120,000 active editors, about twice as many people as Google employs (or, alternatively, twelve Facebooks). And here are the users under the jurisdiction of major social networks:


So software defines the rules that several billion people follow daily. But unlike soft institutions, the rules engraved in code are very precise, much more so than institutional ratings for countries or corporate culture leaflets for employees. Code-based rules also imply enforcement (“fill in all fields marked with ‘*’”). That’s one less big issue.
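The “fields marked with ‘*’” rule is easy to show in code. In this toy sketch the field names and the functions `missing_required` and `submit` are hypothetical. The rule is stated in one line and enforced on every submission, with no leaflet required.

```python
# Hypothetical form: these are the fields marked with '*'.
REQUIRED_FIELDS = ("name", "email")


def missing_required(form):
    """Return the required fields that are absent or blank in the form."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(form.get(field, "")).strip()
    ]


def submit(form):
    """Accept the form only if the rule holds; otherwise refuse it."""
    missing = missing_required(form)
    if missing:
        raise ValueError("fill in all required fields: " + ", ".join(missing))
    return "accepted"


print(submit({"name": "Ann", "email": "ann@example.com"}))
print(missing_required({"name": "Ann", "email": "  "}))
```

The rule and its enforcement are the same object here, which is the contrast with soft institutions.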

Software captures the data related to the impact of rules on performance. For example, Khan Academy extensively uses performance tracking to design the exercises that students are more likely to complete — something that schools with all the experienced teachers do mostly through compulsion.

Finally, communication between researchers and practitioners becomes less relevant because critical decisions get made at the R&D stage. Researchers don’t have to annoy managers in trenches because software already contains the best practices. Like at that employed algorithms to grant its employees access privileges based on the past performance.

These advantages make effective reproducible institutions available to communities and businesses. That is, no more obscure books, reports, and blog posts about best practices and good institutions. Just a product that does specific things, backed by robust research.

What would that be? SaaI: software as an institution?