Software for Researchers: New Data and Applications

The tools mentioned here help manage reproducible research and handle new types of data. Why should you go after new data? Because new data brings new insights. For example, the recent Clark Medal winners used unconventional data in their major works. Those data arrived large and unstructured, so Excel, Word, and email wouldn’t do the job.

I write for economists, but other social scientists may also find these recommendations useful. These tools have a steep learning curve, but they pay off over time. Some improve small-data analysis as well, though most of the gains come from new sources and real-time analysis.

Each section ends with a recommended reading list.

Standard Tools

LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage references, and LyX supports Zotero via another plugin.

Stata and Matlab handle numerical computations. Both are paid products with good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.

Mathematica does symbolic computations. Sage is a free alternative.

  1. Frain, “Applied LaTeX for Economists, Social Scientists and Others.” Or a shorter introduction to LaTeX by another author.
  2. UCLA, Stata Tutorial. This tutorial fits an economist’s goals. To make it shorter, study Stata’s most basic functionality and then google specific questions.
  3. Varian, “Mathematica for Economists.” Written 20 years ago; Mathematica has become much more powerful since then. See the official tutorials.

New Data Sources

The most general source is the Internet itself. Scraping information from websites sometimes requires permission (see the website’s terms of use and robots.txt).

Some websites offer APIs, which return data in structured formats but limit the number of requests. Site owners may raise the limit by agreement. When a website has no API, Kimono and Import.io extract structured data from its pages. When they can’t, BeautifulSoup and similar parsers can.
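For illustration, here is a minimal scraping sketch in Python. The URL and the CSS selectors are hypothetical placeholders, and any real use should respect the site’s terms of use and robots.txt:

```python
# Minimal scraping sketch; the URL and HTML structure are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"  # placeholder URL
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for item in soup.select("div.listing"):  # placeholder CSS class
    title = item.select_one("h2").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append((title, price))

print(rows[:5])
```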

Other sources include industrial software, custom data collection systems (such as surveys on Amazon Mechanical Turk), and physical media. Text recognition systems now require little manual labor, so digitizing analog sources is easy.

Socrata, data.gov, Quandl, and FRED maintain the most comprehensive collections of public datasets. But the universe is much bigger, and exotic data hides elsewhere.
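For example, a few lines of Python pull a series from FRED, assuming the pandas-datareader package is installed (a sketch, not a recommendation of any particular series):

```python
# Download US GDP from FRED via pandas-datareader
# (pip install pandas-datareader).
from pandas_datareader import data as web

gdp = web.DataReader("GDP", "fred", start="2000-01-01")
print(gdp.tail())
```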

  1. Varian, “Big Data.”
  2. Glaeser et al., “Big Data and Big Cities.”
  3. Athey and Imbens, “Big Data and Economics, Big Data and Economies.”
  4. National Academy of Sciences, Drawing Causal Inference from Big Data [videos]
  5. StackExchange, Open Data. A website for data requests.

One Programming Language

A general-purpose programming language can manage data that comes in peculiar formats or requires cleaning.

Use Python by default. Its packages replicate the core functionality of Stata, Matlab, and Mathematica, and other packages handle GIS, NLP, visual, and audio data.
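A rough sketch of what this replication looks like: statsmodels for Stata-style regressions, NumPy for Matlab-style linear algebra, SymPy for Mathematica-style symbolic math (all on made-up data):

```python
# Stata-, Matlab-, and Mathematica-style tasks from Python packages.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import sympy as sp

# Regression on simulated data (cf. Stata's `reg y x`).
df = pd.DataFrame({"x": np.random.rand(100)})
df["y"] = 2 * df["x"] + 0.1 * np.random.randn(100)
print(smf.ols("y ~ x", data=df).fit().params)

# Linear algebra (cf. Matlab): solve Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))

# Symbolic math (cf. Mathematica): differentiate and integrate.
x = sp.Symbol("x")
print(sp.diff(sp.sin(x) * sp.exp(x), x))
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))
```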

Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation, with pip for package management.

Python is slow compared to other popular languages, but certain tweaks make it fast enough to spare you from learning other languages, such as Julia or Java. Generally, execution time is not an issue: computing gets cheaper every year (Moore’s Law), while the coder’s time gets more expensive.
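One such tweak is vectorization: pushing loops into NumPy’s compiled routines. A minimal, illustrative timing comparison:

```python
# Compare a pure-Python loop with a vectorized NumPy call.
import timeit

setup = "import numpy as np; x = np.random.rand(1_000_000)"
loop = "total = 0.0\nfor v in x:\n    total += v"
vectorized = "total = x.sum()"

print(timeit.timeit(loop, setup=setup, number=10))        # seconds
print(timeit.timeit(vectorized, setup=setup, number=10))  # much faster
```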

Command-line interfaces make massive operations on files easier. On Macs and other *nix systems, learn bash. On Windows, see cmd.exe.
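When a shell is not at hand, the same batch operations can be scripted with Python’s standard library; a sketch, with a placeholder folder name and prefix:

```python
# Batch-rename files, an alternative to a bash one-liner.
# The folder name "raw_data" and prefix "survey_" are placeholders.
from pathlib import Path

for path in sorted(Path("raw_data").glob("*.csv")):
    path.rename(path.with_name("survey_" + path.name))
```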

  1. Kevin Sheppard, “Introduction to Python for Econometrics, Statistics and Data Analysis.”
  2. McKinney, Python for Data Analysis. [free demo code from the book]
  3. Sargent and Stachurski, “Quantitative Economics with Python.” The major project using Python and Julia in economics. Check their lectures, use cases, and open source library.
  4. Gentzkow and Shapiro, “What Drives Media Slant?” Natural language processing in media economics.
  5. Dell, “GIS Analysis for Applied Economists.” Use of Python for GIS data. Outdated in technical details, but demonstrates the approach.
  6. Dell, “Trafficking Networks and the Mexican Drug War.” Also see other works in economic geography by Dell.
  7. The awesome-python repository. Best practices.

Version Control and Repository

Version control tracks changes in files. It includes:

  • showing changes made in text files: for keeping control over multiple revisions
  • reverting and accepting changes: for reviewing contributions by coauthors
  • support for multiple branches: for tracking versions for different seminars and data sources
  • synchronizing changes across computers: for collaboration and remote processing
  • forking: for other researchers to replicate and extend your work

Version control with Git is the de facto standard. GitHub.com is the largest service that hosts Git repositories. It offers free storage for open projects and paid storage for private repositories.

Sharing

Storage

A GitHub repository is a one-click solution for sharing both code and data: no problems with university servers, relocated personal pages, or sending large files via email.

When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.

Demonstration

Jupyter notebooks combine text, code, and output on the same page. See examples:

  1. QuantEcon’s notebooks.
  2. Repository of data-science-ipython-notebooks. Machine learning applications.

Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.

Remote Server

Remote servers store large datasets in memory and run numerical optimizations and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. These things save time.

If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay only for the storage and processing power they use, so exploratory analysis goes quickly.

A typical workflow with version control (a code sketch follows the list):

  1. Creating a Git repository
  2. Taking a small sample of data
  3. Coding and debugging research on a local computer
  4. Launching an instance on a remote server
  5. Syncing the code between two locations via Git
  6. Running the code on the full sample on the server
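A minimal sketch of how one script can serve both steps 3 and 6, assuming a hypothetical data.csv and a --sample flag for local debugging:

```python
# One script for local debugging and the full server run.
import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--sample", action="store_true",
                    help="use a small random sample for local debugging")
args = parser.parse_args()

df = pd.read_csv("data.csv")  # hypothetical dataset
if args.sample:
    df = df.sample(n=1000, random_state=0)  # debug on 1,000 rows

print(df.describe())  # the actual analysis goes here
```

Locally, you run the script with --sample; on the server, the same script runs on the full data without the flag.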

Some services allow writing code in a browser and running it right on their servers.

  1. EC2 AMI for scientific computing in Python and R. Read the last paragraph first.
  2. Amazon, Scientific Computing Using Spot Instances
  3. Google, Datalab

Real-time Applications

Real-time analysis requires optimization for performance. I illustrate with industrial applications:

  1. Jordan, On Computational Thinking, Inferential Thinking and Big Data. A general talk about getting better results faster.
  2. Google, Economics and Electronic Commerce research
  3. Microsoft, Economics and Computation research

The Map

A map for learning new data technologies by Swami Chandrasekaran:

[Image: Swami Chandrasekaran’s data-science skill map]

Software as an Institution

The rules of the game, known to economists as institutions and to managers as corporate culture, usually entail inoperable ideas. That is, any country or business has some rules, but these rules coincide neither with the optimal rules nor with the leadership’s vision, except perhaps in the top decile of performers.

This inoperability isn’t surprising, since the rules have obscure formulations. Douglass North and his followers did the most to pin down what “good institutions” are, but alongside North’s bird’s-eye view, you also need an ant’s-eye view of how changes happen.

An insider perspective has been there all along, of course. Organizational psychology and operations management have systematized many of the informal practices inside firms. In general, we do know something about what managers should and shouldn’t do. Still, many findings aren’t as robust as we’d like them to be. There’s also a communication problem between researchers and practitioners: neither side cares what the other is doing.

These three problems with effective rules, their formulation, coverage, and communication, have an unexpected solution in software. How come? Software defines the rules.

Perhaps Excel doesn’t create such an impression, but social networks illustrate the case best. Since the ’90s, software engineers and designers have become more involved in the social aspects of their products. Twitter made public communication shorter and arguably more efficient. In contrast to the anonymous communities of the early 2000s, Facebook insisted on real identities and a secure environment. Instagram and Pinterest focused users on sharing images. All major social networks introduced upvotes and shares for content ranking.

Governance in online communities can explain the success of StackExchange and Quora in the Q&A space, where Google and Amazon failed. Like Wikipedia, these services combined effective incentive mechanisms with community-led moderation. This moderation helped deal with the low-quality content that would have dominated had these services simply grown their user bases, as previous contenders tried.

Wikipedia has 120,000 active editors, which is about twice as many employees as Google has (or, alternatively, twelve Facebooks). And the user counts under the jurisdiction of the major social networks are larger still:

[Chart: user counts of the major social networks]

So software defines the rules that several billion people follow daily. But unlike soft institutions, the rules engraved in code are precise, much more so than institutional ratings for countries or corporate-culture leaflets for employees. Code-based rules also come with built-in enforcement (“fill in all fields marked with ‘*’”), which removes another big issue.

Software also captures data on how rules affect performance. For example, Khan Academy extensively uses performance tracking to design exercises that students are more likely to complete, something that schools, for all their experienced teachers, achieve mostly through compulsion.

Finally, communication between researchers and practitioners becomes less of an issue because critical decisions get made at the R&D stage. Researchers don’t have to pester managers in the trenches: the software already embodies the best practices. Amazon.com, for example, employed algorithms to grant employees access privileges based on past performance.

These advantages make effective reproducible institutions available to communities and businesses. That is, no more obscure books, reports, and blog posts about best practices and good institutions. Just a product that does specific things, backed by robust research.

What would that be? SaaI: software as an institution?

How Google Works: Unauthorized Edition

Over the years, Google has earned a reputation as a unique workplace that endlessly generates great innovations. This image of an engineering wonderland misses many important aspects of the company’s inner workings. You could expect Google’s management to be a bit more critical about it, but as Eric Schmidt’s new book How Google Works shows, it is not. The book reestablishes all the major stereotypes while paying little attention to the things that make up 91% of Google’s success.

Revenue: Auctions

The 91% is the share of revenue Google generates from advertising sold at the famous auctions that occur each time someone opens a webpage. While an auction is an efficient way to allocate limited resources such as ad space, these ad auctions squeeze advertisers’ pockets in favor of the seller, that is, Google and its affiliates.

In economic terms, auctions eliminate consumer surplus:

[Figures: supply-and-demand diagrams of consumer surplus under equilibrium pricing and under price discrimination (source: Wikipedia)]

That’s a “normal” market, where advertisers pay the equilibrium price. Instead, Google captures the entire surplus by selling ads in individual units, each at the maximum price an advertiser would pay. The supply curve is nearly flat in this case, and prices run along the demand curve. Technically, advertisers pay the second-highest price, a mechanism Google chose for stability (see the generalized second-price auction and the Vickrey auction), but under intense competition the difference between the first and second prices is small.
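A stylized sketch of the second-price rule (not Google’s production algorithm, which also weights bids by predicted click-through rates; the bids are made up):

```python
# Stylized second-price (Vickrey) auction:
# the highest bidder wins but pays the second-highest bid.
def second_price_auction(bids):
    """bids: dict mapping bidder -> bid. Returns (winner, price)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

bids = {"seller_a": 2.80, "seller_b": 2.72, "seller_c": 1.90}
print(second_price_auction(bids))  # ('seller_a', 2.72)
```

Note how close the winner’s payment is to its own bid when bidders compete intensely.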

How does it work in practice? Suppose you are looking for a bicycle and just google it. When your AdBlock is off, you see something like this:

[Screenshot: Google search results for “bicycle,” including ads]

Now, you click on “made-in-china.com,” buy whatever it sells, and have your bicycle delivered. Made-in-China.com pays Google about $2.72 for you coming through this link (you can find the prices for any search query in the Keyword Planner). The price is determined in the auction, in which many bicycle sellers automatically submit their bids with ad texts attached.

The precise auction algorithm is more complex than just taking the highest bid, because the highest bid may carry an ad that you won’t click on, and the opportunity would be wasted. Also, since conversion rates are well below 100%, Made-in-China.com has to pay this $2.72 several times before a real buyer comes along: at a 5% conversion rate, say, one buyer costs $2.72 / 0.05 = $54.40 to acquire. This raises the price of the bicycles the website sells. Some insurance-related ads cost north of $50 per click, all paid by insurance buyers in the end.

Though this mechanism would make no sense without the users attracted by Google’s great search engine, it takes the most out of customers and transfers it to Google.

Retention: Monopoly

How does Google Search attract users? Well, first, by showing them relevant results. That sounds more trivial now than it did ten years ago. Users now expect Amazon.com to be the first link for almost any consumer good, and Wikipedia for topics of general interest. These websites are considered the most relevant not because they’re the best in some objective sense, but, again, because of the particular technologies that made Google so successful.

Larry Page and Sergey Brin’s key contribution to their startup was the PageRank algorithm. PageRank is patented, but the underlying ideas are easy to find in graph theory: the more links point to your website, the higher it ranks in search results. When I google “PageRank,” Wikipedia’s article comes up on top. When I link to that article here, it becomes more likely that Wikipedia’s article will stay on top. As a side effect, linking to the first page of Google results creates a serious competitive advantage for top websites. For Wikipedia, this may be a plus, since more people concentrate on improving its pages. But strong positions in search results also secure Amazon.com’s monopoly in e-commerce.
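The core idea of PageRank fits in a few lines. A textbook power-iteration sketch on a made-up four-page link graph, with the standard 0.85 damping factor:

```python
# Textbook PageRank by power iteration on a toy link graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # page i -> pages it links to
n = len(links)

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

d = 0.85                      # standard damping factor
rank = np.ones(n) / n
for _ in range(100):          # power iteration
    rank = (1 - d) / n + d * M @ rank

print(rank)  # pages with more inbound links score higher
```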

Google’s search technologies are backed by intensive marketing efforts to eliminate competitors. Google paid Mozilla to keep Google as Firefox’s default search engine all along, until Yahoo! outbid it in 2015. Four years ago, Eric Schmidt testified at a Senate hearing about unfair competition practices, with Google’s search results allegedly biased in favor of its own services. The European Commission is investigating Google’s practices in Europe. In mobile markets, Google requires hardware manufacturers to install Google Mobile Services on all Android devices, so users follow their status quo bias and stay with Google everywhere.

There are more fascinating examples of Google protecting its market share. They’re missing from Eric Schmidt’s book, which gives all the credit to Google’s engineers and none to its lawyers and marketers.

Development: Privileges

When a typical business creates something, its managers look carefully after costs. They negotiate with suppliers, monitor quality, build complex supply networks, balance payments, and insure the company against price shocks. Google is the fifth-largest company in the world, but it’s mostly free of these headaches. Unlike Walmart, ExxonMobil, or Berkshire Hathaway, Google’s employees make things out of thin air and outsource routines, such as training the search engine, to third parties.

This ensures that even entry- and mid-level employees are extremely skillful. Not surprisingly, most Google legends concern its HR policies. These legends split into two categories: those that make sense and those that don’t.

The culture stuff is what makes no sense. This is easy to see in non-policies like granting 20% of work time to personal projects. Such a rule might mean something on a car assembly line, but this is software development. An engineer’s personal projects may take 50% of his time if he’s done his daily job, or zero otherwise; it depends on his ability to deliver the results expected for his salary. More importantly, his personal projects belong to Google if he ever edits his personal code on campus, even when he delivers his official projects on time.

The book also mentions the 70/20/10 rule: “70 percent of resources dedicated to the core business, 20 percent on emerging, and 10 percent on new.” Even if the authors could prove that this split is optimal, most other companies are so limited in resources that they have to put 100 percent into the core business.

Nor do the real things make Google’s culture different. Each employee should have a decent workplace, attention, and internal openness, but these are not sufficient for a great company. We are not in Ancient Greece: other companies also treat employees well, with little slavery around and decent meals. Google just tends to sit at the extreme.

Laszlo Bock, SVP of People Operations, has tried to dissuade the public from thinking that good HR policies require Google’s profit margins. In his opinion, you can get much out of people with openness and good treatment alone. His examples include telling employees about sales figures. It’s a rather detached example. Sales numbers aren’t always as cheerful as Google’s have been; there are ups and downs, and you have to learn how to communicate the downs to employees while keeping them optimistic.

A more sober view comes from less fortunate startups. Evan Williams of Blogger recalled the moment when the money ran out and the employees didn’t appreciate it: “Everybody left, and the next day, I was the only one who came in the office” (from Jessica Livingston’s Founders at Work, a good, balanced account of the early days at startups). It’s just one example showing that relationships with employees are not as trivial as Bock presents them.

If not culture, then what makes the difference? Quite trivially, privileged access to job candidates. First, money is not a constraint: Google easily outbids everyone else, and its entry-level wages surpass those of Wall Street firms, including major hedge funds like Bridgewater Associates and Renaissance Technologies. Second, Google has the right of first interview, which comes with its exceptional reputation, low-stress jobs, secure employment, ambitious goals, and the resources to implement big ideas.

So What?

How Google Works understates the company’s actual achievements. The book is all about the famous corporate rules, which make the business look simplistic. It’s not. The $360 bn business consists of hundreds of important details in each key operation, such as hiring, marketing, and sales.

Keeping these things together is an achievement of Eric Schmidt, Laszlo Bock, and other executives. However, Schmidt’s book should not mislead other entrepreneurs into thinking that the 20% rule creates great products or that reporting sales numbers to employees raises revenue better than ad auctions do. Google is a good role model for learning the hardcore IT business, but readers will have to wait for some other book to learn from this company.

One to n: Market Size, Not Innovations

In his popular Zero to One, Peter Thiel singles out original product development as the most important step for entrepreneurs to make. After that, “it’s easier to copy a model than to make something new. Doing what we already know how to do takes the world from 1 to n, adding more of something familiar.”

Of course, building a prototype is important. But it’s not the hardest problem in the hi-tech industry. More often, startups pass the zero-to-one step trivially. They fail at what comes next: going from one to n.

Right from the preface, Peter Thiel supports his thesis with the cases of Microsoft, Google, and Facebook. But these companies never went from zero to one. Their core products were invented and marketed by their predecessors. Unix was there ten years before Microsoft released DOS. AltaVista and Yahoo! preceded Google. LiveJournal had pioneered social networks five years before Mark Zuckerberg founded Facebook. Do a little research on any big company mentioned in the book’s index, and you’ll find someone else who went from zero to one before the big and famous.

Now, there’s an obvious merit in what Microsoft, Google, and Facebook did. Reaching billions of customers is more difficult than being a pioneer. But it fundamentally changes the startup problem. Going from zero to one doesn’t make a great company. Going from one to n does.

And startups pay little attention to their one-to-n problem. Take the minimum: the product’s target market, the n itself. In their stylized business plans, founders routinely misestimate their n by orders of magnitude. In one example, the developers of a healthy-lifestyle app equated the app’s market with all obesity-related spending, including things like liposuction. Naturally, the number was large, but it wasn’t their n.

Many founders sacrifice several years of their lives to ideas with overestimated ns. Back to Thiel’s examples: Microsoft, Google, and Facebook knew their huge ns before they grew big. Moreover, they purposefully increased their ns by simplifying their products along the way. In the end, every human being with Internet access turned out to be their potential (and often actual) customer.

What do other founders do instead? They see a monster like Microsoft and run away from competition into marginal niches. A marginal niche leaves them with a small n while requiring about the same several years of development. In fact, it’s cheaper to fail early with such a niche product, because if a modest project survives, it distracts its founders from bigger markets. The project functions like a family restaurant: good people, nice place, but, alas, no growth.

How do you escape competition the right way? For example, by building a path to a big market right from the start, as Y Combinator suggests when it welcomes possible competitors to Google.

Here, Zero to One may again mislead if taken literally. The book’s emphasis on innovation and technology sidelines simple facts about successful companies. Successful companies are lazy innovators. In their early years, Microsoft, Google, and Facebook were too small to invest in serious innovations. They were built on simple technologies: Google ran on low-cost consumer hardware, and Facebook was a simple content management system written in PHP in a few weeks. Common-sense creativity, not fancy innovation, supported these companies. While their simple initial products remain critical to business performance, their graveyard of failed zero-to-one innovations keeps growing (look at Google’s).

The path to a big market is perpendicular to innovation. In the innovation scenario, founders become scientists who dig at a single topic until the zero-to-one moment, like the very advanced DeepMind, which was virtually unknown before Google’s acquisition. In the big-market scenario, founders devote their attention to marketing, namely, how to win new users and retain their loyalty. Often, this task is better served by handwritten postcards to early adopters than by years of teaching a computer to recognize cat videos. And it’s clearly not a single zero-to-one step, but many steps back and forth, with the foreseeable n in mind.