Software for Researchers: New Data and Applications

The tools mentioned here help manage reproducible research and handle new types of data. Why go after new data? Because new data provides new insights. For example, the recent Clark Medal winners used unconventional data in their major works. That data was large and unstructured, so Excel, Word, and email couldn't do the job.

I write for economists, but other social scientists may also find these recommendations useful. These tools have a steep learning curve, but they pay off over time. Some improve small-data analysis as well, though most gains come from new sources and real-time analysis.

Each section ends with a recommended reading list.

Standard Tools

LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage references. LyX supports Zotero via another plugin.

Stata and Matlab handle numerical computations. Both are paid but have good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.

Mathematica does symbolic computations. Sage is a free alternative.

  1. Frain, “Applied LaTeX for Economists, Social Scientists and Others.” Or a shorter intro to LaTeX by another author.
  2. UCLA, Stata Tutorial. This tutorial fits the economist’s goals. To make it shorter, study Stata’s very basic functionality and then google specific questions.
  3. Varian, “Mathematica for Economists.” Written 20 years ago. Mathematica has become much more powerful since then; see the official tutorials.

New Data Sources

The most general source is the Internet itself. Scraping information from websites sometimes requires permission (check the website’s terms of use and robots.txt).

Some websites have APIs, which send data in structured formats but limit the number of requests. Site owners may raise the limit by agreement. When a website has no API, Kimono and Import.io extract structured data from its pages. When they can’t, BeautifulSoup and similar parsers can.
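
A minimal scraping sketch with requests and BeautifulSoup, assuming a placeholder URL and a simple HTML table (a real site’s terms of use still apply):

```python
# Fetch a page and pull text out of its table rows.
# The URL and the HTML structure are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/statistics"  # placeholder URL
response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows[:5])  # first few parsed rows
```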

Other sources include industrial software, custom data collection systems (like surveys on Amazon Mechanical Turk), and physical media. Text recognition systems require little manual labor, so digitizing analog sources has become easy.
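
For digitizing scans, a hedged sketch using the pytesseract wrapper around the Tesseract OCR engine (both must be installed; the file name is a placeholder):

```python
# OCR sketch: extract text from a scanned page.
# Assumes Tesseract plus the pytesseract and Pillow packages are installed;
# "scan.png" is a placeholder file name.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scan.png"))
print(text[:500])  # preview the recognized text
```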

Socrata, data.gov, Quandl, and FRED maintain the most comprehensive collections of public datasets. But the universe is much bigger, and more exotic data hides elsewhere.
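
Many of these sources have Python clients. For example, a small sketch with pandas-datareader pulling a series from FRED (the series code and date range are only illustrative):

```python
# Download a macroeconomic series from FRED with pandas-datareader.
# The series code (real GDP) and the date range are illustrative choices.
from pandas_datareader import data as pdr

gdp = pdr.DataReader("GDPC1", "fred", start="2000-01-01", end="2015-12-31")
print(gdp.tail())
```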

  1. Varian, “Big Data.”
  2. Glaeser et al., “Big Data and Big Cities.”
  3. Athey and Imbens, “Big Data and Economics, Big Data and Economies.”
  4. National Academy of Sciences, Drawing Causal Inference from Big Data [videos]
  5. StackExchange, Open Data. A website for data requests.

One Programming Language

A general purpose programming language can manage data that comes in peculiar formats or requires cleaning.

Use Python by default. Its packages replicate the core functionality of Stata, Matlab, and Mathematica, and other packages handle GIS, NLP, image, and audio data.
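
For instance, pandas and statsmodels reproduce a basic Stata regression; the sketch below runs on simulated data:

```python
# OLS regression in Python, analogous to Stata's "regress y x".
# The data are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=500)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(size=500)

model = smf.ols("y ~ x", data=df).fit()
print(model.summary())
```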

Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation. Use pip for package management.

Python is slow compared to other popular languages, but certain tweaks make it fast enough to spare you from learning faster languages like Julia or Java. Generally, execution time is not an issue: computing gets cheaper every year (Moore’s Law), while the coder’s time gets more expensive.
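
One such tweak is replacing explicit Python loops with vectorized NumPy operations. A toy comparison (timings depend on the machine):

```python
# The same sum of squares computed with a Python loop and with NumPy.
import time
import numpy as np

values = np.random.rand(1_000_000)

start = time.perf_counter()
total_loop = sum(v * v for v in values)        # interpreted loop: slow
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_vec = float(np.dot(values, values))      # compiled loop inside NumPy: fast
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.5f}s")
```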

Command-line interfaces make bulk operations on files easier. For Macs and other *nix systems, learn bash. For Windows, see cmd.exe.

  1. Kevin Sheppard, “Introduction to Python for Econometrics, Statistics and Data Analysis.”
  2. McKinney, Python for Data Analysis. [free demo code from the book]
  3. Sargent and Stachurski, “Quantitative Economics with Python.” The major project using Python and Julia in economics. Check their lectures, use cases, and open source library.
  4. Gentzkow and Shapiro, “What Drives Media Slant?” Natural language processing in media economics.
  5. Dell, “GIS Analysis for Applied Economists.” Use of Python for GIS data. Outdated in technical details, but demonstrates the approach.
  6. Dell, “Trafficking Networks and the Mexican Drug War.” Also see other works in economic geography by Dell.
  7. Repository awesome-python. Best practices.

Version Control and Repository

Version control tracks changes in files. It includes:

  • showing changes made in text files: for keeping track of multiple revisions
  • reverting and accepting changes: for reviewing contributions by coauthors
  • support for multiple branches: for tracking versions for different seminars and data sources
  • synchronizing changes across computers: for collaboration and remote processing
  • forking: for other researchers to replicate and extend your work

Git is the de facto standard for version control. GitHub.com is the largest service hosting Git repositories. It offers free storage for open projects and paid storage for private repositories.

Sharing

Storage

A GitHub repository is a one-click solution for both code and data. No problems with university servers, relocated personal pages, or sending large files via email.

When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.

Demonstration

Jupyter notebooks combine text, code, and output on the same page. See examples:

  1. QuantEcon’s notebooks.
  2. The data-science-ipython-notebooks repository. Machine learning applications.

Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.

Remote Server

Remote servers hold large datasets in memory and run numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. All of this saves time.

If campus servers have awkward limitations, third-party companies such as AWS and Google Cloud offer scalable alternatives. Users pay only for storage and processing power, so exploratory analysis stays cheap and fast.

A typical workflow with version control:

  1. Creating a Git repository
  2. Taking a small sample of the data (see the sketch after this list)
  3. Coding and debugging on a local computer
  4. Launching an instance on a remote server
  5. Syncing the code between two locations via Git
  6. Running the code on the full sample on the server
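
For step 2, a minimal sketch with pandas (file names and the sampling fraction are placeholders):

```python
# Draw a small, reproducible random sample of the full dataset for local debugging.
# File names and the 1% fraction are placeholders.
import pandas as pd

full = pd.read_csv("full_data.csv")
sample = full.sample(frac=0.01, random_state=42)
sample.to_csv("sample_data.csv", index=False)
```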

Some services allow writing code in a browser and running it right on their servers.

  1. EC2 AMI for scientific computing in Python and R. Read the last paragraph first.
  2. Amazon, Scientific Computing Using Spot Instances
  3. Google, Datalab

Real-time Applications

Real-time analysis requires optimizing for performance. A few examples from industry:

  1. Jordan, On Computational Thinking, Inferential Thinking and Big Data. A general talk about getting better results faster.
  2. Google, Economics and Electronic Commerce research
  3. Microsoft, Economics and Computation research

The Map

A map for learning new data technologies by Swami Chandrasekaran:

[Image: a metro-map-style roadmap of data science skills]
Source

 

One to n: Market Size, Not Innovations

In his popular Zero to One, Peter Thiel singles out original product development as the most important step for entrepreneurs to make. After that, “it’s easier to copy a model than to make something new. Doing what we already know how to do takes the world from 1 to n, adding more of something familiar.”

Of course, building a prototype is important. But it’s not the most important problem in the high-tech industry. More often, startups pass the zero-to-one step trivially. They fail at what comes next: going from one to n.

Right from the preface, Peter Thiel supports his thesis with the cases of Microsoft, Google, and Facebook. But these companies never went from zero to one. Their core products were invented and marketed by their predecessors. Unix was there ten years before Microsoft released DOS. AltaVista and Yahoo! preceded Google. LiveJournal had pioneered social networks five years before Mark Zuckerberg founded Facebook. Do a little research on any big company mentioned in the book’s index, and you’ll find someone else who went from zero to one before the big and famous.

Now, there’s an obvious merit in what Microsoft, Google, and Facebook did. Reaching billions of customers is more difficult than being a pioneer. However, this fundamentally changes the startup problem. Going from zero to one doesn’t make a great company. Going from one to n does.

And startups pay little attention to their one-to-n problem. Take the minimum: the product’s target market, the n itself. In their stylized business plans, founders routinely misestimate their n by several orders of magnitude. In one example, the developers of a healthy-lifestyle app equated the app’s market with all obesity-related spending, including things like liposuction. Naturally, the number was large, but it wasn’t their n.

Many founders sacrifice several years of their lives to ideas with overestimated ns. Back to Thiel’s examples: Microsoft, Google, and Facebook knew their huge ns before they grew big. Moreover, they purposefully increased their ns by simplifying their products along the way. In the end, every human being with Internet access turned out to be their potential (and often actual) customer.

What do other founders do instead? They see a monster like Microsoft and run away from competition into marginal niches. A marginal niche leaves them with a small n while requiring roughly the same several years of development. In fact, it’s cheaper to fail early with such a niche product, because if a modest project survives, it distracts its founders from bigger markets. The project ends up like a family restaurant: good people, nice place, but, alas, no growth.

How do you escape competition the right way? For example, by building a path to a big market from the start, as Y Combinator suggests when it welcomes a possible competitor to Google.

Here, Zero to One again may mislead if taken literally. The book’s emphasis on innovation and technology sidelines simple facts about successful companies. Successful companies are lazy innovators. In their early years, Microsoft, Google, and Facebook were too small to invest in serious innovations. They were built on simple technologies: Google ran on low-cost consumer hardware, and Facebook was a simple content management system written in PHP in a few weeks. Common-sense creativity, not fancy innovation, supported these companies. While their simple initial products remain critical to business performance, their graveyards of failed zero-to-one innovations keep growing (look at Google’s).

The path to a big market is perpendicular to innovation. In the innovation scenario, founders become scientists who dig into a single topic until the zero-to-one moment, like the very advanced DeepMind, which was virtually unknown before Google’s acquisition. In the big-market scenario, founders devote their attention to marketing: how to win new users and retain their loyalty. Often, this task is easier to accomplish with handwritten postcards to early adopters than by spending years teaching a computer to recognize cat videos. And it’s clearly not a single zero-to-one step, but many steps back and forth, with the foreseeable n in mind.