Software for Researchers: New Data and Applications

The tools described here help manage reproducible research and handle new types of data. Why should you go after new data? New data provides new insights. For example, recent Clark Medal winners used unconventional data in their major works. These data were large and unstructured, so Excel, Word, and email wouldn’t do the job.

I write for economists, but other social scientists can also find these recommendations useful. These tools have a steep learning curve and pay off over time. Some improve small-data analysis as well, but most gains come from new sources and real-time analysis.

Each section ends with a recommended reading list.

Standard Tools

LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage references. LyX supports Zotero via another plugin.

Stata and Matlab do numerical computations. Both are paid and have good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.

Mathematica does symbolic computations. Sage is a free alternative.

  1. Frain, “Applied LaTeX for Economists, Social Scientists and Others.” Or a shorter intro to LaTeX by another author.
  2. UCLA, Stata Tutorial. This tutorial fits the economist’s goals. To make it shorter, study Stata’s very basic functionality and then google specific questions.
  3. Varian, “Mathematica for Economists.” Written 20 years ago. Mathematica has become more powerful since then. See their tutorials.

New Data Sources

The most general source is the Internet itself. Scraping information from websites sometimes requires permission (see the website’s terms of use and robots.txt).
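A minimal sketch of checking robots.txt before scraping, using Python's standard urllib.robotparser. The robots.txt content, the URLs, and the "research-bot" user agent are made up for illustration. A passing check says nothing about the site's terms of use, which still need to be read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at the site's /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed(robots_txt, url, user_agent="research-bot"):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "https://example.com/data/table.html"))
print(allowed(ROBOTS_TXT, "https://example.com/private/report.html"))
```

For a live site, RobotFileParser can also download the file itself via set_url() and read().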

Some websites have APIs, which send data in structured formats but limit the number of requests. Site owners may alter the limit by agreement. When the website has no API, Kimono and Import.io extract structured data from webpages. When they can’t, BeautifulSoup and similar parsers can.
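A rough sketch of what a parser does with a webpage. To keep it free of third-party packages, it uses Python's built-in html.parser rather than BeautifulSoup, so it shows the idea, not BeautifulSoup's own interface. The HTML fragment and the names TableParser and parse_table are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded webpage.
HTML = """
<table>
  <tr><th>country</th><th>gdp_growth</th></tr>
  <tr><td>Chile</td><td>2.1</td></tr>
  <tr><td>Norway</td><td>1.6</td></tr>
</table>
"""


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


print(parse_table(HTML))
```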

Other sources include industrial software, custom data collection systems (like surveys on Amazon Mechanical Turk), and physical media. Text recognition systems require little manual labor, so digitizing analog sources is easy now.

Socrata, data.gov, Quandl, and FRED2 maintain the most comprehensive collections of public datasets. But the universe is much bigger, and exotic data hides elsewhere.

  1. Varian, “Big Data.”
  2. Glaeser et al., “Big Data and Big Cities.”
  3. Athey and Imbens, “Big Data and Economics, Big Data and Economies.”
  4. National Academy of Sciences, Drawing Causal Inference from Big Data [videos]
  5. StackExchange, Open Data. A website for data requests.

One Programming Language

A general-purpose programming language can manage data that comes in peculiar formats or requires cleaning.
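A small sketch of such cleaning, assuming a made-up export with semicolon delimiters, stray spaces, thousands separators, and "n/a" markers. The column names and the clean_rows helper are hypothetical.

```python
import csv
import io

# Hypothetical raw export: stray spaces, thousands separators, "n/a" markers.
RAW = """name; income ;year
 Alice ;1,200;2014
Bob;n/a;2015
 Carol;3,450 ;2015
"""


def clean_rows(text):
    """Parse semicolon-delimited text into dicts with trimmed, typed fields."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = [h.strip() for h in next(reader)]
    cleaned = []
    for raw in reader:
        if not raw:  # skip blank lines
            continue
        row = dict(zip(header, (field.strip() for field in raw)))
        income = row["income"].replace(",", "")
        # Anything that is not a plain integer becomes a missing value.
        row["income"] = int(income) if income.isdigit() else None
        row["year"] = int(row["year"])
        cleaned.append(row)
    return cleaned


for row in clean_rows(RAW):
    print(row)
```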

Use Python by default. Its packages also replicate core functionality of Stata, Matlab, and Mathematica. Other packages handle GIS, NLP, visual, and audio data.
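To give a flavor of the Stata-like functionality, here is a bivariate OLS regression written from scratch with the standard library only. In practice a package would do this. The data and the ols helper are made up for illustration.

```python
from statistics import mean


def ols(x, y):
    """Bivariate OLS: return (intercept, slope) minimizing squared errors."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = ols(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```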

Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation. Use pip for package management.

Python is slow compared to other popular languages, but certain tweaks make it fast enough to avoid learning other languages, like Julia or Java. Generally, execution time is not an issue. Computation keeps getting cheaper (Moore’s Law: transistor counts double roughly every two years), while the coder’s time gets more expensive.
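One example of a cheap tweak, chosen here for illustration and not the only option: caching the results of a function that gets called repeatedly with the same arguments. The sketch uses functools.lru_cache from the standard library. The Fibonacci function is a stand-in for any recursive computation.

```python
from functools import lru_cache

# Count how often each function body actually runs.
calls = {"plain": 0, "cached": 0}


def fib_plain(n):
    """Naive recursion: recomputes the same subproblems many times."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)


@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but each distinct n is computed only once."""
    calls["cached"] += 1
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


assert fib_plain(20) == fib_cached(20) == 6765
print(calls)
```

The cached version runs its body 21 times on this input, once per distinct argument. The plain version runs it thousands of times.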

Command line interfaces make massive operations on files easier. For Macs and other *nix systems, learn bash. For Windows, see cmd.exe.
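The same kind of bulk operation can also be scripted in Python when the shell is not at hand. A minimal sketch with pathlib, run on a throwaway directory. The file names and the add_prefix helper are hypothetical.

```python
import tempfile
from pathlib import Path


def add_prefix(folder, pattern, prefix):
    """Rename every file in folder matching pattern by prepending prefix."""
    renamed = []
    # sorted() lists the matches first, so renamed files are not revisited.
    for path in sorted(folder.glob(pattern)):
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demonstrate on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("survey_a.csv", "survey_b.csv", "notes.txt"):
        (folder / name).write_text("placeholder\n")
    print(add_prefix(folder, "*.csv", "2015_"))
    print(sorted(p.name for p in folder.iterdir()))
```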

  1. Kevin Sheppard, “Introduction to Python for Econometrics, Statistics and Data Analysis.”
  2. McKinney, Python for Data Analysis. [free demo code from the book]
  3. Sargent and Stachurski, “Quantitative Economics with Python.” The major project using Python and Julia in economics. Check their lectures, use cases, and open source library.
  4. Gentzkow and Shapiro, “What Drives Media Slant?” Natural language processing in media economics.
  5. Dell, “GIS Analysis for Applied Economists.” Use of Python for GIS data. Outdated in technical details, but demonstrates the approach.
  6. Dell, “Trafficking Networks and the Mexican Drug War.” Also see other works in economic geography by Dell.
  7. Repository awesome-python. Best practices.

Version Control and Repository

Version control tracks changes in files. It includes:

  • showing changes made in text files: for taking control over multiple revisions
  • reverting and accepting changes: for reviewing contributions by coauthors
  • support for multiple branches: for tracking versions for different seminars and data sources
  • synchronizing changes across computers: for collaboration and remote processing
  • forking: for other researchers to replicate and extend your work
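To illustrate the first item, here is what "showing changes" looks like. The sketch uses Python's difflib to print a unified diff, the line-based format that Git also displays. The two versions of the text and the file name paper.tex are made up.

```python
import difflib

# Two hypothetical revisions of the same paragraph.
old = [
    "We estimate the model by OLS.\n",
    "Standard errors are robust.\n",
]
new = [
    "We estimate the model by 2SLS.\n",
    "Standard errors are robust.\n",
    "Results are in Table 2.\n",
]


def show_changes(before, after, name="paper.tex"):
    """Return a unified diff between two lists of lines."""
    diff = difflib.unified_diff(
        before, after, fromfile=f"a/{name}", tofile=f"b/{name}"
    )
    return "".join(diff)


print(show_changes(old, new))
```

Lines starting with a minus sign were removed, lines starting with a plus sign were added, and the rest is unchanged context.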

Git is the de facto standard for version control. GitHub.com is the largest service that maintains Git repositories. It offers free storage for open projects and paid storage for private repositories.

Sharing

Storage

A GitHub repository is a one-click solution for both code and data. No problems with university servers, relocated personal pages, or sending large files via email.

When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.

Demonstration

Jupyter notebooks combine text, code, and output on the same page. See examples:

  1. QuantEcon’s notebooks.
  2. Repository of data-science-ipython-notebooks. Machine learning applications.
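Under the hood, a notebook is a JSON file that keeps text cells, code cells, and saved output together. The sketch below builds a minimal one by hand, following my reading of the version 4 notebook layout. I have not validated it against the official schema, and the make_notebook helper and the cell contents are made up.

```python
import json


def make_notebook(text, code, output):
    """Build a minimal notebook dict with one text cell and one code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": [text]},
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": [code],
                # Saved output travels with the code that produced it.
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": [output]}
                ],
            },
        ],
    }


nb = make_notebook("# Demand estimation", "print(1 + 1)", "2\n")
print(json.dumps(nb, indent=1))
```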

Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.

Remote Server

Remote servers keep large datasets in memory and run numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. All of this saves time.
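A toy Monte Carlo simulation of the kind that gets moved to a server once the number of draws grows. It estimates pi from random points in the unit square. The estimate_pi function and the fixed seed are choices made for this illustration.

```python
import random


def estimate_pi(n_draws, seed=0):
    """Estimate pi as 4 times the share of points inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    inside = 0
    for _ in range(n_draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_draws


print(estimate_pi(100_000))
```

The error shrinks with the square root of the number of draws, so one more digit of precision costs about a hundred times more computation.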

If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay for the storage and processor power they use, so they can rent a large machine for a short time and exploratory analysis goes quickly.

A typical workflow with version control:

  1. Creating a Git repository
  2. Taking a small sample of data
  3. Coding and debugging research on a local computer
  4. Executing an instance on a remote server
  5. Syncing the code between two locations via Git
  6. Running the code on the full sample on the server
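Step 2 of the workflow above can be sketched as follows, assuming the full dataset is too large to load at once. Reservoir sampling keeps a uniform random sample while reading the file line by line. The sample_lines helper and the synthetic data are made up for illustration.

```python
import io
import random


def sample_lines(lines, k, seed=0):
    """Reservoir sampling: keep k lines drawn uniformly from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Hypothetical "full dataset": a header plus 10,000 rows, read as a stream.
full = io.StringIO(
    "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(10_000))
)
header = next(full)
sample = sample_lines(full, k=100)
print(header.strip(), len(sample))
```

With a real file, open(path) replaces the StringIO object. The fixed seed makes the local sample reproducible across machines.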

Some services allow writing code in a browser and running it right on their servers.

  1. EC2 AMI for scientific computing in Python and R. Read the last paragraph first.
  2. Amazon, Scientific Computing Using Spot Instances
  3. Google, Datalab

Real-time Applications

Real-time analysis requires optimization for performance. Examples from industrial applications:

  1. Jordan, On Computational Thinking, Inferential Thinking and Big Data. A general talk about getting better results faster.
  2. Google, Economics and Electronic Commerce research
  3. Microsoft, Economics and Computation research
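One common ingredient of real-time analysis, shown here as an illustration and not taken from the sources above, is updating statistics one observation at a time without storing the history. The sketch uses Welford's online algorithm for the mean and variance. The RunningStats class and the sample numbers are made up.

```python
class RunningStats:
    """Welford's online algorithm for the mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (nan) for fewer than two points."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for price in [10.0, 12.0, 11.0, 15.0, 9.0]:  # hypothetical incoming prices
    stats.update(price)
print(stats.n, stats.mean, stats.variance)
```

Each update takes constant time and memory, no matter how many observations have arrived.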

The Map

A map for learning new data technologies by Swami Chandrasekaran:

[Image: map of data technologies to learn, by Swami Chandrasekaran]

 
