The Do's and Don'ts When Handling Data

Sep 19, 2021 · #data #social science research #continuously updated

This post summarizes the do’s and don’ts in data manipulation and analysis. Some entries are inspired by personal experience, while others come from replication papers and journal retractions. The post will be continuously updated as I’m bound to make more mistakes.


Data Entry

  1. Don’t use colors, fonts, etc. to encode information
    • I’ve seen datasets where the person who entered the data encoded information in font color, cell color, and font family. While these attributes can be recovered in modern statistical software, it is much easier if all information is encoded as text. Use the “select and drag” feature to speed up data entry
  2. Paste values, not formulas
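    • Most of the time we want the computed values, not the formulas behind them: a pasted formula keeps pointing at other cells and can silently change or break when moved. Select “paste special” and paste values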
  3. Don’t let Excel turn strings into dates
    • The trick I use is to select all cells and format them as text before I start entering data in an Excel spreadsheet

Data Manipulation

  1. Don’t code NAs as 0s
    • Some legacy statistical software may even automatically turn NAs into 0s!
  2. Make sure that outcome variables are not included as predictors
    • We may create transformed versions of our outcome variable (e.g. from continuous to binary, from unscaled to scaled, from non-log to log). If the outcome variable in its original form is left in the dataset as a predictor, our model performance would be inflated (aka data leakage). (Solution: Plot the factor analysis!)
  3. Don’t include future data in the training set when making timeseries predictions
    • Randomly sampling rows to create a training set introduces look-ahead bias
  4. Impute missing data using the training set only; imputing from the entire dataset leaks information from the test set (data leakage)
  5. Remember to reverse the scale for certain Likert questions
    • Sometimes 1 = Good and 10 = Bad; sometimes 1 = Bad and 10 = Good
    • Additive indices may have subcomponents that are measured in different directions (To avoid acquiescence bias, it’s preferable to ask respondents their level of agreement with similar statements phrased in opposite directions)
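
Several of the points above (keeping NAs as NAs, splitting by time, imputing from the training set only, and reverse-coding) can be sketched in a few lines of plain Python. This is a minimal illustration, not a recipe: the toy records, the variable name x, and the helper names are all made up for the example.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value (it is NOT coded as 0).
rows = [
    {"date": "2020-01", "x": 4.0},
    {"date": "2020-02", "x": None},
    {"date": "2020-03", "x": 6.0},
    {"date": "2020-04", "x": 8.0},
    {"date": "2020-05", "x": None},
    {"date": "2020-06", "x": 100.0},
]

def chronological_split(records, n_train):
    """Train on the earliest n_train rows, test on the rest (no random sampling)."""
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[:n_train], ordered[n_train:]

def impute_with_train_mean(train, test, key):
    """Fill missing values in both sets with the mean of the TRAINING set only."""
    fill = mean(r[key] for r in train if r[key] is not None)

    def fill_in(records):
        return [{**r, key: (fill if r[key] is None else r[key])} for r in records]

    return fill_in(train), fill_in(test), fill

def reverse_code(score, low=1, high=10):
    """Flip a Likert item so that all items in an index point the same way."""
    return low + high - score

train, test = chronological_split(rows, n_train=4)
train_filled, test_filled, fill = impute_with_train_mean(train, test, "x")

print(fill)                            # mean of the observed training values
print([r["x"] for r in test_filled])   # test set filled with the training mean
print(reverse_code(2))                 # a 2 on a 1-10 scale becomes a 9
```

Here the extreme value in the last period never enters the imputation, because it sits in the test set. Had the missing values been coded as 0, the training mean would have been pulled down as well.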

Online Survey and Experiments

  1. Randomize the order of treatments (if each respondent is given multiple treatments)
  2. Communicate with co-authors to make sure that you use the same labeling and numbering systems throughout a project
    • Document variable coding every step of the way as you may have to return to this project after a months-long hiatus
  3. Check the “back” button
    • Sometimes we want it disabled
  4. Check the “force choice” option for each question
    • Sometimes we don’t want respondents to skip a question
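
Most survey platforms have a built-in randomizer, but if the treatment order has to be generated outside the platform, a rough sketch could look like the following. The treatment labels, the seed, and the function name are assumptions made for illustration.

```python
import random

TREATMENTS = ["A", "B", "C"]  # hypothetical treatment labels

def treatment_order(respondent_id, seed=2021):
    """Return a shuffled treatment order that is reproducible per respondent."""
    rng = random.Random(f"{seed}-{respondent_id}")  # one generator per respondent
    order = TREATMENTS[:]  # copy so the master list is left untouched
    rng.shuffle(order)
    return order

# Draw an order for 200 hypothetical respondents
orders = {rid: treatment_order(rid) for rid in range(1, 201)}
print(orders[1], orders[2])
```

Seeding on the respondent ID means the order can be regenerated later, which helps when documenting how each respondent was treated.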

Data Analysis

Descriptive Statistics

  1. Plot the distribution of every variable, especially if you didn’t collect the data yourself
    • I use the dfSummary function from summarytools, which summarizes all variables at once and deals with all data types
  2. Draw x and y on a scatterplot – Is their relationship driven by a few extreme values?
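
Besides eyeballing the scatterplot, one quick check is to recompute the correlation after dropping each observation in turn. The sketch below uses made-up data in which a single extreme point creates the entire relationship; the function names are my own.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def leave_one_out(x, y):
    """Correlation recomputed after dropping each observation in turn."""
    return [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]

# Hypothetical data: five unrelated points plus one extreme value at (20, 20)
x = [1, 2, 3, 4, 5, 20]
y = [3, 1, 3, 1, 3, 20]

print(round(pearson(x, y), 2))            # strong correlation on the full data
print(round(leave_one_out(x, y)[-1], 2))  # correlation once the extreme point is dropped
```

If one of the leave-one-out values differs sharply from the full-sample correlation, the relationship is probably driven by that observation.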

Instrumental Variables (IV)

  1. Placebo check: Use substantive knowledge to find a subsample where the instrument does not correlate with treatment assignment (Zero-First-Stage). In that subsample, the reduced-form coefficient from regressing Y on Z should be indistinguishable from 0
  2. Weak instrument test: Calculate and report the first-stage F statistic and the 2SLS z-score; remember to cluster at the right level when reporting bootstrapped SEs
  3. Be skeptical if the IV estimates are much bigger than the OLS estimates
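
To make the quantities above concrete, here is a minimal sketch for the simplest case: one instrument z, one treatment d, and no covariates. The data are invented, the function names are my own, and the F statistic is the classical (homoskedastic) version, so it does not handle clustering or robust standard errors.

```python
def _mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = _mean(a), _mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def first_stage_f(z, d):
    """Classical F statistic from regressing treatment d on a single instrument z."""
    n = len(z)
    slope = cov(z, d) / cov(z, z)
    intercept = _mean(d) - slope * _mean(z)
    rss = sum((di - intercept - slope * zi) ** 2 for zi, di in zip(z, d))
    ess = slope ** 2 * cov(z, z) * (n - 1)  # explained sum of squares
    return ess / (rss / (n - 2))

def ols_slope(d, y):
    """OLS coefficient from regressing y on d."""
    return cov(d, y) / cov(d, d)

def iv_slope(z, d, y):
    """Just-identified 2SLS (Wald) estimate: reduced form over first stage."""
    return cov(z, y) / cov(z, d)

# Hypothetical toy data
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
y = [1, 2, 1, 4, 2, 5, 4, 5]

print(first_stage_f(z, d))  # compare against the conventional rule of thumb of 10
print(ols_slope(d, y), iv_slope(z, d, y))
```

In this toy example the first-stage F is far below 10 and the IV estimate exceeds the OLS estimate, which is the combination that should make us cautious.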

Regression Discontinuity (RD)

  1. Abandon the project if it’s underpowered! The sample size for observational RDs is often too small (e.g. the number of close elections)
  2. Use rdrobust to automatically select bandwidths
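
A quick way to see whether an RD is likely to be underpowered is to count how many observations fall near the cutoff. The sketch below does only that: the bandwidth h and the vote margins are hypothetical, and in a real analysis the bandwidth would come from a data-driven selector such as the one in rdrobust rather than being set by hand.

```python
def n_within_bandwidth(running, cutoff=0.0, h=0.05):
    """Count observations within h of the cutoff, separately on each side."""
    left = sum(1 for r in running if cutoff - h <= r < cutoff)
    right = sum(1 for r in running if cutoff <= r <= cutoff + h)
    return left, right

# Hypothetical vote margins (winner's share minus runner-up's share)
margins = [-0.30, -0.12, -0.04, -0.01, 0.02, 0.03, 0.08, 0.15, 0.40]

print(n_within_bandwidth(margins, h=0.05))  # close elections only
print(n_within_bandwidth(margins, h=0.10))  # a wider window
```

Only a handful of the observations sit near the cutoff here, which is the typical situation with close elections.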

Panel Data / Differences-in-Differences / Two-Way Fixed Effects

[Section under construction]