Improving Causal Inference with Text as Data in Empirical IS Research: A Machine Learning Approach
The 48th International Conference on Information Systems, 2020
9 Pages. Posted: 19 Dec 2020
Date Written: October 21, 2020
Abstract
This study combines two streams of literature, text representation and machine learning-based causal inference, to examine how representing text as data can improve causal inference, that is, yield more accurate treatment effect estimates. Using a real problem context, Yelp reviews, we demonstrate how to train a topic model or a Word2Vec model to transform review text into meaningful metrics, and how to use a causal forest to estimate the treatment effect of the ‘Elite’ badge awarded by Yelp on the number of votes a review receives. Results show that the estimated average treatment effect (ATE) decreases significantly once quantitative text representations are added to the model, which implies that the positive effect of the ‘Elite’ badge is overestimated when text information is omitted. We also present specific steps to help other researchers use the causal forest to estimate heterogeneous effects across subgroups. Overall, we show that transforming text into quantitative data makes treatment effect estimation more accurate.
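The overestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method or data: it uses synthetic reviews rather than Yelp data, and it swaps the causal forest for simple stratification on a single text-derived score. The names `simulate_reviews`, `naive_ate`, and `stratified_ate`, the `text_score` feature (a stand-in for a topic-model or Word2Vec metric), and all numeric parameters are assumptions chosen for illustration. It shows only the general mechanism: when text quality drives both badge status and votes, an estimate that ignores the text is biased upward.

```python
import random
from statistics import mean


def simulate_reviews(n=20000, true_effect=1.0, seed=0):
    """Generate synthetic (elite, text_score, votes) rows.

    text_score is a hypothetical text-derived feature in [0, 1).
    It raises both the chance of holding the badge and the vote count,
    so it confounds the badge effect. All coefficients are assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        text_score = rng.random()
        elite = 1 if rng.random() < 0.2 + 0.6 * text_score else 0
        votes = true_effect * elite + 3.0 * text_score + rng.gauss(0, 1)
        rows.append((elite, text_score, votes))
    return rows


def naive_ate(rows):
    """Difference in mean votes, ignoring the text feature."""
    treated = [v for e, _, v in rows if e == 1]
    control = [v for e, _, v in rows if e == 0]
    return mean(treated) - mean(control)


def stratified_ate(rows, n_bins=20):
    """Size-weighted average of within-bin differences in mean votes,
    where bins are equal-width intervals of the text feature."""
    bins = {}
    for e, s, v in rows:
        b = min(int(s * n_bins), n_bins - 1)
        bins.setdefault(b, {0: [], 1: []})[e].append(v)
    total, weighted = 0, 0.0
    for groups in bins.values():
        # Skip bins that lack either treated or control reviews.
        if groups[0] and groups[1]:
            size = len(groups[0]) + len(groups[1])
            weighted += size * (mean(groups[1]) - mean(groups[0]))
            total += size
    return weighted / total


if __name__ == "__main__":
    rows = simulate_reviews()
    print("naive ATE:     ", round(naive_ate(rows), 3))
    print("text-adjusted: ", round(stratified_ate(rows), 3))
```

In this synthetic setup the naive estimate exceeds the simulated effect, while the text-adjusted estimate lies close to it. A causal forest generalizes the same idea to many covariates and to effects that vary across subgroups.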
Keywords: Causal Inference, Heterogeneous Treatment Effect, Text Representation, NLP, Machine Learning, Online Reviews
JEL Classification: C1, M16