Improved Data Collection from Online Sources Using Query Expansion and Active Learning

39 Pages; Posted: 29 Aug 2017; Last revised: 22 Oct 2017

Fridolin Linder

New York University (NYU) - Social Media and Political Participation (SMaPP) Lab

Date Written: August 25, 2017

Abstract

Datasets derived from searching online textual sources, such as social media sites and news article repositories, are increasingly used in political science research. Common approaches for retrieving such data rely mostly on keyword queries and lack a systematic evaluation of the quality of the retrieved sample. Building on the framework proposed in Li et al. (2014), I propose a methodology that combines approaches from machine learning and natural language processing to improve the identification of relevant data in large text corpora while minimizing the required amount of human supervision. It consists of two steps. First, a larger set of data is retrieved from the total population using keywords. Second, a machine learning approach is used to separate this initial set into relevant and irrelevant Tweets. Information from the labeled data is then used to suggest additional keywords that expand the initial query. I evaluate the approach in a case study that retrieves Tweets about the German refugee crisis from a large dataset of German-language Tweets. Compared to commonly applied data retrieval strategies, the proposed approach provides increased precision and recall as well as substantive representativeness. I additionally provide software that implements the algorithm specifically for Twitter and makes it accessible to applied researchers.
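
The loop described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's software. The abstract does not name the classifier, the active-learning criterion, or the term-scoring rule, so everything below is an assumption made for illustration: the function names (`keyword_retrieve`, `pick_for_labeling`, `suggest_keywords`), the use of uncertainty sampling, the smoothed log-odds score for ranking candidate keywords, and the toy English corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens; a leading '#' is kept for hashtags."""
    return re.findall(r"#?\w+", text.lower())


def keyword_retrieve(corpus, keywords):
    """Step 1: keep every document that contains at least one query keyword."""
    query = {k.lower() for k in keywords}
    return [doc for doc in corpus if query & set(tokenize(doc))]


def pick_for_labeling(unlabeled, predict_proba, k=2):
    """Assumed active-learning rule (uncertainty sampling): pick the k documents
    whose predicted relevance probability is closest to 0.5."""
    return sorted(unlabeled, key=lambda doc: abs(predict_proba(doc) - 0.5))[:k]


def suggest_keywords(labeled, query, top_n=3, smoothing=1.0):
    """Assumed expansion rule: rank terms by the smoothed log-odds of occurring
    in relevant versus irrelevant labeled documents."""
    rel, irr = Counter(), Counter()
    n_rel = n_irr = 0
    for text, is_relevant in labeled:
        terms = set(tokenize(text))  # document frequency, not raw term counts
        if is_relevant:
            rel.update(terms)
            n_rel += 1
        else:
            irr.update(terms)
            n_irr += 1
    known = {k.lower() for k in query}
    scores = {}
    for term in set(rel) | set(irr):
        if term in known:
            continue  # do not re-suggest keywords already in the query
        p_rel = (rel[term] + smoothing) / (n_rel + 2 * smoothing)
        p_irr = (irr[term] + smoothing) / (n_irr + 2 * smoothing)
        scores[term] = math.log(p_rel / (1 - p_rel)) - math.log(p_irr / (1 - p_irr))
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical toy corpus (English stand-ins, not data from the paper).
corpus = [
    "refugees arrive in munich #refugeeswelcome",
    "refugee camp opens near berlin asylum seekers housed",
    "football transfer news from munich",
    "asylum law debated as refugees cross border",
    "the band refugee plays a concert tonight",
]
query = ["refugee", "refugees"]
retrieved = keyword_retrieve(corpus, query)

# Hand-assigned labels standing in for human annotation.
labeled = [(doc, "band" not in doc) for doc in retrieved]
print(suggest_keywords(labeled, query, top_n=1))  # -> ['asylum']
```

In a full loop, the suggested terms would be reviewed by the researcher, added to the query, and the retrieve, label, and expand cycle repeated; `pick_for_labeling` would be called with the probabilities of whatever classifier is trained on the labels collected so far.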

Keywords: Active learning, query expansion, information retrieval, social media, Twitter

Suggested Citation

Linder, Fridolin, Improved Data Collection from Online Sources Using Query Expansion and Active Learning (August 25, 2017). Available at SSRN: https://ssrn.com/abstract=3026393 or http://dx.doi.org/10.2139/ssrn.3026393

Fridolin Linder (Contact Author)

New York University (NYU) - Social Media and Political Participation (SMaPP) Lab

New York, NY
United States
