Data Acquisition and Sampling


The key problem in applying computer scientists’ data selection methods to social science phenomena is that, unlike in a pre-planned survey or observational study, organic data are typically collected through APIs, third-party companies, or web scraping, without full knowledge of or control over the frame from which the samples are generated and without an explicit description of the algorithmic selection properties. Moreover, most researchers who mine social media data do not understand the characteristics of the individuals in their sample or the mismatch between their sample and the target population. Adjusting for this mismatch between the social media population and the research target population is not straightforward, because many social media platforms do not provide accessible, specified fields for socio-demographics and do not require users to report these or other characteristics that are essential for understanding the pool from which participants originate. While some work on these issues exists, standards need to be established across disciplines for understanding the samples researchers use, along with sampling frame coverage, sampling procedure, sample design features and size, and population mismatch adjustments.
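To make concrete what a population mismatch adjustment requires, the following is a minimal sketch of post-stratification weighting, one common adjustment technique chosen here for illustration and not a method prescribed in this section. It assumes that a group label (for example, an age band) is known or has been inferred for every sampled account and that population shares for the same groups are available from an external source. These are exactly the pieces of information that, as noted above, are often missing for social media data. The function names, group labels, and all numbers are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per sampled unit: population share / sample share of its group.

    sample_groups: list of group labels, one per sampled unit (assumed known or inferred).
    population_shares: dict mapping group label to its share of the target population.
    """
    n = len(sample_groups)
    counts = Counter(sample_groups)

    # A population group with no sampled units cannot be repaired by reweighting.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sampled units in groups: {sorted(missing)}")

    weights = []
    for group in sample_groups:
        if group not in population_shares:
            raise ValueError(f"no population share for group: {group!r}")
        sample_share = counts[group] / n
        weights.append(population_shares[group] / sample_share)
    return weights


def weighted_mean(values, weights):
    """Weighted average of values using the supplied weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample that over-represents younger users.
sample_groups = ["18-29"] * 6 + ["30-49"] * 3 + ["50+"] * 1
# Hypothetical target-population shares for the same groups.
population_shares = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}
# Hypothetical binary outcome: 1 for every sampled 18-29 user, 0 otherwise.
outcome = [1] * 6 + [0] * 4

weights = poststratification_weights(sample_groups, population_shares)
unweighted_estimate = sum(outcome) / len(outcome)
weighted_estimate = weighted_mean(outcome, weights)
```

In this toy example the unweighted estimate reflects the over-represented group, while the weighted estimate pulls it toward the assumed population composition. The adjustment is only as good as the group labels and population shares it is given, which is why documenting frame coverage and sample composition matters.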