How Using Big Data Can Improve Survey Sample Efficiency

By David Dutwin


Big Data and machine learning have been areas of great interest to survey researchers for more than a decade. My NORC colleagues Patrick Coyle, Josh Lerner, Ipek Bilgen, Ph.D., Ned English, and I delve into these topics in a new article published in the Journal of Survey Statistics and Methodology. We examine how to leverage both Big Data and machine learning to develop a tool that targets sample toward households likely to have specific demographic characteristics, such as education level, income, or age. This lets survey researchers target key populations of interest more effectively and comprehensively, lowering costs and making surveys of low-incidence populations more accessible.

We compared this approach to two traditional sampling strategies: geographic clustering, whereby sample is focused on areas with a high incidence of a given population (e.g., neighborhoods that are predominantly Hispanic), and the use of vendor flags, such as an indicator of Hispanic ethnicity furnished by consumer data providers.

Our findings showed that, in most cases, Big Data and machine learning models are more effective at identifying households with specific characteristics.

The models leverage the ability of machine learning tools to take many inputs, sometimes thousands, and predict the likelihood that any given household in the U.S. has a certain attribute, from Hispanic ethnicity to the presence of a smoker or a specific religious identity. The only limitations are the availability of a large training dataset with self-reported data on a given attribute and the ability of the machine learning algorithm to make a reliable prediction from the thousands of Big Data indicators one can gather from a variety of sources.
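To make the general idea concrete, here is a minimal sketch in Python of model-based sample targeting. It is not the method from our article: the data are synthetic, a simple logistic regression stands in for whatever learning algorithm one might actually use, three made-up features stand in for the thousands of real indicators, and every name (make_households, fit_logistic, the coefficients) is hypothetical. The sketch trains on households with a known, self-reported attribute, scores a separate frame of households, and compares incidence in the top-scoring fifth against the frame as a whole.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_households(n, rng):
    """Synthetic households as (features, has_attribute) pairs.

    The relationship below is invented for illustration; the negative
    intercept makes the attribute relatively rare (low incidence).
    """
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        p = sigmoid(-2.5 + 1.5 * x[0] - 1.0 * x[1])
        rows.append((x, 1 if rng.random() < p else 0))
    return rows


def predict(w, b, x):
    """Predicted probability that a household has the attribute."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def fit_logistic(rows, epochs=200, lr=0.5):
    """Fit logistic regression by full-batch gradient descent."""
    k = len(rows[0][0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in rows:
            err = predict(w, b, x) - y
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def incidence(rows):
    """Share of households that have the attribute."""
    return sum(y for _, y in rows) / len(rows)


rng = random.Random(0)
train = make_households(2000, rng)   # stands in for self-reported training data
frame = make_households(2000, rng)   # stands in for the sampling frame

w, b = fit_logistic(train)

# Rank frame households by predicted likelihood and keep the top 20%.
ranked = sorted(frame, key=lambda r: predict(w, b, r[0]), reverse=True)
targeted = ranked[: len(ranked) // 5]

overall_rate = incidence(frame)
targeted_rate = incidence(targeted)
print(f"Incidence in full frame:      {overall_rate:.1%}")
print(f"Incidence in targeted sample: {targeted_rate:.1%}")
```

In practice, one would likely sample at different rates across score strata rather than drop low-scoring households outright, so that the full population remains covered.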
