5 Reasons Sensory and Consumer Scientists Should Learn (a Little) Data Science
In the fall of 2013, my life changed forever. I had just received the 2013 Food Quality and Preference "Researcher of the Future" award, consumer packaged goods (CPG) companies around the world were adopting the Tetrad test, and my consulting career was taking off as there was growing interest not only in the Tetrad test but also in the various combinatorial tools that we were championing at the Institute for Perception. What changed was that I agreed to provide statistical support on a series of surveys for regulatory engagement. While I was certainly qualified to give this support, given the numerous IRB-approved studies I had run during my post-doc and my knowledge of survey statistics, what I found was that the sheer size and scope of this research required an almost complete re-tooling of my workflow. Moreover, I saw as I collaborated on this research that there was a pressing need to catalog the precise definitions of highly complicated variables and subgroups of interest. What I didn't realize I needed at the time, but what I became highly motivated to study and now sincerely appreciate, was data science.
Have you ever felt like this? You might need data science.
What is data science?
Many have debated the definition of data science, with the origins of data science going back at least to John Tukey's seminal work on data exploration, but I think the best answer is that data science is the offspring of statistics and computer science that has arisen in response to practical needs and has become possible because of advances in computing and algorithmic power. As Josh Wills proposed:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
Culturally, data science differs from computer science in much the same way statistics differs from mathematics - in each case, the first discipline uses tools from the second to answer questions about the real world. Similarly, differences between data science and statistics mirror the differences between computer science and mathematics - in each case, the first discipline is more computationally oriented, with the result that data science is more interested in prediction and less interested in model fitting and parameter estimation than is traditional statistics.
But why learn data science?
At this point, you might be thinking, "Well, John, that's very interesting, but so what?" Fortunately, there is a "so what"! In fact, there are at least five "so what"s, all related in various ways to the increasing demands on sensory and consumer scientists as the speed of business compounds on itself and the corresponding amount of data we must handle increases exponentially.
1. Transformation
When we receive data, there are typically several operations required for cataloging, cleaning, correcting, and reshaping the data to make it suitable for downstream processing. If you take away one idea from this article, please let it be the lesson that, if you're transforming your data manually in any way, such as in Excel, there are (much) better tools available. I'm a fan of the tidyverse constellation of tools in R, but there are many other data scientific tools, too numerous to name here. These tools not only allow you great flexibility for preparing your data for analysis (a process sometimes called "data wrangling"), they also allow you to record the precise steps taken in this process. This record empowers you to know later with certainty which steps you took, to reproduce those steps on other datasets, and to use those steps as the jumping-off point for similar but not identical processing in the future.
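To make the idea of a recorded transformation concrete, here is a minimal sketch. It uses only the Python standard library rather than the tidyverse tools mentioned above, and everything in it is an assumption made for illustration: the ratings data are invented, the 9-point liking scale is assumed, and the function names `clean_rows` and `to_wide` are hypothetical.

```python
import csv
import io

# Hypothetical raw export: one row per panelist/product rating (invented data).
RAW_CSV = """panelist,product,liking
101, Cola A ,7
101,cola b,5
102,Cola A,8
102,COLA B,
103,cola a,9
103,Cola B,11
"""


def clean_rows(raw_text):
    """Clean and correct the raw rows; return (cleaned_rows, log_of_decisions)."""
    cleaned, log = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Normalize inconsistent product labels ("cola b", " Cola A ", "COLA B").
        product = row["product"].strip().title()
        liking = row["liking"].strip()
        if not liking.isdigit():
            log.append(f"dropped {row['panelist']}/{product}: missing rating")
            continue
        score = int(liking)
        if not 1 <= score <= 9:  # assumes a 9-point liking scale
            log.append(f"dropped {row['panelist']}/{product}: rating {score} out of range")
            continue
        cleaned.append(
            {"panelist": int(row["panelist"]), "product": product, "liking": score}
        )
    return cleaned, log


def to_wide(rows):
    """Reshape long rows into one entry per panelist, keyed by product."""
    wide = {}
    for r in rows:
        wide.setdefault(r["panelist"], {})[r["product"]] = r["liking"]
    return wide


cleaned, log = clean_rows(RAW_CSV)
wide = to_wide(cleaned)

for panelist, ratings in wide.items():
    print(panelist, ratings)
for entry in log:
    print("LOG:", entry)
```

Because every cleaning decision lives in the script and every dropped row is written to the log, the same steps can be rerun unchanged on the next dataset, which is the point of moving this work out of a spreadsheet.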
2. Exploration
Data exploration has two main branches - descriptive statistics and data visualization. For the first of these two branches, the facility with which data scientific tools can transform data - quickly examining data cross-sections and combinations of variables - means that data can be easily examined for trends and, with caution, to suggest hypotheses. Moreover, the wealth of exploratory techniques available within open-source communities such as the R community allows one to access more advanced tools such as exploratory factor analysis easily.
For the second branch of data exploration - data visualization - data science supports essentially infinite customization to examine data from any conceivable angle. I'm enthusiastic about the use of ggplot2 and other "gg" tools to create beautiful and easy-to-read charts, and other data scientific platforms similarly support the production of customizable and attractive graphics. The two charts shown below come from menu research for my son's preschool:
3. Automation
Although the data scientific benefits we've described are considerable, where data science most stands out from traditional approaches is when identical actions need to be taken many times for different subsets of data or for different subsets of variables. In fact, the tools for batch processing are so powerful one must take special precautions against the so-called look-elsewhere effect (or vast search effect) to avoid being seduced by spurious correlations.
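As a small illustration of both the power and the risk of batch processing, the sketch below runs the same test over several subgroups and then applies a Bonferroni correction, one simple guard against the look-elsewhere effect. It is written with the Python standard library only. The subgroup names and preference counts are invented, the 50/50 null hypothesis is an assumption, and Bonferroni is just one of several possible corrections.

```python
from math import comb

# Hypothetical counts: (panelists preferring product A, panelists tested).
SUBGROUPS = {
    "age 18-34": (16, 20),
    "age 35-54": (11, 20),
    "age 55+": (9, 20),
    "urban": (15, 20),
    "rural": (19, 20),
}
ALPHA = 0.05


def binomial_two_sided_p(successes, n):
    """Exact two-sided binomial p-value under a 50/50 null hypothesis."""
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# The identical analysis applied to every subgroup in one pass.
p_values = {name: binomial_two_sided_p(s, n) for name, (s, n) in SUBGROUPS.items()}

# Naive reading: every subgroup is judged against ALPHA on its own.
naive_hits = [name for name, p in p_values.items() if p < ALPHA]

# Bonferroni: divide ALPHA by the number of tests that were run.
corrected_alpha = ALPHA / len(p_values)
corrected_hits = [name for name, p in p_values.items() if p < corrected_alpha]

print("naive:", naive_hits)
print("corrected:", corrected_hits)
```

With these made-up counts, three subgroups look significant when each is read in isolation, but only one survives once the number of tests is taken into account.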
On the positive side of automation, one can make use of such tools as the modelr package in R to keep data, models, results, and even charts organized in a single table for easy access. Moreover, data scientific tools become especially powerful when tools for automated analysis get combined with tools for automated reporting, such as those supplied by the knitr and the officer packages in R. While I appreciate and use knitr frequently, I find the officer package to be especially valuable as it allows for almost total control of reports crafted in either Word or PowerPoint. This last point takes us to yet another advantage of data science - communication.
4. Communication
Over the last several years I've been fortunate to have supported consumer research within the technology sector, during which time I've seen firsthand the emphasis placed by technology companies on design. As a result of this work, I've come to appreciate the need for excellent design and have become a student of design myself. I've also come to believe that one factor that holds sensory and consumer scientists back from maximum impact is a shortage of design principles within their communications with marketing (and management more generally).
Fortunately, communication is one area in which data science can greatly assist sensory and consumer scientists. Because of the automation of analysis and the ability to write reusable scripts for future reporting, data science frees up time for sensory and consumer scientists to craft excellent reports and presentations. Moreover, because of the high level of customization offered by data scientific tools, choices that are too often ignored - such as fonts and color palettes - can be given the attention they deserve consistently. Finally, because a well-written data scientific script will automate the bulk of the analysis, sensory and consumer scientists using these tools have time to ask themselves essential questions such as, "What is the point I'm trying to make?", "What decisions do I want my audience to make?", and "What action do I want my audience to take?".
5. Replication
Putting all of the ideas in this post together, we arrive at the fifth benefit of data scientific tools, which is reproducible research. This benefit is the one that I came to most appreciate while involved in statistics for regulatory engagement. Of course, regulators would like to know the exact steps taken in an analysis, including the steps used to prepare the data, but they're not the only ones. Perhaps the most important person who needs to know what you did exactly during an analysis is future-you and, as Hadley Wickham said,
... you’re actually always collaborating with future-you; and past-you doesn’t respond to emails.
Hence the ideal situation to be in as a scientist is one where you can start from the original dataset and reproduce the final report or presentation automatically. Once you are in this position, you can answer any questions that you or anyone else might have about your work, create a new report easily should the data change slightly, and start new work on similar datasets with little to no effort. Also, if your data are consistently arriving in the same format, you can create your reports automatically and spend any available time further improving and reproducibly refining your reporting. Over time, you can amass a library of "recipes" that you draw from for new work in addition to being able to reproduce any work from the past quickly.
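The sketch below shows the "original dataset in, finished report out" idea in its simplest form, again using only the Python standard library. The data, the column names, and the `build_report` function are hypothetical. In practice the output would more likely be a Word or PowerPoint document than plain text.

```python
import csv
import io
from statistics import mean

# Hypothetical original dataset (invented values).
RAW_CSV = "product,liking\nA,7\nB,5\nA,8\nB,6\n"


def build_report(raw_csv_text):
    """Go from raw data to finished report text with no manual steps."""
    by_product = {}
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        by_product.setdefault(row["product"], []).append(int(row["liking"]))

    lines = ["Mean liking by product", "----------------------"]
    for product in sorted(by_product):
        scores = by_product[product]
        lines.append(f"{product}: {mean(scores):.2f} (n={len(scores)})")
    return "\n".join(lines)


report = build_report(RAW_CSV)
print(report)

# If the data change slightly, the same recipe produces the updated report.
updated_report = build_report(RAW_CSV + "A,9\n")
print(updated_report)
```

Running the function twice on the same input yields the same report, and running it on slightly changed data yields a new report with no extra effort.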
How to start?
As with anything new, learning data science can be intimidating at first. When I attended my first data science conference, I was intimidated despite having spoken many times at international conferences in my own field. But now, this year, I'll be a speaker at the Symposium on Data Science and Statistics. So things are learnable over time - the main thing is to get started. To that end, I recommend three paths to pursue concurrently.
The first path I recommend is to work through R for Data Science by Hadley Wickham and Garrett Grolemund. I've typed every line of code in this book, and it's still the first place I go if I have a simple problem. This book is available for free online, but I still find it more satisfying to work from an actual book. In fact, as a side note, Hadley Wickham was kind enough to sign my well-worn copy when I was at the recent Conference on Statistical Practice:
One of my prized possessions.
The second path I would recommend is online learning. Of the many options available, I would prioritize the interactive learning modules on DataCamp. On DataCamp, you can complete individual courses to your liking or can sign up for tracks consisting of several courses. I've completed many courses on DataCamp and have found it very helpful. Another recommended online option is to learn through massive open online courses (MOOCs) such as those found on Coursera.org. I've found the big picture courses such as the Executive Data Science Specialization to be the most helpful of the courses there. Finally, it's useful to get skilled at searching message boards such as Stack Overflow and Cross Validated on Stack Exchange although, honestly, typing a question into Google usually leads to the same answers at this point. Learning a few simple Google search tricks is also worth a few minutes.
The third path is to find a mentor or at least a colleague to join you on your journey. If you work in a company with a data science or a statistics group, reach out to your internal colleagues. They will be excited that someone is interested in their nerdy fun and will be happy to help you. Many cities also have Meetup groups dedicated to data science - meetups can be a neat way to meet new people and learn something new. Finally, talk to your sensory and consumer science colleagues and lobby for educational opportunities. With the wave of automation headed our way, it will be important for all professionals to have some level of comfort with techniques arising in computer science as well as high-level communication skills. Fortunately, data science helps on both fronts.
I hope the sensory and consumer science community will become excited about data science and will become more successful than ever in the coming new machine age as it embraces these tools.
Thanks for reading!
This article was originally published on the Aigora blog.