ChatGPT-4 Versus Decision Analyst’s Deep Learning Model
By John V. Colias, Ph.D.
With all the hype and hoopla over generative AI, we decided to do some experimental work to see how well ChatGPT-4 performed versus Decision Analyst’s Deep Learning Model, a multi-layer neural network classification model.
ChatGPT evolved out of deep learning methods that first appeared in the early 2010s. Since then, deep learning models have achieved prominence in image classification, speech recognition, and improved web search results, and, more recently, have gained the ability to understand text questions and answer in natural language (the “prompt” and the “response”).
Our investigation addresses an important question: how should generative AI models like ChatGPT be used in the marketing research industry? More specifically, how well do ChatGPT responses align with human-produced responses, and is additional modeling needed to bring the two into closer alignment?
In survey-based research, open-ended questions are frequently included, and the responses to these questions are typically coded by hand. That is, a real, live person (an analyst) reads each answer and assigns a numeric code (sometimes called a label) to each unique idea in the text. This process is also called Content Analysis, a widely used analytical method favored by intelligence agencies around the world to extract a deeper understanding of content published by rival countries. Open-ended survey questions are used in marketing research, social research, and political research, and all the answers must be coded (or labeled) by a thinking, intelligent human being.
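To make the coding task concrete, here is a minimal sketch of what a hand-coded dataset might look like. The code numbers, labels, and answers are hypothetical, invented purely for illustration; they are not the code book used in this study.

```python
# Hypothetical code book: each numeric code stands for one idea.
code_book = {
    1: "Inflation / cost of living",
    2: "Unemployment / job market",
    3: "Housing affordability",
    4: "Government debt / spending",
}

# Each open-ended answer can carry several codes, one per unique idea mentioned.
coded_answers = [
    {"respondent": "R001",
     "text": "Prices keep rising and rents are out of control.",
     "codes": [1, 3]},
    {"respondent": "R002",
     "text": "Too many people can't find steady work.",
     "codes": [2]},
]

for row in coded_answers:
    labels = [code_book[c] for c in row["codes"]]
    print(row["respondent"], "->", labels)
```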
Human coding of open-ended responses is labor-intensive and very expensive, so we decided to see whether ChatGPT-4 could accurately assign codes (or labels) to the answers to open-ended questions. As a point of comparison, we used a Decision Analyst Deep Learning Model to code the same dataset.
We asked Nuance (Decision Analyst's coding and text analytics subsidiary) to assign human codes (or labels) to 2,000 answers for the open-ended question: “In your opinion, what are the major economic problems in your country? Please give as much detail as possible.” The data included responses from the US, Canada, India, UK, Australia, New Zealand, and the Netherlands. The responses were all provided in English.
Then we supplied ChatGPT-4 with the text of the human-produced codes (as the prompt) and asked it to assign codes or labels to the same 2,000 answers. Next, we trained our Deep Learning Model on a random sample of 500 human-coded answers and then randomly selected 1,000 human-coded answers from the 2,000-answer dataset (excluding the 500 records chosen for training). Our Deep Learning Model then coded the answers in those 1,000 records. ChatGPT-4 and Decision Analyst’s Deep Learning Model yielded the following results for the same 1,000 answers. The percentages in the chart below treat the human-coded results as the “Gold Standard” (that is, 100% correct).
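For readers who think in code, the sketch below illustrates this study design. The prompt wording, variable names, and random seed are our own assumptions for illustration, not the exact prompt or sampling procedure used in the study.

```python
import random

def build_prompt(code_book: dict, answer: str) -> str:
    """Assemble a coding prompt that lists the human-produced codes (hypothetical wording)."""
    code_lines = "\n".join(f"{num}: {label}" for num, label in code_book.items())
    return (
        "Assign one or more of the following codes to the survey answer.\n"
        f"Codes:\n{code_lines}\n\n"
        f"Answer: {answer}\n"
        "Return only the code numbers that apply."
    )

random.seed(0)  # assumption: any fixed seed, for reproducibility

all_ids = list(range(2000))            # indices of the 2,000 human-coded answers
random.shuffle(all_ids)
train_ids = set(all_ids[:500])         # training sample for the deep learning model
holdout_ids = set(random.sample(       # 1,000-answer holdout scored against the human codes
    [i for i in all_ids if i not in train_ids], 1000))
```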
Clearly, ChatGPT-4 performed better overall than Decision Analyst’s Deep Learning Model. However, the performance gap narrowed for codes or labels with higher incidence: the sensitivity comparison shifted from a significant "win" for ChatGPT-4 (51% versus 19%) to Decision Analyst's Deep Learning Model outperforming ChatGPT-4 by 3 percentage points (75% versus 72%) for codes with at least 10% incidence.
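As a rough illustration of the metrics behind this comparison, the sketch below computes per-code incidence and sensitivity (recall against the human "gold standard"). The data structures and variable names are hypothetical, not taken from the study.

```python
def per_code_metrics(human, model, code):
    """Incidence and sensitivity for one code.

    human, model: lists of code sets, one set per answer.
    Sensitivity = share of answers the humans gave this code
    that the model also gave this code.
    """
    n_answers = len(human)
    has_code = [i for i in range(n_answers) if code in human[i]]
    incidence = len(has_code) / n_answers
    if not has_code:
        return incidence, None
    hits = sum(1 for i in has_code if code in model[i])
    return incidence, hits / len(has_code)

# Example (hypothetical inputs): restrict the comparison to codes with >= 10% incidence.
# high_incidence_codes = [c for c in all_codes
#                         if per_code_metrics(human_codes, model_codes, c)[0] >= 0.10]
```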
We expected ChatGPT-4 to outperform Decision Analyst’s Deep Learning Model, since it was trained on a massive corpus of text, teaching it to understand both the text responses and the meaning of the codes or labels. Indeed, the ChatGPT-4 embedding vectors (the numeric representation of the meaning of the text) provided most of the advantage. To demonstrate this point, we trained another Decision Analyst deep learning model using ChatGPT-3.5 embedding vectors as predictors. This second deep learning model performed admirably and significantly outperformed ChatGPT-4 in sensitivity for higher-incidence codes.
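A minimal sketch of this second approach is shown below, assuming an OpenAI embedding model (text-embedding-ada-002 here) as a stand-in for the ChatGPT-3.5 embedding vectors and a small Keras multi-label classifier. The layer sizes and training settings are illustrative assumptions, not Decision Analyst's actual configuration.

```python
from openai import OpenAI   # assumption: OpenAI Python SDK v1.x
import numpy as np
import tensorflow as tf

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    """Fetch embedding vectors for a batch of open-ended answers."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

def build_classifier(n_codes, dim=1536):
    """Multi-layer neural network with one sigmoid output per code (multi-label)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(dim,)),      # ada-002 embeddings are 1,536-dimensional
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_codes, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Usage (hypothetical): X_train = embed(train_texts); y_train is a 0/1 matrix of
# human code assignments; clf = build_classifier(y_train.shape[1])
# clf.fit(X_train, y_train, epochs=20, batch_size=32)
```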
None of the AI and deep learning systems is perfectly accurate: neither system produces results that perfectly align with the human-produced results.
We should point out that the human-produced code assignments (based on a code book that was also human-produced) were assumed to be “truth.” One might argue that human coders are subject to error and bias. Then again, one might suspect the same flaws in ChatGPT-3.5 and ChatGPT-4.
About the Author
John Colias ([email protected]) is a Senior VP Research & Development at Decision Analyst. He may be reached at 1-800-262-5974 or 1-817-640-6166.