Do your strategy consultants beat ChatGPT?
Recently, a post on LinkedIn written by Aku Nikkola caught my attention.
On September 22, 2023, Fabrizio Dell'Acqua et al. published an interesting experiment, funded by Harvard Business School, called Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.
The aim of the study was to understand how AI integration might reshape the traditional workflows of high human capital professionals.
Although the experiment design is, in my opinion, well-thought-out and understandable, I’m afraid there are some analytical weaknesses that make the study results and their interpretation questionable.
Let me explain what I appreciated as well as my concerns.
Scientific management implies logic, perspective, and creative thinking
Every scientific study in the business field is a tough undertaking because of the irregular laws that govern this domain. This experiment is no exception. For this reason, I found the subject of the study very appropriate to the times we are in, and the way the authors designed the experiment seems equal to the challenge.
Now, without repeating what the authors wrote (I encourage you to read the original paper, see the link above), I found it really smart to include an “outside the frontier” concept to measure the ability of ChatGPT to “reason” beyond its capabilities. Really brilliant.
Another clever choice was to evaluate the variation in the content generated by the subjects, both AI and human. That is, how diverse the answers an individual produces to the given tasks are with respect to the answers of the other subjects.
Statistics: The foundation of scientific management
One couldn’t expect less than this, given the solid educational background this team of authors exhibits, which gives credit to their work. I do, however, find a weakness in the composition of the team: none of its members has a qualified statistics background. It is for this reason that I believe the study conclusions are not as solid as one may think.
Professor Carlo Lauro might offer an authoritative word on the statistical issues I am about to raise.
First off, although less of a statistics issue, I wonder whether MBA students are qualified to judge the answers of professional consultants like the Boston Consulting Group (BCG) employees enrolled in the experiment.
This experiment was conducted with BCG employees. Therefore, it can hardly be generalized to the “Knowledge Worker Productivity and Quality” of the title. At best it could speak to the productivity and quality of the work of BCG consultants, although my next point suggests being cautious even with this sort of internal generalization.
Scientific management implies rigor
The Quality of the respondents’ answers is the primary outcome variable of the experiment. I found it worrisome how the Quality score was derived.
The authors write:
To quantify this quality, we employed a set of human graders to evaluate each question that participants didn’t leave unanswered. Each response was evaluated by two human graders [editor’s note: from BCG or MBA students]. We then calculated the mean grade assigned by humans to each question. This gave us 18 dependent variables (one per each question). We subsequently averaged these scores across all questions to derive a composite “Quality” score.
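To make the aggregation concrete, here is a minimal sketch of that two-step averaging for a single participant. All the grades below are invented for illustration; nothing here comes from the paper’s data.

```python
import numpy as np

# Hypothetical grades for one participant: 2 graders x 18 questions,
# on the study's 1-10 scale (values invented for illustration).
rng = np.random.default_rng(0)
grades = rng.integers(1, 11, size=(2, 18))

# Step 1: per-question mean of the two graders' scores
# (the paper's 18 dependent variables, one per question).
per_question = grades.mean(axis=0)

# Step 2: average across all 18 questions -> composite "Quality" score.
quality = per_question.mean()
print(f"Composite Quality score: {quality:.2f}")
```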
Comparing the R² coefficients of Tables 1 through 3, which concern the Quality, Completion, and Timing of the Inside the Frontier tasks, the R² coefficients of Table 1 are surprisingly larger than the same coefficients of Tables 2 and 3.
This leads me to think that the construct Quality was treated in a way that induced multicollinearity in the regression model. Results inflated by multicollinearity should be treated accordingly, because multicollinearity hides the true contribution of each variable to the outcome.
Again, this matter of multicollinearity is just an assumption of mine. Perhaps I am wrong. The good thing, in case I’m right, is that multicollinearity can be corrected, and comparing results before and after removing it would make the experiment even more engaging. The interpretation results may of course be different, but I’m sure a team of smart authors like Dell’Acqua et al. will easily turn a little stumble into an achievement of great value.
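Were the data available, this assumption would be easy to check with a standard diagnostic such as the variance inflation factor (VIF). A minimal sketch, run on invented data rather than the study’s actual regressors:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical design matrix (all values invented): a treatment dummy
# plus two per-question averages made deliberately collinear.
rng = np.random.default_rng(1)
n = 200
treated = rng.integers(0, 2, size=n)
q1 = rng.normal(5.0, 1.0, size=n)
q2 = 0.9 * q1 + rng.normal(0.0, 0.3, size=n)  # nearly redundant with q1

X = pd.DataFrame({"const": 1.0, "treated": treated, "q1": q1, "q2": q2})

# A VIF above ~10 is a common rule of thumb for problematic
# multicollinearity (the constant term can be ignored).
for i, col in enumerate(X.columns):
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```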
Following on the use of regressions, I wonder whether the linear regression model is appropriate. The models of this experiment blend dummy independent variables with continuous ones (like the dependent variable Quality). Perhaps a logistic-type regression model could be more suitable; one way to operationalize this idea is sketched below.
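Since Quality is a bounded score rather than a binary outcome, a plain logistic regression would not apply directly. One logistic-flavoured alternative is a fractional logit: rescale Quality to [0, 1] and fit a GLM with a binomial family and logit link. A minimal sketch on invented data (the variable names and effect sizes are assumptions of mine, not the paper’s):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data (invented): Quality on the 1-10 scale and a
# treatment dummy for AI use.
rng = np.random.default_rng(2)
n = 200
treated = rng.integers(0, 2, size=n)
quality = np.clip(rng.normal(6.0 + treated, 1.5, size=n), 1.0, 10.0)

X = sm.add_constant(pd.DataFrame({"treated": treated}))

# Baseline: ordinary least squares, as in the paper's tables.
ols = sm.OLS(quality, X).fit()

# Alternative: rescale Quality to [0, 1] and fit a fractional logit
# (GLM with binomial family and logit link).
y01 = (quality - 1.0) / 9.0
frac = sm.GLM(y01, X, family=sm.families.Binomial()).fit()

print(ols.params)
print(frac.params)
```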
Of lesser relevance, some qualified authorities in the field of experimental statistics discourage answer scales with an even number of categories, like the 10-point scale (1-10) of this experiment. Scales with an odd number of categories, such as 0-10 or 1-7, are preferred.
Also, averaging the mean grades of 18 questions of different relevance and complexity may benefit from the introduction of weights, as sketched below.
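For instance, with hypothetical weights reflecting each question’s relevance (both the grades and the weights here are invented for illustration):

```python
import numpy as np

# Hypothetical per-question mean grades (18 questions, 1-10 scale)
# and relevance weights, both invented for illustration.
rng = np.random.default_rng(3)
per_question = rng.uniform(1, 10, size=18)
weights = rng.uniform(0.5, 2.0, size=18)

# np.average normalizes the weights, so they need not sum to one.
weighted_quality = np.average(per_question, weights=weights)
print(f"Weighted Quality score: {weighted_quality:.2f}")
```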
Scientific management learns from outliers
Finally, regarding the results of the experiment: Figure 3 is interesting.
It is awesome how high a small number of BCG respondents scored without, or with very limited, use of AI help (dots inside the red shape).
Cool. The best human minds beat the present generation of LLMs.
From the perspective of Mr. Christoph Schweizer, however, I’d like to know more about who these BCG employees are, their backgrounds and characteristics, and how to transfer their abilities to their peers.
Closing
Aku, thank you for bringing this experiment to my attention.
I really enjoyed the read, and I wish to see more studies like this one, with large samples of respondents drawn from the field under examination, as opposed to small samples of students.
Big companies, this is your call to analytical action…
#Management, #Marketing, #Data, #Science, #Consulting, #BCG, #McKinsey, #Bain, #Innovation, #ChatGPT, #AI