The unnerving capabilities of state-of-the-art chatbots and how to use them
You may have heard Arthur C. Clarke's famous quote: "Any sufficiently advanced technology is indistinguishable from magic." In many ways, this captures the current state of generative AI as experienced not only by us end users but also by experts in the field of deep learning. When we start a new chat with a current state-of-the-art Large Language Model (LLM), like GPT-4 or the recently released Gemini Ultra, we are interacting with technology that does things so unreasonably well that it might as well be magic. Sure, the models have obvious limitations and flaws: they can hardly count their own words or reliably do math. Their writing skills are good, but not better than professional authors'. They have inherent biases. And of course, they hallucinate from time to time, making them somewhat unreliable. Yet, despite all this, the core function of these models is absolutely astounding. Having a general-purpose technology that, purely through language, can emulate reasoning and apply knowledge to any arbitrary problem is, indeed, indistinguishable from magic.
However, in my view, all of us who are going to be using these chatbots (and eventually AI agents) for our everyday work have a problem. There is no good template for how to work with these tools, no manual exists, and we lack accurate mental reference points for categorizing the technology. This and more is something that people like Matt Beane and Ethan Mollick have covered extensively. So how do we actually use these AI tools?
I want to find a way to make these tools useful in the most demanding cognitive professional roles possible. The benchmarks for evaluating LLMs do not capture this at all - the most widely used benchmarks are, for example, essentially collections of obscure trivia questions. And the examples that Google, OpenAI and others provide are simple, consumer-focused demonstrations or less complex tasks like writing a piece of code. In fact, since GPT-4 was released in March 2023, I have seen surprisingly little complex, real-life engagement - outside of programmers and machine learning engineers writing code - with either GPT-4 or any other capable AI model.
What I am looking for is a way to understand how these models can make us better in any work role, rather than merely automating the boring parts.
But first, we have to grapple with a problem. Most readers of this article will have tried ChatGPT-3.5, but not any more advanced models like GPT-4 or Gemini Ultra. Part of my mission here is to make you aware that if you are basing your AI expectations on what the free version of ChatGPT can do, then it's time to update your beliefs.
In this article, I will do three things. First, I will show that our baseline expectations about AI are mostly rooted in ChatGPT-3.5 (i.e. "free" ChatGPT), which poses a problem for experimenting further. Second, I will discuss my own heuristic for how to use state-of-the-art AI models professionally. Third, I will go through a detailed workflow of my own to illustrate how to interact with a state-of-the-art model in a deeper and more complex way, treating it as a true assistant.
The ChatGPT Conundrum
When ChatGPT came out in late 2022, it seemed like magic at the time. It used the new GPT-3.5 model that OpenAI had released the same month, and it shocked most people with the quality of its writing. Arguably, the reason this was a watershed moment was not that there was a significant capability jump between GPT-3 and GPT-3.5 - it was that anyone who bothered to sign up could try out ChatGPT for free. Because of this massive success - and because most of us have tried the free version of ChatGPT at least once - our mental benchmark for what AI can do is rooted in this experience.
This means that the vast majority of people think of AI capabilities with reference to ChatGPT-3.5, simply because we all tried it. And while GPT-3.5 was impressive in its ability to generate language, it did not exhibit any sign of actual intelligence or advanced reasoning. In practice, this means that anyone who has tried ChatGPT-3.5 but no other model has their expectations anchored to the performance of GPT-3.5.
This is a problem. Organizations or teams that want to experiment with using AI productively now have to overcome this initial expectation to get people thinking about more advanced uses. People whose expectations are rooted in GPT-3.5 mostly think about how to offload tasks to the AI, not how to use it for cognitive work. If we are mainly thinking about having the AI write emails for us, we are not getting close to the potential of state-of-the-art models, let alone future ones.
It is hard to give very clear examples of the performance difference between GPT-3.5 and GPT-4 (and Gemini Ultra), mainly because realistic use-case demonstrations are inherently subjective. In one instance just last week, I showed students GPT-3.5 versus GPT-4 on a reasonably complex workshop task in one of my diploma classes. Students were asked to give recommendations to a hypothetical retail firm based on fictional interview quotes, and in the second part of the class to explain how new information led them to update their beliefs. A core part of the exercise was seeing that new information shifted the relative value of the other interviewees. Remarkably, GPT-4 could very clearly see the motivations of the fictional interviewee who was not to be trusted, while GPT-3.5 did not pick up on this. These kinds of examples are becoming common, but they always require in-depth, realistic engagement with the model.
The takeaway is this: Free yourself from ChatGPT-3.5 as the benchmark for your AI expectations. Try Microsoft Copilot, Gemini Advanced, or ChatGPT Plus and play around with state-of-the-art models to get a sense of what they can and cannot do. They really are a different class of models compared to GPT-3.5.
Three ways of interacting with the AI
I have read and enjoyed the work by Lodge and colleagues, who have categorized students' use of generative AI tools. But their approach focuses on higher education, not general-purpose professional use. So, informally, I have begun operating with a few key categories of use that are inspired by Lodge et al. but differ in practice. These categories are heuristic devices for thinking about how to use AI.
Offloading tasks to the AI
When you offload work to an AI, you basically automate part of your work. You could write that long, formal e-mail yourself, but ChatGPT does it better and faster. You could make a one-page summary of your research project for an external stakeholder, but Copilot does it faster. You could read a lengthy report, but any AI of your choice can give you a one-page executive summary with a focus of your choice.
Most of us think about offloading when we are thinking about AI tools: The ability to automate boring or repetitive tasks, or tasks that give us no joy and take time away from more important or more fun tasks.
For many offloading tasks, good old ChatGPT-3.5 will do fine. It is not going to make your e-mail game stellar, but it will do the job. It can summarize a decent amount of text. It can also write the song for the Christmas party. Of course, more capable models with more tools will be able to do more. I can give GPT-4 a spreadsheet with a ton of data and have it quickly run through some simple analyses rather than doing them myself. Or I can have Gemini Ultra search the web for specific news items to update me on a topic. But fundamentally, this category of use is about automating tasks we prefer not to do ourselves.
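For readers who would rather script this kind of offloading than paste text into a chat window, here is a minimal sketch using the OpenAI Python SDK. The model name, file name, and prompt wording are illustrative assumptions on my part, not a prescription:

```python
# Minimal sketch: offloading a report summary to GPT-4 via the API.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Hypothetical input file; substitute your own lengthy report.
with open("lengthy_report.txt", encoding="utf-8") as f:
    report = f.read()

response = client.chat.completions.create(
    model="gpt-4",  # or whichever state-of-the-art model you have access to
    messages=[
        {"role": "system", "content": "You are a concise executive assistant."},
        {
            "role": "user",
            "content": "Summarize the following report as a one-page "
                       "executive summary, focusing on key risks:\n\n" + report,
        },
    ],
)

print(response.choices[0].message.content)
```

The point is the same as in the chat interface: the focus of the summary is a parameter you control, and the drudgery is handed off.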
Getting advice from the AI
In many situations, a key advantage of the best AI models is that they are better experts than you on most topics you are not explicitly trained in. Thus, asking the AI for advice - not about facts, but about concepts as applied to a real problem - can actually be extremely beneficial.
For example, I ran a demonstration of GPT-4 in a smaller company in the fall of 2023. I tested out how GPT-4 would advise the company to handle suppliers that would be unable to report CO2 emissions, which can be a challenge for smaller companies without the necessary capabilities. GPT-4 gave a pretty good plan, in line with what I would say myself. But when I presented, I learned that the company had had the exact problem that I had asked GPT-4 about, and they had done more or less what GPT-4 had said - but it had taken them months to figure it out because they did not have in-house expertise. GPT-4 had given me the answer in a couple of minutes.
To me, this is an instructive example of how these models can provide value as advisors without producing anything directly. The value here is the additional perspective or advice that we humans can take as input if we want, not the actual content itself. And if we are primed to think of AI chatbots as primarily useful for producing text to offload, it is harder to see how AI acting as an advisor can be useful in other situations.
Note that AI advisory largely avoids the hallucination problem that is a key limitation of language models. Advisory is not about recalling precise facts or looking up information in a large table; it is about asking the AI to draw on its conceptual knowledge and apply it to a specific situation. This plays to the LLM's strengths rather than its weaknesses. The key is not to rely on the LLM for precise facts.
Working with the AI as an assistant
Other people, like Ethan Mollick, have already shown that state-of-the-art AI systems act much like interns or assistants. But what does this mode of interaction look like? For me, working with an AI in assistant mode means constantly iterating on the AI's output, using it in different capacities for different types of sub-tasks, and weaving together part-human and part-AI work to produce some output. This is what Dell'Acqua and colleagues refer to as the "cyborg" and "centaur" modes of working. Centaurs tend to divide the work between AI and human, while cyborgs integrate human-AI collaboration throughout their workflow.
While the distinction is probably highly task-dependent (even if Dell'Acqua and colleagues found their participants differed on identical tasks), the better question for me is how this mode of interaction looks in practice. Using AI as an assistant is about integrating human and AI expertise where it makes sense, and delegating tasks to either human or AI when that, in turn, is the better solution.
In my mind, there are two keys to using AI as an assistant productively. First, it is important to iterate rapidly on both the human and AI sides, at least for some parts of the workflow. Second, the human should start with at least a cursory idea of the AI strategy, meaning 'how should I roughly be thinking about what I can do with AI in this workflow?'.
I'll illustrate with an example that is fairly complex but also quite representative of the work I do on a daily basis as a researcher. The task at hand is simple: Develop a 1000-word extended abstract that introduces a novel theory of individual AI usage. Let's use good old trusty GPT-4 for this task.
Illustrative AI workflow: Developing a new theory and an extended abstract using GenAI
Let's start with the basics: What should our AI strategy be? In this case, we are using GPT-4 with its suite of tools, so we have a reasonable sense of what it can and cannot do well.
In addition, we should be aware of the standard limitations: it can hallucinate, it is biased in some ways, and it requires context to provide output that is relevant to our situation. With that, let's formulate a simple strategy:
I will use the AI as an assistant to brainstorm and co-develop the core tenets of the theoretical framework by discussing directly with it. The AI will help check for relevant references and sharpen the argument and structure. The human will do the final writing and maintain control over decisions about theorizing. As we move towards finalizing, the human will do more of the work to ensure the final output aligns with expectations.
This is not rocket science, but it is a nice way to think about how to use the AI. Let's begin! You can view the chat in its entirety by following this link.
So we have given GPT-4 an interesting challenge. Let's see how it starts out.
We are off to a good start. While I like the tactical vs. strategic dimension, the autonomous vs. collaborative dimension does not seem very interesting if we are trying to theorize something like ChatGPT usage (because it is not autonomous, yet).
We can ignore some of these recommendations as we are going to go with other axes, but from the human side, I think it is already highlighting some theories that are relevant, if not new or radical.
Yes, real-world examples are nice. Sure, then we can write the extended abstract. This is not a complex process, but it is nice that GPT-4 spells out how to deconstruct the workflow. The challenges at the bottom are good, even if they are not new to experienced scholars.
The follow-up questions arise because I have asked GPT-4 to ask me follow-up questions when it thinks it is appropriate. And it is indeed appropriate here!
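For readers who want to reproduce this behaviour, the instruction can be baked into a system prompt (or stated once at the start of the chat). The wording below is my illustrative reconstruction, not the exact instruction I used:

```python
# Illustrative sketch: a system prompt that invites follow-up questions.
# The prompt wording is an assumption for demonstration, not the original.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a research assistant helping me develop a theoretical framework. "
    "When my request is ambiguous, or when clarification would improve your "
    "answer, ask me follow-up questions before answering."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Propose a 2x2 typology of individual "
                                    "generative AI usage at work."},
    ],
)

print(response.choices[0].message.content)
```

In the ChatGPT interface itself, the same effect can be achieved by putting an instruction like this in the Custom Instructions settings.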
Getting this initial answer from GPT-4 has served two purposes. First, it has given me some ideas for interesting pathways. It is easier to think about what we want or do not want once we see a first draft, so GPT-4 has leapfrogged the human equivalent, which would be making an early sketch of the theory. Second, it has given us a basis on which to continue iterating with GPT-4, since we can simply inject our ideas into the process. As noted earlier, I think the axes should be different. Let's play to the AI's strengths and let it identify the implications.
Let's see how GPT-4 reacts.
This is not bad. It is not Nobel-prize material, but it is nevertheless useful as a theoretical heuristic. As the human in the loop, this is where I say "Yup, this direction works" and continue iterating on the basis of these ideas.
At this point, I call up the Consensus GPT inside the conversation. This GPT is hooked into a few hundred million academic texts, so I strike up a conversation with this specialized GPT to get my bearings on relevant theories and papers.
Here, I move into a kind of search pattern with the AI, trying to establish whether this is a good path forward and, if so, how I can relate my theorizing to existing papers. Consensus whips up D'Amour et al., which I had not seen before. For demonstration purposes, I will go ahead and say that this perspective is relevant. Note that the human role here is managerial in nature: I make decisions based on information provided by the AI, but I do not allow the AI to make decisions by itself. Of course, the AI is, in a sense, constraining my perspective by offering me a certain selection of the world - and navigating this still requires human expertise. If I were an inexperienced researcher, I would not know whether one theory was more interesting than another. But humans with expertise can quickly identify whether the AI is serving up good suggestions, and on that basis decide whether to swing the AI in a different direction. I could, for instance, have told GPT-4 that the range of theories it suggested was rubbish and set a specific direction myself, but that requires that the human in charge has expert knowledge to mobilize.
Next, I ask GPT-4 to relate it to our dimensions.
Okay, so far so good. As the human in charge, I make sure to open the D'Amour et al. paper and give it a read to double-check the AI's interpretation (which in this case seems correct). By this point, we probably have a working early theory that we can develop further.
Now comes a tricky part: the writing. So far, I have been going back and forth with GPT-4 to have it give me ideas that I can evaluate and use as the basis for further iteration. With writing, it becomes a little different. I know that GPT-4 is good enough at standard writing, but for an actual academic article, it is not entirely up to snuff. Let's start by drafting a structure. I call up another GPT for this purpose.
Write For Me is one of the many GPTs that specialize in writing in a precise way. Since GPT-4 by default is not very precise in following writing instructions, it is often necessary to call up a specialized GPT.
This is a fairly solid structure for writing the actual extended abstract. Depending on your writing style, you will either take this structure and make it your own before writing it out, take it as inspiration before freestyling your way through, or have the AI develop an early draft. It is usually a good idea to do this section by section. Below, I have inserted the final draft of the AI version, before I have done any further work on it.
From here, the utility of the AI is limited. You can see that the draft is missing some things, like a good and thorough explanation of the axes, and it has made the dreaded bullet-point explanation of D'Amour et al. instead of explaining in simple terms how their concepts matter. It is easier for a human to go through the draft and revise (or rewrite entirely). Any experienced researcher can see that this is not a finished draft, but it is a pretty good starting point.
That AI can get me this far on a task that is quite specialized and has to follow an exact format (whether articles for journals or abstracts for conferences) is remarkable. This back-and-forth, with the AI acting as an assistant, makes many things much faster even though the human is still in control. This example is useful because it is something I do quite often (and thus can claim at least some expertise in), so I can clearly highlight when and how I draw on the AI for some things and not others. And I know, because of my expertise, when I should overrule the AI and set a new direction, or take the product and make it my own from there. This is, to me, a better way to use LLM assistant tools than thinking of them only as automation tools. But it requires experimentation to get a sense of how this can work in your own workflows.
I should say that this example does not mean it is necessarily a good idea to have AI write research articles. But if AI-written articles become indistinguishable from human-made ones, what then? If I publish a theoretical article based on empirics, where I co-developed the theoretical part with an AI but the writing is my own, is that a problem? And if not, who would have known? I find it fascinating that we are surprisingly close to having off-the-shelf AI systems capable of mimicking at least some of the abstract theorizing that we researchers thought would be one of the last bastions to fall to automated systems. Yet, here we are.
Extended Abstract: Pure AI-developed version
Introduction
The rapid advancement and integration of generative Artificial Intelligence (AI) into the workplace necessitates a deeper understanding of how individuals utilize these technologies in various work settings. This exploration is crucial for harnessing AI's full potential to enhance productivity, innovation, and collaboration. However, the diverse ways in which generative AI is employed across different tasks and organizational contexts remain underexplored. To address this gap, we propose a novel theory presented as a 2x2 typology that categorizes the use of generative AI at the individual level. This typology is designed to aid managers and practitioners in navigating the complexities of AI integration, ensuring its alignment with strategic goals and collaborative processes. By grounding our theory in the principles of social interdependence and collaboration, we aim to offer a comprehensive framework that not only elucidates the multifaceted applications of generative AI but also fosters a deeper understanding of its implications for work dynamics and organizational outcomes.
Theoretical Background
Ultra-brief Overview of Social Interdependence Theory and Collaboration
Social Interdependence Theory posits that the interactions among group members significantly influence outcomes, shaping the dynamics between cooperative and competitive endeavors. At its core, the theory suggests that the structure of interdependencies among team members determines the direction, magnitude, and quality of interpersonal relations, ultimately affecting performance and satisfaction levels. This theoretical lens provides a foundational perspective for examining the collaborative use of generative AI, highlighting the importance of understanding the interplay between technology and social dynamics in work settings.
Five Elements of Collaboration based on D'Amour et al.
The framework proposed by D'Amour et al. (2005) identifies five critical elements of collaboration: sharing, partnership, power, interdependency, and process. These elements serve as the building blocks for effective collaboration, each contributing uniquely to the collaborative endeavor.
Together, these elements provide a multifaceted understanding of collaboration that is essential for integrating generative AI into work practices in a way that enhances, rather than disrupts, team dynamics and organizational objectives.
Explanation of the Theory and the 2x2 Typology
The novel theory is encapsulated in a 2x2 typology that categorizes the use of generative AI based on the dual dimensions of Purpose of Use (Tactical vs. Strategic) and Degree of Coordination with Other Humans (Low vs. High). This typology aids in understanding the nuanced ways individuals leverage generative AI in their work, offering insights into the design and implementation of AI tools to enhance workplace collaboration and productivity.
Tactical-Low Coordination
This quadrant represents roles like Data Analysts who utilize generative AI for automating routine data processing tasks, thereby enhancing their personal productivity with minimal need for coordination with others.
Tactical-High Coordination
In roles such as Project Managers, generative AI facilitates high levels of team communication and coordination, streamlining project workflows and enhancing the efficiency of collaborative tasks.
Strategic-Low Coordination
Strategy Consultants use generative AI to inform high-level decision-making processes. Here, AI tools provide deep insights into market trends and opportunities, supporting strategic planning with limited direct collaboration.
Strategic-High Coordination
Product Development Teams operating in this quadrant leverage generative AI to foster innovation and strategic planning in highly collaborative settings. AI tools are used to generate new ideas, enhance product design, and formulate market strategies collaboratively.
Discussion
Implications for Management Practice
The 2x2 typology offers valuable insights for managers and organizational leaders on effectively integrating generative AI into work processes. It underscores the importance of aligning AI tools with both the strategic orientation of tasks and the required degree of human coordination. For instance, understanding that a Project Manager operates in the Tactical-High Coordination quadrant can guide the selection of AI tools that enhance team collaboration and communication, as opposed to tools that solely focus on individual productivity.
Future Research Directions
Future research directions should focus on several key areas to build upon the foundation established by the 2x2 typology of generative AI use in work settings. Firstly, empirical testing is crucial to validate the typology's applicability and effectiveness across various industries and organizational contexts, ensuring its broad relevance and utility. Additionally, there is a significant opportunity to explore the design of AI tools specifically tailored to support the distinct needs identified within each quadrant of the typology, particularly in terms of balancing tactical versus strategic purposes and accommodating different degrees of human coordination. Investigating the impact of generative AI on team dynamics, communication patterns, and organizational culture within these quadrants can also provide valuable insights. Such research could offer a deeper understanding of how AI integration influences work environments and contributes to organizational goals, potentially guiding more effective strategies for leveraging AI in collaborative and strategic endeavors.
Conclusion
The proposed 2x2 typology offers a novel framework for categorizing the use of generative AI in work settings, bridging the gap between technological capabilities and human collaboration needs. By aligning AI tool design and implementation with the strategic and coordination requirements of different roles, organizations can maximize the benefits of generative AI, fostering enhanced productivity, innovation, and collaboration.
This extended abstract lays the groundwork for a comprehensive exploration of generative AI's role in the modern workplace, inviting further research and discussion on optimizing AI integration for collaborative and strategic success.