The Stability of Deterministic Outputs — Why Code Generation Tasks Show the Disparity in Trust Among GenAI Use Cases
Overpromising and underdelivering.
This three-word statement encapsulates the misalignment between AI practitioners and society at large (consumers, companies, and enterprises alike). Listening to the headlines, the average person encounters confident statements about the life-altering changes Artificial Intelligence is unveiling in society, alongside catchy marketing claims attached to reference customers. On the other side, critics take to blogs and social media platforms to pick apart AI capabilities, dampen the messages, and cast the practice as a whole in a negative light, whether over its capabilities, threats to careers, or existential fears. With a 10-year career in AI, I have immersed myself in this ongoing conversation and experienced first-hand, as a Product Manager, the benefits, downfalls, and underlying assumptions made to welcome an AI solution into the market, starting from my years at IBM Watson, where customers asked for the "Watson that won Jeopardy!" when in reality they received a blank corpus to train.
Overpromising and underdelivering.
Over the past few days, I’ve noticed an uptick in news stories covering one use case for Generative AI: code generation. The Financial Times headlined:
AI-powered coding pulls in almost $1bn of funding to claim ‘killer app’ status
which I heard about on the TechMeme Ride Home, or a post from Andy Jassy on X (see a longer press release here):
One of the most tedious (but critical tasks) for software development teams is updating foundational software. It’s not new feature work, and it doesn’t feel like you’re moving the experience forward. As a result, this work is either dreaded or put off for more exciting work — or both. Amazon Q, our GenAI assistant for software development, is trying to bring some light to this heaviness. We have a new code transformation capability, and here’s what we found when we integrated it into our internal systems and applied it to our needed Java upgrades: — The average time to upgrade an application to Java 17 plummeted from what’s typically 50 developer-days to just a few hours. We estimate this has saved us the equivalent of 4,500 developer-years of work (yes, that number is crazy but, real).
Examining GitHub Copilot within this context underscores the benefits of code generation and the sheer reality that code generation is a stellar use case for Generative AI. I performed a bit of market research and turned to broader market analysts:
The landscape of software development is undergoing significant transformations, with generative AI poised to revolutionize traditional coding practices. According to a recent report by IBM, generative AI is expected to reduce coding time by up to 30% by 2024, underscoring the technology’s potential to streamline software development processes. This forecast aligns with the broader industry trend towards automating repetitive tasks, thereby allowing developers to focus on more complex and creative aspects of software creation. — markets.us
The 21.5% Compound Annual Growth Rate (CAGR) leading to a $287.4B market size in nearly 10 years is impressive! So why is code generation one of the top use cases for Generative AI? In short, code is bland and predictable. Code, fundamentally, runs on logic programmed by humans, and a large percentage of developers' time is arguably consumed by continuously debugging, patching, and refactoring it as simple maintenance. Code linters poke holes in the structural integrity of code, compiling and executing code quickly reveals bugs, and entire industries exist around exploiting human error, alongside those dedicated to protecting against it. Writing code requires little creativity (for argument's sake, this claim generalizes much of software development, fully appreciating that software engineers are problem solvers, think logically, and can harness creativity to drive logical solutions) and requires little deviation from standard patterns. Furthermore, code inherently provides a self-check mechanism: it has to run and achieve the desired goal to count as a success.
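A minimal sketch of that self-check property, assuming a hypothetical generated helper function and an equally hypothetical test case: generated code either runs and passes the check or it does not, and the signal is immediate and objective.

```python
# Sketch: generated code is "self-checking" because it either executes and
# satisfies a test, or it fails. The function name and test case below are
# hypothetical, purely for illustration.

def run_candidate(candidate_source: str) -> bool:
    """Execute a generated function definition and check it against a known case."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # syntax errors surface immediately here
        assert namespace["slugify"]("Hello World") == "hello-world"  # behavioral check
        return True
    except Exception:
        return False  # any failure is an unambiguous rejection signal

# A model-generated candidate (hypothetical output):
generated = '''
def slugify(text):
    return "-".join(text.lower().split())
'''

print(run_candidate(generated))  # True -> the output achieved the desired goal
```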
Peering into the training data for code-focused Large Language Models, libraries maintain detailed documentation for the purpose of making the developers who implement them successful. Code forums such as Stack Overflow exist for community-sourced problem solving, whereby, again, common problems can be aggregated, vectorized, and associated with common coding challenges that aid in the code generation process. Early signs of coding assistants appeared back in 2017 at a TechCrunch hackathon. The ability to "self-debug code" was a topic I posted about in one of my earliest blogging attempts on Developing Innovation:
During this year’s hackathon, the 2nd place winner, CodeCorrect, aided in solving the debugging problem, and I found this project fascinating, especially from a new developer's standpoint. What does the program do? Essentially, the program adds a <script> into your program and, when an error is encountered, will submit a request to the Stack Overflow API with the specific error, and not only return the answer but implement the changes in your code to fix the problem. If you watch the video on the TechCrunch article for CodeCorrect, you will see the magic come to life.
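Purely as an illustration of the pattern CodeCorrect demonstrated, not its actual implementation, a rough sketch: catch a runtime error and look the message up through the public Stack Exchange search API. The error and the lookup logic here are my own assumptions.

```python
# Rough sketch of the "self-debug" pattern described above (not CodeCorrect's code):
# catch a runtime error, then search Stack Overflow for questions matching it.
import requests

def lookup_error(error_message: str) -> list[str]:
    """Return titles of Stack Overflow questions that match an error message."""
    resp = requests.get(
        "https://api.stackexchange.com/2.3/search/advanced",
        params={
            "q": error_message,
            "site": "stackoverflow",
            "sort": "relevance",
            "order": "desc",
            "accepted": "True",  # only questions with an accepted answer
            "pagesize": 3,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [item["title"] for item in resp.json().get("items", [])]

try:
    {}["missing_key"]  # deliberately trigger an error
except Exception as exc:
    for title in lookup_error(f"python {type(exc).__name__}: {exc}"):
        print(title)
```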
Now, with the recent announcement of context-free grammar support within OpenAI, offered as Structured Outputs, alongside prompting techniques such as few-shot learning, code generation has become a “safe” and reliable output with limited “hallucinations.”
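As a minimal sketch of what Structured Outputs constrains, assuming the OpenAI Python SDK and a model that supports the feature: the response is forced to match a JSON schema, so the shape of the output is deterministic even when the content is not. The schema fields below are my own illustration, not from the announcement.

```python
# Minimal sketch of Structured Outputs: the completion is constrained to a JSON
# schema. Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the schema fields are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the helper function described by the user."},
        {"role": "user", "content": "Write a helper that retries an HTTP GET three times."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "code_suggestion",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "language": {"type": "string"},
                    "code": {"type": "string"},
                },
                "required": ["language", "code"],
                "additionalProperties": False,
            },
        },
    },
)

print(completion.choices[0].message.content)  # guaranteed to match the schema
```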
The Roadmap to Trust
As I’ve started applying the Roadmap to Trust lens across Generative AI use cases (more on that in a moment), and particularly as I start speaking with customers in the Contact Center industry, what has become abundantly clear is a crevasse distinctly separating more deterministic generative outcomes from probabilistic ones when determining which use cases fall under "Simple Generation" (Phase 1) versus "Recreation" (Phase 2). It should be noted that I am highly pessimistic about use cases expanding into Personalization and Memory (Phase 3) and above for natural-language-based AI (as opposed to recommendation engines, which clearly fit in Phase 3), so for purposes of brevity and clarity, only the former two phases will be discussed. Let me expand upon this further.
In order to properly juxtapose the two opposing sides, we must add another use case as a companion to code generation: enter content summarization.
Content summarization is a highly effective Generative AI use case, namely due to the popularization and adoption of a technique called Retrieval-Augmented Generation (RAG). RAG takes a corpus of knowledge (PDFs, websites, documents), indexes it, and vectorizes it within a vector database. With RAG, the model's responses are constrained to that pre-built corpus of knowledge rather than to the vast training data underpinning the LLM. Thus, content summarization can be deemed stable. This stability has led to the increased use of Copilot within Microsoft Bing and the Search Generative Experience (SGE) within Google. These augmented search capabilities even come with citations, another key feature of RAG whereby the model cites the passages and documents used in summarization, serving both as a quick debugging mechanism and as a modality for expanding the initial inquiry with further contextual knowledge. I shared thoughts about this in my most recent post: The Very Hungry LLM.
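To make the retrieval step concrete, here is a toy sketch of the RAG pattern: index a small corpus, retrieve the passages most relevant to a query, and hand only those passages, with their sources, to the LLM for summarization. The corpus, the bag-of-words vectors, and the prompt wording are all assumptions to keep the example self-contained; real systems use learned embeddings and a vector database.

```python
# Toy sketch of RAG retrieval and grounding. Bag-of-words vectors stand in for
# learned embeddings so the example runs without any external service.
import math
from collections import Counter

corpus = {
    "faq.pdf#p3": "Refunds are processed within five business days of approval.",
    "faq.pdf#p7": "Warranty claims require the original proof of purchase.",
    "site/shipping": "Standard shipping takes three to seven business days.",
}

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    q = vectorize(query)
    ranked = sorted(corpus.items(), key=lambda kv: cosine(q, vectorize(kv[1])), reverse=True)
    return ranked[:k]

question = "How long do refunds take?"
context = "\n".join(f"[{source}] {text}" for source, text in retrieve(question))
prompt = (
    "Answer using only the passages below and cite the source in brackets.\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt)  # this grounded prompt, not the model's full training set, bounds the answer
```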
Code generation and summarization are clearly stable use cases for generative AI. These approaches limit the level of hallucinations, provide the intended output, and are backed by self-debugging mechanisms (or the equivalent) to validate the results. This stands in opposition to pure content generation, which relies on prompt engineering (and multiple iterations thereof) and the answering of generic questions. Those systems are highly probabilistic and creative, working off of billions of parameters and tokens within a Large Language Model as it tries to predict the next set of words based on both attention mechanisms and the preceding context within the prompt (including few-shot learning examples). Adoption within this probabilistic category can be further bifurcated into two groups: (1) those who use LLMs for personal use and (2) those who embed LLMs into customer-facing products. The former is self-explanatory and is covered in both "A Roadmap to Trust" and "Solving the Blank Canvas Problem," which I will leave as a thought exercise for the reader.
In addressing customer-facing products, there is notable hesitation in adopting Generative AI capabilities. For a reference point, I work in the Contact Center space by virtue of joining Nuance Communications. When aiding customers in seeing the value of Generative AI, specifically within an Interactive Voice Response (IVR) application, hesitation arises from the unpredictability of the generative responses: the application exists for self-service, and any lack of "truth" degrades the user experience, customer satisfaction, and trust in the brand. When providing recommendations or answers to questions, for example, there is hesitation to trust these systems based on prior experience and commentary; that trust has been broken and needs to be regained.
This, unfortunately, is a major setback for those seeking to accomplish Artificial General Intelligence (AGI), yet practitioners continue to pursue the concept.
In summary, there exists a significant gap between what is promised by AI and what is actually realized, driven largely by higher aspirations, notably AGI, and the associated degradation of trust among the populace.
I found an article entitled “Document Extraction is GenAI’s Killer Use App” by Uri Merhav in Towards Data Science, who made this elegant point:
The place where hallucinations occur is when you ask it to answer factual questions and expect the model to just “know” the answer from its innate knowledge about the world. LLMs are bad at introspecting about what they know about the world — it’s more like a very happy accident that they can do this at all. They weren’t explicitly trained for that task. What they were trained for is to generate a predictable completion of text sequences. When an LLM is grounded against an input text and needs to answer questions about the content of that text, it does not hallucinate. If you copy & paste this blog post into chatGPT and ask does it teach you how to cook an American Apple Pie, you will get the right result 100% of the time. For an LLM this is a very predictable task, where it sees a chunk of text, and tries to predict how a competent data analyst would fill a set of predefined fields with predefined outcomes, one of which is {“is cooking discussed”: false}.
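To make that grounding pattern concrete, a minimal sketch of the extraction setup Merhav describes: the model is handed the source text and asked to fill a fixed set of predefined fields, rather than to recall facts from its training data. The field names and document snippet here are my own assumptions.

```python
# Sketch of grounded extraction: fill predefined fields from supplied text only.
# Field names and the sample text are illustrative assumptions.
import json

ARTICLE = """Overpromising and underdelivering. This post discusses code
generation, RAG-based summarization, and trust in generative AI."""

FIELDS = {
    "is_cooking_discussed": "boolean",
    "main_topic": "string",
    "mentions_rag": "boolean",
}

prompt = (
    "Using only the text between the markers, fill every field below. "
    "Reply with a single JSON object and nothing else.\n"
    f"Fields and types: {json.dumps(FIELDS)}\n"
    "<<<\n" + ARTICLE + "\n>>>"
)
print(prompt)
# A grounded model should return something like:
# {"is_cooking_discussed": false, "main_topic": "generative AI trust", "mentions_rag": true}
```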
In the short term, AI practitioners should focus on hardening the rate of successful implementations using stable, predictable, grounded outcomes rather than chasing more creative generated content. This will automate more mundane and repeatable tasks, shorten the time to goals, and rekindle trust in AI systems. Technological progress takes time: AI was “discovered” back in the 1950s but only rose to popularity in the late 2010s as capability caught up. Just because we aspire to something does not mean adoption is yet feasible. If anything, a middle ground would be using the Leader-Expert model! I look forward to helping customers achieve their AI vision using proven methods, based on both their individual roadmap to trust and the corresponding point on a broader, use case-based roadmap.