First Do No Harm: The Hippocratic Oath for GenAI Results
The Amazon Web Services (AWS) re:Invent conference was an overwhelming and humbling experience, and I am grateful that I had the privilege of attending. I was humbled by the scale, and at times I felt really small. It’s kind of spellbinding to be a small business owner and tech startup founder traversing a literal ocean of massive companies with stupendous budgets. And I must admit, AWS knows what it’s doing. This conference is a testament to their success – when 52,000 people are willing to pay for the privilege of being sold to, and then literally wait in line at your booth to buy your product? Well, that’s got to be the textbook definition of product market fit.
And yet those of us building with generative artificial intelligence (genAI) still put our pants on the same way each day. We are all struggling with the same pesky problems. Can I still get the value if I keep my data private? How do I get the darn answers to be accurate? How do I explain this answer to a regulator? All of this is about overcoming the major obstacles to adoption – a central one being FEAR.
I’ll be honest and say that I’ve been doing just a smidge of handwaving when it comes to articulating the finer points around responsible AI, large language model (LLM) guardrails, and governance. I mean really, so much jargon and so little time to get in there and make it real in a production software application at the aforementioned small but mighty tech startup. It feels like a big ole word salad. Who’s got time for all this?
If you are like me and are actively building to overcome hallucinations, come over here and join me at my little table in the corner of this dive bar. We are serving word salad. Can’t always trust the results you are getting? Yep, we are serving that too. This pesky little monster of a problem is a problem of reliability – one of the core dimensions of responsible AI (RAI). In our darkest moments, those of us building with genAI every day step back and ask ourselves – have I made it better? Yes, it’s cool. Yes, it’s fun. Yes, it will get me a customer meeting – but did I really add value? Was the juice worth the squeeze? How can I be sure?
To me, that’s where the useful parts of RAI come in – let’s measure and account for the value we believe we have created. Let's establish our value hypothesis and prove to ourselves and to our customers that we consistently delivered, and will continue to consistently deliver, that value in a way that benefits humanity.
Responsible AI is the practice of developing, deploying, and using artificial intelligence systems in ways that are valuable (ethical, transparent, fair, accountable, and beneficial to society) while actively preventing harm.
I won’t spend any time in this article on the other six pillars of RAI; I want to really zoom in on reliability.
Reliability is the core requirement that the results we generate in our genAI applications be trustworthy. They are truthful. They are complete. They make sense. These qualities are so intuitive to us as beings with human intelligence, but how do we make sure that a machine with the ability to mimic human intelligence isn't just mimicking truthfulness? Yeah, it boggles my mind, too.
Veracity is the quality of being true or the habit of telling the truth. High veracity ensures that AI systems provide accurate, factual, and trustworthy responses, which are essential for tasks such as content generation and decision support systems. The term often appears in contexts involving the evaluation of statements, claims, or information to determine their truthfulness.
Reliable genAI results possess the qualities of veracity and robustness, meaning they are both truthful and sturdy. They don’t leave anything important out, within the context of the request. Veracity is interesting as it is dependent on context. We want our results to be complete. But completeness within the context of a request for a one sentence answer is different from the kind of completeness we expect from a dissertation length paper. This is one of the reasons the word “comprehensive” is such a trigger word for me. Comprehensive to whom? In what context? Would my “comprehensiveness” withstand scrutiny in a court of law? But I digress.
There are (at least) five types of veracity:
Sounds hard, right? It is. The good news is we can measure veracity, and we can improve veracity in many ways. However, to improve, we must first measure. Some key veracity metrics:
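To make "measure first" concrete, here is a minimal, illustrative sketch of one toy veracity check – my own invention for this article, not any standard library's metric. It scores how well an answer is grounded in the retrieved source context by asking what fraction of the answer's content words actually appear in that context. Real faithfulness metrics are far more sophisticated, but the shape of the exercise is the same: answer plus context in, a number between 0 and 1 out.

```python
import re

def groundedness(answer: str, context: str) -> float:
    """Toy veracity metric: the fraction of the answer's content words
    that also appear in the source context. A score near 1.0 suggests the
    answer is grounded; a low score is a hallucination warning flag."""
    stopwords = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in", "on"}

    def tokenize(text: str) -> set[str]:
        return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords}

    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

context = "The invoice total is $42 and it was paid on March 3."
grounded_score = groundedness("The invoice total is $42.", context)
drifted_score = groundedness("The invoice was disputed in court.", context)
```

Running this, the first answer scores a perfect 1.0 (every content word is backed by the context) while the second scores well below it, because "disputed" and "court" appear nowhere in the source. That gap is exactly the kind of signal you want a veracity metric to surface.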
When we compute these metrics, we are evaluating the genAI system for its veracity. How well is our system performing? Did we first do no harm? Did we at least not make things worse? And of course, from there we want to make sure we have really added value. We made things better.
Evaluations in the context of genAI refer to the systematic assessment of how well AI models perform across various dimensions. Evaluations can be performed by humans or by machines. Human evaluation is often achieved through “human-in-the-loop” approaches. Content rating is one form of evaluation.
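Here is a minimal sketch of what that machine-plus-human combination can look like in practice. The case data, the 1-to-5 rating scale, and the lambda stand-ins for the model and the human rater are all hypothetical placeholders; in a real pipeline the model would be an LLM call and the rater would be a review queue in front of actual people.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    question: str
    response: str
    exact_match: bool   # machine evaluation: does the response match the expected answer?
    human_rating: int   # human-in-the-loop content rating, 1 (poor) to 5 (excellent)

def evaluate(cases, model, rate):
    """Run each (question, expected) case through the model, then collect
    both an automated check and a human content rating per response."""
    results = []
    for question, expected in cases:
        response = model(question)
        results.append(EvalResult(
            question=question,
            response=response,
            exact_match=response.strip().lower() == expected.strip().lower(),
            human_rating=rate(question, response),
        ))
    return results

# Stubbed model and rater so the harness runs end to end.
cases = [("What year did AWS launch?", "2006")]
results = evaluate(cases, model=lambda q: "2006", rate=lambda q, r: 5)
pass_rate = sum(r.exact_match for r in results) / len(results)
```

The point of structuring it this way is that machine checks and human ratings land in the same record, so you can later ask where they disagree – which is often where your automated metrics are lying to you.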
One framework for evaluating genAI systems is the retrieval augmented generation assessment (RAGAS) framework, which evaluates the quality and truthfulness of AI-generated responses. I won’t cover RAGAS in detail here, but you can google it or ask Claude. Better yet, reach out to Ian Webster and the guys over at Promptfoo to get started with their open source evaluation framework.
It's easy to get really bogged down in all the frameworks and tools we need to implement to deliver responsible AI. I keep telling my team "we need evals!" but I have not really provided any guidance on where to start or why we should start there, which isn't fair. We have started with a RAGAS score as it brings so many different aspects of both content veracity and pipeline effectiveness together.
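Part of what makes a single RAGAS score appealing is exactly that rolling-together: several component metrics collapse into one number you can track over time. A common way to aggregate such components – and, as I understand it, how early RAGAS versions combined theirs – is a harmonic mean, which drags the composite toward the weakest dimension so one bad score can't hide behind three good ones. A sketch with made-up component values:

```python
from statistics import harmonic_mean

# Hypothetical component scores, each in (0, 1]. Faithfulness and answer
# relevancy speak to content veracity; context precision and recall speak
# to how well the retrieval pipeline is feeding the generator.
components = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

# The harmonic mean sits below the arithmetic mean whenever the
# components differ, punishing any single weak dimension.
composite = harmonic_mean(components.values())
arithmetic = sum(components.values()) / len(components)
```

With these example numbers the composite lands around 0.80, noticeably below the simple average – which is the behavior you want when "pretty good retrieval but shaky faithfulness" should not be allowed to average out to a passing grade.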
As Jensen would say, I wish you ample amounts of struggle and suffering as we figure this out together. Happy building.