First Do No Harm: The Hippocratic Oath for GenAI Results
The Amazon Web Services (AWS) re:Invent conference was an overwhelming and humbling experience, and I am grateful that I had the privilege of attending. I was humbled by the scale, and at times I felt really small. It’s kind of spellbinding to be a small business owner and tech startup founder traversing a literal ocean of massive companies with stupendous budgets. And I must admit, AWS knows what it’s doing. This conference is a testament to their success – when 52,000 people are willing to pay for the privilege of being sold to, and then literally wait in line at your booth to buy your product? Well, that’s got to be the textbook definition of product market fit.
And yet those of us building with generative artificial intelligence (genAI) still put our pants on the same way each day. We are all struggling with the same pesky problems. Can I still get the value if I keep my data private? How do I get the darn answers to be accurate? How do I explain this answer to a regulator? All of this is about overcoming the major obstacles to adoption – a central one being FEAR.
I’ll be honest and say that I’ve been doing just a smidge of handwaving when it comes to articulating the finer points around responsible AI, large language model (LLM) guardrails, and governance. I mean really, so much jargon and so little time to get in there and make it real in a production software application at the aforementioned small but mighty tech startup. It feels like a big ole word salad. Who’s got time for all this?
If you are like me and are actively building to overcome hallucinations, come over here and join me at my little table in the corner of this dive bar. We are serving word salad. Can’t always trust the results you are getting? Yep, we are serving that too. This pesky little monster of a problem is a problem of reliability – one of the core dimensions of responsible AI (RAI). In our darkest moments, those of us building with genAI every day step back and ask ourselves – have I made it better? Yes, it’s cool. Yes, it’s fun. Yes, it will get me a customer meeting – but did I really add value? Was the juice worth the squeeze? How can I be sure?
To me, that’s where the useful parts of RAI come in – let’s measure and account for the value we believe we have created. Let's establish our value hypothesis and prove to ourselves and to our customers that we consistently delivered, and will continue to consistently deliver, that value in a way that benefits humanity.
Responsible AI is the practice of developing, deploying, and using artificial intelligence systems in ways that are valuable (ethical, transparent, fair, accountable, and beneficial to society) while actively preventing harm.
I won’t spend any time in this article on the other six pillars of RAI; I want to really zoom in on reliability.
Reliability is the core requirement that the results we generate in our genAI applications be trustworthy. They are truthful. They are complete. They make sense. These qualities are so intuitive to us as beings with human intelligence, but how do we make sure that a machine with the ability to mimic human intelligence isn't just mimicking truthfulness? Yeah, it boggles my mind, too.
Veracity is the quality of being true or the habit of telling the truth. High veracity ensures that AI systems provide accurate, factual, and trustworthy responses, which are essential for tasks such as content generation and decision support systems. The term often appears in contexts involving the evaluation of statements, claims, or information to determine their truthfulness.
Reliable genAI results possess the qualities of veracity and robustness, meaning they are both truthful and sturdy. They don’t leave anything important out, within the context of the request. Veracity is interesting as it is dependent on context. We want our results to be complete. But completeness within the context of a request for a one sentence answer is different from the kind of completeness we expect from a dissertation length paper. This is one of the reasons the word “comprehensive” is such a trigger word for me. Comprehensive to whom? In what context? Would my “comprehensiveness” withstand scrutiny in a court of law? But I digress.
There are (at least) five types of veracity:
Sounds hard, right? It is. The good news is we can measure veracity, and we can improve veracity in many ways. However, to improve, we must first measure. Some key veracity metrics:
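To make "measure first" concrete, here is a minimal, illustrative sketch of one toy veracity check – my own invention for this article, not any standard library's metric. It scores how well an answer is grounded in the retrieved source context by asking what fraction of the answer's content words actually appear in that context. Real faithfulness metrics are far more sophisticated, but the shape of the exercise is the same: answer plus context in, a number between 0 and 1 out.

```python
import re

def groundedness(answer: str, context: str) -> float:
    """Toy veracity metric: the fraction of the answer's content words
    that also appear in the source context. A score near 1.0 suggests the
    answer is grounded; a low score is a hallucination warning flag."""
    stopwords = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in", "on"}

    def tokenize(text: str) -> set[str]:
        return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords}

    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

context = "The invoice total is $42 and it was paid on March 3."
grounded_score = groundedness("The invoice total is $42.", context)
drifted_score = groundedness("The invoice was disputed in court.", context)
```

Running this, the first answer scores a perfect 1.0 (every content word is backed by the context) while the second scores well below it, because "disputed" and "court" appear nowhere in the source. That gap is exactly the kind of signal you want a veracity metric to surface.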
When we compute these metrics, we are evaluating the genAI system for its veracity. How well is our system performing? Did we first do no harm? Did we at least not make things worse? And of course, from there we want to make sure we have really added value. We made things better.
Evaluations in the context of genAI refer to the systematic assessment of how well AI models perform across various dimensions. Evaluations can be performed by humans or by machines. Human evaluation is often achieved through “human-in-the-loop” approaches. Content rating is one form of evaluation.
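Here is a minimal sketch of what that machine-plus-human combination can look like in practice. The case data, the 1-to-5 rating scale, and the lambda stand-ins for the model and the human rater are all hypothetical placeholders; in a real pipeline the model would be an LLM call and the rater would be a review queue in front of actual people.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    question: str
    response: str
    exact_match: bool   # machine evaluation: does the response match the expected answer?
    human_rating: int   # human-in-the-loop content rating, 1 (poor) to 5 (excellent)

def evaluate(cases, model, rate):
    """Run each (question, expected) case through the model, then collect
    both an automated check and a human content rating per response."""
    results = []
    for question, expected in cases:
        response = model(question)
        results.append(EvalResult(
            question=question,
            response=response,
            exact_match=response.strip().lower() == expected.strip().lower(),
            human_rating=rate(question, response),
        ))
    return results

# Stubbed model and rater so the harness runs end to end.
cases = [("What year did AWS launch?", "2006")]
results = evaluate(cases, model=lambda q: "2006", rate=lambda q, r: 5)
pass_rate = sum(r.exact_match for r in results) / len(results)
```

The point of structuring it this way is that machine checks and human ratings land in the same record, so you can later ask where they disagree – which is often where your automated metrics are lying to you.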
One framework for evaluating genAI systems is the retrieval augmented generation assessment (RAGAS) framework, which evaluates the quality and truthfulness of AI-generated responses. I won’t cover RAGAS in detail here, but you can google it or ask Claude. Better yet, reach out to Ian Webster and the guys over at Promptfoo to get started with their open source evaluation framework.
It's easy to get really bogged down in all the frameworks and tools we need to implement to deliver responsible AI. I keep telling my team "we need evals!" but I have not really provided any guidance on where to start or why we should start there, which isn't fair. We have started with a RAGAS score as it brings so many different aspects of both content veracity and pipeline effectiveness together.
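Part of what makes a single RAGAS score appealing is exactly that rolling-together: several component metrics collapse into one number you can track over time. A common way to aggregate such components – and, as I understand it, how early RAGAS versions combined theirs – is a harmonic mean, which drags the composite toward the weakest dimension so one bad score can't hide behind three good ones. A sketch with made-up component values:

```python
from statistics import harmonic_mean

# Hypothetical component scores, each in (0, 1]. Faithfulness and answer
# relevancy speak to content veracity; context precision and recall speak
# to how well the retrieval pipeline is feeding the generator.
components = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.75,
    "context_recall": 0.70,
}

# The harmonic mean sits below the arithmetic mean whenever the
# components differ, punishing any single weak dimension.
composite = harmonic_mean(components.values())
arithmetic = sum(components.values()) / len(components)
```

With these example numbers the composite lands around 0.80, noticeably below the simple average – which is the behavior you want when "pretty good retrieval but shaky faithfulness" should not be allowed to average out to a passing grade.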
As Jensen would say, I wish you ample amounts of struggle and suffering as we figure this out together. Happy building.