Needle in a Haystack: Using LLMs to Search for Answers in Medical Records

Part 1 of a 3-part series

We are constantly thinking about how to make the workflows of clinicians less tedious, so they can do what they do best: serve patients and provide care. In a previous post on Messaging Encounter Documentation, we reviewed how our clinicians leverage AI to speed up documentation of their secure messaging visits with patients. In this next series of posts, we will share our learnings from a different application with the same goal: improving clinician workflows to deliver faster, better care for our members. We’ll delve specifically into how we can leverage LLMs to answer questions about the medical record.

Thank you to the Oscar AI pod for putting this together. You can also find this post on the Oscar Continuous AI Hackathon page, which is well worth following. Now let's get into it.


Background

There are many workflows in which clinicians must answer questions using a patient’s medical record. But these records are often dense, with many unstructured sections, which makes querying them cumbersome. Enter LLMs: they can both create structure out of unstructured natural language and extract information from the medical record to answer targeted clinical questions.
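To make that concrete, here is a minimal, hypothetical sketch of prompting an LLM to answer a targeted clinical question from an unstructured note. This is not our production code; the model name, note text, and output format are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # illustrative; assumes an OPENAI_API_KEY is configured

# A hypothetical snippet of unstructured chart text (not real patient data).
note = (
    "Pt reports worsening R hip pain x 14 months. Imaging shows severe "
    "joint space narrowing. Failed NSAIDs and 12 weeks of physical therapy."
)

question = "Has the patient tried and failed conservative treatment?"

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Answer clinical questions strictly from the provided record text. "
                'Reply with JSON: {"answer": true|false|unknown, "evidence": "<verbatim quote>"}'
            ),
        },
        {"role": "user", "content": f"Record:\n{note}\n\nQuestion: {question}"},
    ],
)

print(response.choices[0].message.content)
```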

One application where clinicians currently spend a lot of time digging through medical records is Prior Authorization (PA). When your doctor determines that you need a particular medical procedure, like a hip surgery, they submit a request to Oscar to get prior approval to perform the procedure. These PA requests are evaluated by clinical staff at Oscar for medical necessity, which usually takes the form of a set of clinical criteria derived from medical literature. A subset of these PA requests get automatically approved without human review by in-house technology, and we do not automatically deny requests. This saves everyone in the process time, getting the member the care they need as quickly as possible. The remainder require meticulous review from our clinicians. We wanted to understand whether GPT could accelerate our efforts to more quickly approve PA requests. We took a highly iterative approach to this problem, ultimately testing out 10+ different system design strategies.

For each experiment, we computed accuracy, precision, and recall against historical requests and dug deeper into examples to understand where the responses were working and where they were not. We then tweaked or changed the design for the next experiment; the complete set of experiments is represented in the table below (experiments noted with V2, V3, etc. contain small tweaks to the prompt based on our evaluation).

We focused on three main evaluation metrics:

Accuracy

  • In this case, we want GPT to recommend approval for auths that should have been approved, and correctly flag auths where further review is needed.

Precision

  • What percentage of auths that were recommended for approval by GPT were actually approved.

Recall

  • What percentage of all auth approvals was GPT able to capture.
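As a rough illustration of how these three metrics come out of the data (not our actual evaluation harness), each historical request can be scored by comparing GPT’s recommendation with the human decision:

```python
# Hypothetical evaluation sketch: each record pairs GPT's recommendation
# ("approve" = True) with the historical human decision for the same request.
def evaluate(records):
    """records: list of (gpt_recommends_approval, human_approved) boolean pairs."""
    tp = sum(1 for gpt, human in records if gpt and human)      # correct approvals
    fp = sum(1 for gpt, human in records if gpt and not human)  # leakage
    fn = sum(1 for gpt, human in records if not gpt and human)  # missed approvals
    tn = sum(1 for gpt, human in records if not gpt and not human)

    accuracy = (tp + tn) / len(records)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of GPT approvals that were truly approved
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of true approvals GPT captured
    return accuracy, precision, recall
```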

Precision and recall trade off against each other:

  • If we approve everything, then recall is 100%: we catch 100% of all ‘true’ approvals. However, precision is low: not all of the approvals are ‘true’ approvals; we are letting out the door some requests that should have been raised for further review, and leakage is very high.
  • If we approve nothing, then recall is 0%: we aren’t catching any of the ‘true’ approvals. However, precision is high: we aren’t letting anything out the door that shouldn’t be, but there is also no benefit from an AI augmentation system.

In general, we are more comfortable with the AI being conservative with approvals, as we want it to augment the process and help with review. The AI can never deny a request, so if it is conservative, the worst that can happen is that a human has to review the request and make a decision, i.e., the status quo. In other words, we care more about high precision than high recall.

That said, if the AI is too conservative (i.e., the model never recommends approval for anything), then it won’t actually streamline the process.

You can see that, in the end, our best experiments increased precision to the 83-94% range. The results show that when our evaluation via GPT indicated a positive outcome, it matched the human assessment the vast majority of the time. But when we started, the results were far more humble.

Medical necessity is established using a collection of clinical criteria. One of the first strategies we tried (“Ask Questions 10 at a Time”) consisted of a mega prompt that attempted to evaluate all criteria in a single task. Not surprisingly, the model struggled with this.
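To illustrate the difference in shape between these strategies, here is a hypothetical sketch of a single “mega prompt” over all criteria versus one prompt per criterion. The criteria and wording are invented for this example and are not our actual clinical criteria or prompts.

```python
# Invented criteria for illustration only; not our clinical criteria.
criteria = [
    "Has the patient failed at least 12 weeks of conservative treatment?",
    "Is there imaging evidence of severe joint space narrowing?",
    "Does the patient have a post-traumatic injury causing debilitating hip joint destruction?",
]

record_text = "..."  # placeholder for the relevant medical record excerpt

# Strategy 1 ("Ask Questions 10 at a Time"): one mega prompt over every criterion.
mega_prompt = (
    f"Medical record:\n{record_text}\n\n"
    "Evaluate each criterion below as True or False, citing the record:\n"
    + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
)

# Strategy 2: one prompt (and one model call) per criterion.
per_criterion_prompts = [
    f"Medical record:\n{record_text}\n\n"
    f"Criterion: {c}\nAnswer True or False, citing the record."
    for c in criteria
]
```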

More surprisingly, a subsequent test (asking 1-3 questions at a time with function calling and citations; in the table, “Function Calling Ask One by One with Citations”) actually did much worse. Without the context of the other questions, the model became confused on certain questions, evaluated them to True, and produced false positives. To give a concrete example, GPT struggled with answering the following question:

  • Does the patient have a post-traumatic injury (e.g., fracture, infection) causing debilitating hip joint destruction affecting movement, causing pain or stiffness?

Instead of determining from the medical record whether the member had a fracture or infection, the model gave a roundabout answer as to why a hip replacement was recommended (listing the conservative treatments that had been tried). It did not directly address whether a post-traumatic injury such as a fracture or infection had occurred. You can see a comparison of the two different cases below.

Comparison of Responses Asking Questions One at a Time vs. All at Once:
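For readers curious about the mechanics of the one-criterion-at-a-time setup, here is a hypothetical sketch of function calling with a citation requirement. It is not our production prompt or schema; the function name, model name, and record text are illustrative assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()  # illustrative; assumes an OPENAI_API_KEY is configured

# Hypothetical tool schema: force a boolean verdict plus a supporting citation
# for a single criterion, so the answer can be audited against the record.
criterion_tool = {
    "type": "function",
    "function": {
        "name": "record_criterion_result",
        "description": "Record whether a clinical criterion is met, with a citation.",
        "parameters": {
            "type": "object",
            "properties": {
                "criterion_met": {"type": "boolean"},
                "citation": {
                    "type": "string",
                    "description": "Verbatim quote from the medical record supporting the answer.",
                },
            },
            "required": ["criterion_met", "citation"],
        },
    },
}

question = (
    "Does the patient have a post-traumatic injury (e.g., fracture, infection) "
    "causing debilitating hip joint destruction affecting movement, "
    "causing pain or stiffness?"
)
record_text = "..."  # placeholder for the relevant medical record excerpt

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Answer only from the record. If the record does not address "
                "the criterion, answer false."
            ),
        },
        {"role": "user", "content": f"Record:\n{record_text}\n\nCriterion: {question}"},
    ],
    tools=[criterion_tool],
    tool_choice={"type": "function", "function": {"name": "record_criterion_result"}},
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(args["criterion_met"], args["citation"])
```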

In using AI to improve clinician workflows and streamline member care, it’s important to note that denials will always be handled by a human, never by AI. We have not yet deployed this prototype; we are still in the testing phase. In the next post, we’ll delve deeper into the series of experiments and show how we iterated from early failures to much more promising results.
