LLMs and Sensitive Data
My colleague Victoria Gamerman, PhD recently shared an article from Tamer Chowdhury about architecture for using sensitive, access-controlled data in the context of ChatGPT and other Large Language Models (LLMs). This is a vitally important area of development for these models, since many use cases in healthcare, business, and government are going to require a more nuanced approach than “everyone sees everything.” The particular architecture described couples an embedding database with the LLM. Rather than cost-prohibitive retraining of the model, potentially relevant information is retrieved from the database and passed to the model as context at the time of query. This provides a point of access control, since information the user should not have access to can simply be excluded from what is retrieved and passed to the model.
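As a rough sketch of that access-control point, here is what retrieval-time filtering might look like, assuming a simple in-memory store with per-document role lists (the Document class, the role names, and retrieve_context are my own illustration, not taken from the article):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    embedding: list[float]
    allowed_roles: set[str] = field(default_factory=set)  # who may see this chunk

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve_context(query_emb: list[float], store: list[Document],
                     user_roles: set[str], k: int = 3) -> list[str]:
    """Return the top-k most similar chunks the user is allowed to see.
    The access check happens before anything is handed to the LLM."""
    visible = [d for d in store if d.allowed_roles & user_roles]
    visible.sort(key=lambda d: cosine(query_emb, d.embedding), reverse=True)
    return [d.text for d in visible[:k]]
```

Anything filtered out here never reaches the model, so it can never leak into a response.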
This is a great solution for ensuring sensitive information doesn’t get out there. However, because of the way an LLM deals with data, it is useful to think about degrees of knowledge when handling the problem. Take, for example, an employee who is looking for sensitive data from a study that they shouldn’t have access to. If thoughtful engineering isn't put into managing how context is provided, it is possible that not only would the employee not get the sensitive information (good!), but they would not even be aware that the data exists. Worst of all, they might get a totally incorrect answer, based on the best data the LLM was provided, without any awareness that this could be improved!
As a more concrete illustration, say a company had sensitive data on patients. I’m going to exaggerate and put in theoretical, entirely descriptive data. To a person with full access, the passage might be:
John Smith participated in the Megacompany trial for Awesometumumab, and received a single injection of 500 mg. His initial body weight was 300 kg at week 1, and was 280 kg at week 12. Mild nausea was reported as an adverse event.
Someone with individual patient-level access might instead get the following context. This is the patient-name redaction model proposed in the Tamer Chowdhury article, and it should actually be fairly straightforward to automate, as sketched just after the passage:
Patient 1001 participated in the Megacompany trial for Awesometumumab, and received a single injection of 500 mg. His initial body weight was 300 kg at week 1, and was 280 kg at week 12. Mild nausea was reported as an adverse event. Data has been anonymized.
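That redaction step is easy to prototype. A minimal sketch, assuming the trial's enrollment records supply a name-to-identifier map (NAME_TO_ID and redact_names are hypothetical names; a production pipeline would lean on named-entity recognition rather than exact string matching):

```python
import re

# Hypothetical map built from the trial's enrollment records; illustrative only.
NAME_TO_ID = {"John Smith": "Patient 1001"}

def redact_names(passage: str) -> str:
    """Replace known patient names with study identifiers and tag the result."""
    for name, patient_id in NAME_TO_ID.items():
        passage = re.sub(re.escape(name), patient_id, passage)
    return passage + " Data has been anonymized."
```

Note that exact matching still leaves pronouns and other quasi-identifiers behind, which is one reason real systems reach for NER rather than a lookup table.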
Other interesting cases come up where it would be appropriate to share single data points, devoid of context, for assembling aggregate statistics; a small sketch pulling these tiers together follows the examples:
A patient in one trial had a body weight of 300 kg. A separate body weight reading was 280 kg. For details about specific trials, contact Jane Coordinator.
Or, alternatively:
Mild nausea was a side effect in a patient. For details about specific trials, contact Jane Coordinator.
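Putting the tiers side by side, the context handed to the model becomes a function of the requester's access level. A sketch, where the access levels, the record's field names, and the redact_names helper from above are all illustrative:

```python
CONTACT = "For details about specific trials, contact Jane Coordinator."

def context_for(access_level: str, record: dict) -> str:
    """Render the same record at different degrees of knowledge."""
    full = (f"{record['name']} participated in the {record['trial']} trial for "
            f"{record['drug']}, and received a single injection of {record['dose']}. "
            f"His initial body weight was {record['weights'][0]} kg at week 1, and "
            f"was {record['weights'][1]} kg at week 12. "
            f"{record['adverse_event']} was reported as an adverse event.")
    if access_level == "full":
        return full
    if access_level == "patient":
        return redact_names(full)  # names -> study IDs, as sketched above
    if access_level == "aggregate":
        return (f"A patient in one trial had a body weight of {record['weights'][0]} kg. "
                f"A separate body weight reading was {record['weights'][1]} kg. {CONTACT}")
    return CONTACT  # no access: still point the user to the data owner
```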
Providing these different contexts to ChatGPT and then asking to whom the antibody was administered, and at what dose, yields answers with the level of information appropriate to the recipient. It also surfaces the relevant contact person when the data exists but is hidden based on access rights.
For this reason, I believe that with LLMs it is more important than ever not to simply remove data that shouldn’t be available, but, in the ideal case, to redact or aggregate the data to the appropriate level of security and provide that to the model. Providing context for how to actually access the data can also reduce some of these silos. Since LLMs are more conversational than traditional databases, it is important to manage the risk of people assuming they are getting accurate and complete information, without some of these guardrails in place.
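One way to implement that last guardrail is to make the final prompt carry its own disclosure. A sketch (build_prompt and the caveat wording are my own illustration, not from the article):

```python
def build_prompt(question: str, context: str, fully_visible: bool) -> str:
    """Assemble the query sent to the LLM. When the context was redacted or
    aggregated, instruct the model to say so and to surface any named contact."""
    caveat = "" if fully_visible else (
        "Note: some source data was redacted or aggregated for access control. "
        "Flag that the answer may be incomplete, and mention any contact person "
        "named in the context.\n\n")
    return f"{caveat}Context:\n{context}\n\nQuestion: {question}"
```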
Here is a link to Tamer Chowdhury's post and article on the topic. Data governance along the development chain of an LLM is value adding. https://www.dhirubhai.net/posts/tamer-chowdhury-9875684_unlocking-knowledge-from-quality-insights-activity-7060248390290300928-vS0q?utm_source=share&utm_medium=member_ios