Harnessing Open-Source GPT Models for Efficient SQL Reporting in Epic’s Clarity Healthcare Data

In the realm of healthcare data analytics, the ability to swiftly extract actionable insights is pivotal. I’ve been working with GPTs for a while and decided to start playing with and testing the open-source models’ ability to generate SQL. The top open-source models for this task are currently Llama-33B, Mistral-7B, and Vicuna-33B. I’ve been working with both Mistral and Llama. My recommendation is to work with Llama, as it seems to me to be the best enterprise choice in the long run. Be aware that, prior to customized training on SQL, the open-source models are about half as good as the closed-source specialist systems.

In particular, I’m looking at what can be done against Epic's Clarity, a comprehensive data model employed by numerous healthcare institutions, which serves as a repository of vast patient and clinical data. The challenge lies in efficiently querying this repository to generate meaningful reports, a task often hindered by the complexity of SQL and Clarity’s roughly 20,000 tables. Here's where the advent of open-source Generative Pre-trained Transformer (GPT) models comes into play, offering a cost-effective and accessible solution for building robust SQL queries.

The Promise of Open-Source GPT Models

Open-source GPT models, like their closed-source counterparts, are adept at understanding and generating human-like text. As you might imagine, they generally score lower than closed-source systems with large teams developing against them. I’ve proven that these models can be fine-tuned to comprehend at least portions of the Clarity data model's intricacies, automating the translation of natural language requests into precise SQL queries. This capability will not only democratize data access (at least eventually), allowing non-technical staff to query databases without in-depth SQL knowledge, but also substantially accelerate the reporting process.
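To make that translation step concrete, here is a minimal sketch of how a natural-language request can be wrapped in schema context and a worked example before being handed to a local model for completion. The schema snippet, example pair, and function name are simplified illustrations I’ve made up for this post, not the actual Clarity schema or any production prompt:

```python
# A minimal sketch of the prompt-assembly step for text-to-SQL. The schema
# lines and the few-shot pair are illustrative stand-ins, not real Clarity DDL.

SCHEMA_SNIPPET = """\
PATIENT(PAT_ID, PAT_NAME, BIRTH_DATE)
PAT_ENC(PAT_ENC_CSN_ID, PAT_ID, CONTACT_DATE, DEPARTMENT_ID)"""

FEW_SHOT_EXAMPLE = (
    "Question: How many encounters did each patient have in 2023?\n"
    "SQL: SELECT PAT_ID, COUNT(*) FROM PAT_ENC "
    "WHERE CONTACT_DATE >= '2023-01-01' GROUP BY PAT_ID;"
)

def build_sql_prompt(question: str) -> str:
    """Wrap a natural-language question in schema context plus one worked
    example, leaving the trailing 'SQL:' for the model to complete."""
    return (
        "You are a SQL assistant for the tables below.\n\n"
        f"Schema:\n{SCHEMA_SNIPPET}\n\n"
        f"{FEW_SHOT_EXAMPLE}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_sql_prompt("List patients born before 1950.")
print(prompt)
```

The resulting string is what you would feed to whichever local model you are testing; the model’s completion after the final `SQL:` is the candidate query.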

Training on a Budget: CPU Over GPU

GPUs are the gold standard for training models due to their parallel processing capabilities, which significantly reduce training time for complex algorithms. Their architecture is specifically designed to handle the vast amounts of data and the intensive computational tasks inherent in machine learning. However, this cutting-edge technology comes at a steep cost, often running into thousands of dollars for high-end units, which can be a barrier for smaller healthcare institutions or research teams with limited funding. Moreover, the high demand and sometimes scarce availability of GPUs can lead to accessibility issues, further complicating their adoption for training large models like GPT.

Despite these challenges, open-source GPT models offer an answer. These models are engineered with the versatility to accommodate training on CPUs – the more traditional and widely available computing cores found in standard computer systems. While CPUs are generally slower due to their sequential processing nature, recent advancements in software optimization and parallel computing techniques have made it possible to train AI models on CPUs more effectively than ever before. It's a testament to the flexibility of open-source models that they can still learn and adapt over time without the need for high-end hardware. Admittedly, the training may not progress to the depths achieved with GPUs, and the timeframes will extend, but the end result can still be a robust, capable model.
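One common technique that makes CPU training workable is gradient accumulation: run several small forward/backward passes, sum the gradients, and take a single optimizer step, so a memory-limited machine still trains at a useful effective batch size. A quick sketch of the arithmetic – the numbers are illustrative, not a recommendation:

```python
# Gradient accumulation: several small forward/backward passes per optimizer
# step, a common trick for memory-limited CPU training. Numbers are examples.

micro_batch_size = 2      # sequences per forward pass (small enough for RAM)
accumulation_steps = 16   # backward passes accumulated per optimizer step

effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size)  # 32 sequences contribute to each weight update

# Toy illustration: averaging per-micro-batch gradients approximates the
# gradient of the full batch under a mean loss.
micro_grads = [0.5, -0.2, 0.1, 0.3]                # one scalar per micro-batch
step_gradient = sum(micro_grads) / len(micro_grads)  # close to 0.175
```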

The process of training on a CPU, albeit more protracted, opens doors for a wider range of institutions to participate in the AI revolution. It democratizes access to technology, particularly for healthcare organizations that wish to leverage AI for data analysis but face financial constraints. Training an AI model like GPT to interpret and generate SQL queries for Epic's Clarity data model using a CPU is a meticulous process. It requires careful data curation and strategic training methods to compensate for the lack of processing power. However, when done correctly, this approach can still yield a powerful tool for healthcare analytics. The AI, once trained, can understand the complex relationships and structures within the Clarity model, allowing for the automated generation of accurate SQL queries. This capability can significantly expedite report generation and data retrieval, which are crucial for timely decision-making in healthcare settings.

Adapting to Epic's Clarity Data Model

Epic's Clarity data model is extensive and complex, designed to store a wide array of healthcare information. To tailor an open-source GPT model for Clarity, one must first pre-train it on a corpus of healthcare-related data, including the typical queries and report formats used in the industry. This step ensures that the model grasps the specific jargon and data structures unique to healthcare. Ideally, your team will already have clear examples of reports and their SQL organized; many organizations will need to do this data-cleansing step first.
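Once those report/SQL pairs are cleaned up, they need to land in a machine-readable instruction format for fine-tuning. Here is a minimal sketch using the common Alpaca-style field convention; the file name and the example pair are placeholders, not output from any real pipeline:

```python
import json

# Convert curated (report request, SQL) pairs into Alpaca-style instruction
# records. The field names follow the common Alpaca convention; the pair below
# is a placeholder standing in for your organization's cleaned examples.
pairs = [
    ("Count inpatient admissions by month for 2023.",
     "SELECT ... FROM PAT_ENC_HSP ..."),  # elided: a real query from your library
]

def to_instruction_record(request: str, sql: str) -> dict:
    return {
        "instruction": "Write a SQL query against Epic Clarity.",
        "input": request,
        "output": sql,
    }

with open("clarity_sql_train.jsonl", "w") as f:
    for request, sql in pairs:
        f.write(json.dumps(to_instruction_record(request, sql)) + "\n")
```

One record per line (JSONL) is what most open-source fine-tuning scripts expect to consume.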

Subsequently, the model undergoes fine-tuning, where it is exposed to the Clarity data model directly. During this phase, the model learns the relationships between different data elements within Clarity, improving its ability to generate accurate SQL queries that reflect the intricate relationships in healthcare data. I imagine a future where Epic itself ingests its whole data model and offers it up as a service, but with the customizations at each site, you’ll always need to train on the queries you’ve built locally.
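One way to make those table relationships explicit – whether for generating training examples or for sanity-checking the model’s output – is a simple foreign-key map. A sketch, with simplified stand-ins for the real Clarity tables and key columns:

```python
# A toy foreign-key map that spells out how Clarity-style tables join.
# The tables and key columns are simplified illustrations of the real model.
FOREIGN_KEYS = {
    ("PAT_ENC", "PATIENT"): "PAT_ID",
    ("ORDER_MED", "PAT_ENC"): "PAT_ENC_CSN_ID",
}

def join_clause(child: str, parent: str) -> str:
    """Emit the JOIN linking two tables, if the map knows their shared key."""
    key = FOREIGN_KEYS.get((child, parent))
    if key is None:
        raise KeyError(f"no known relationship {child} -> {parent}")
    return f"JOIN {parent} ON {child}.{key} = {parent}.{key}"

print(join_clause("PAT_ENC", "PATIENT"))
# JOIN PATIENT ON PAT_ENC.PAT_ID = PATIENT.PAT_ID
```

A map like this can also be rendered into the fine-tuning corpus itself, so the relationships appear in the training text rather than living only in the model’s weights.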

Challenges and Considerations

Training an open-source GPT model to interface with Clarity is not without its challenges. The model's accuracy is contingent on the quality and relevance of the training data. Another aspect to consider is the model's performance in a CPU-bound environment. To mitigate the slower training times, it's crucial to optimize the training process, such as by selecting the most impactful data for training and employing efficient data-preprocessing methods. If you aren’t familiar with the Alpaca work, I’d recommend looking at it, as it is similar in regard to your fine-tuning needs.

Impact on Report Generation

By automating SQL query generation, hospitals can drastically increase the number of reports generated, thus enhancing their operational efficiency. Clinicians and administrators can obtain timely insights into patient care, resource utilization, and hospital operations, enabling data-driven decision-making. This leap in reporting capabilities could significantly improve patient outcomes and hospital workflows.

The integration of open-source GPT models with Epic's Clarity presents a promising frontier for healthcare analytics. Even in the absence of extensive GPU resources, CPU-based training can produce models that effectively bridge the gap between natural language and SQL, empowering users to harness the full potential of their data. As these models continue to evolve, they hold the promise of revolutionizing how healthcare data is queried and understood, ultimately contributing to enhanced healthcare delivery.
