Patient journey comprehension to aid data harmonization
Introduction
One of Truveta’s core competencies is our ability to harmonize the electronic health record (EHR) data we receive from healthcare systems and make it available to life science researchers, in deidentified form, in a standard schema called the Truveta Data Model (TDM).
The schema explicitly leverages coding of clinical events according to appropriate ontologies, a process we will refer to as “data harmonization”. AI plays a big role in this data harmonization effort, as we discuss below.
Challenges in harmonizing isolated data elements
The standard approach we take to a data harmonization task on a particular data element, such as a term, note, or image, is to look only at the contents of that element. The presumption is that the element itself carries enough signal to allow proper harmonization, and that human annotations of many such elements provide enough supervision for AI to learn the appropriate harmonization function. In practice, however, this approach hits an accuracy ceiling (in both precision and recall) because there is insufficient mutual information between the element and its correct mapping. As an extreme case, how would you normalize the observation value “y”? Does it refer to the yellow color of urine, the answer “yes” to a question about allergies, or something else entirely? In some domains, such as observations or devices, this accuracy ceiling may be too low for our data harmonization needs.
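To make that ceiling concrete, here is a toy Python sketch of why an isolated value like “y” cannot be disambiguated; the candidate concepts and scores are invented purely for illustration.

```python
# Toy sketch: a context-free normalizer can only guess among equally
# plausible concepts for an ambiguous observation value. The candidate
# concepts and scores below are invented for illustration.

CANDIDATES = {
    "y": [
        ("Yellow (urine color)", 0.34),
        ("Yes (allergy questionnaire answer)", 0.33),
        ("Other / unknown", 0.33),
    ],
}

def normalize_without_context(raw_value: str):
    """Map a raw observation value to a concept using only the value itself."""
    candidates = CANDIDATES.get(raw_value.strip().lower(), [])
    # With near-uniform scores, no mapping clears a usable confidence bar:
    # this is the accuracy ceiling imposed by the isolated element.
    return max(candidates, key=lambda c: c[1]) if candidates else None

print(normalize_without_context("y"))  # ('Yellow (urine color)', 0.34), a coin flip
```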
How can we solve this problem? Interestingly, human annotators are not constrained to look only at the specific data element they are annotating; they are free to consult other relevant data elements from the patient’s EHR. Their aim is to provide accurate ground truth, and they see a marked improvement from looking at adjacent or related data elements. This gives us a valuable hint!
Patient journey summarization for data harmonization
Our hypothesis posits that significant additional signal about a data element’s ground truth resides in other parts of the patient journey, and that by exploiting that additional context we can significantly raise the accuracy ceiling mentioned earlier. Moreover, leveraging the contextual information could allow us to reach our target accuracy with less human supervision during training.
[Figure: the patient journey viewed as a time series of clinical events]
We could take an incremental approach to proving and exploiting this hypothesis, say by starting with a specific harmonization task, such as mapping a concept from a diagnosis event, and looking at some handpicked ancillary data elements: other notes, images, or EHR events occurring in its temporal vicinity. We could then estimate the accuracy lift available from the supplementary context, which would help us refine our hypothesis and perhaps bring in even more context or prune some of what we chose. Even if successful, however, this process would likely be laborious and would need to be repeated for every harmonization task, and hence may not scale.
Instead, what if we took a more generalized and automated approach, leveraging AI itself to produce this additional context? Specifically, let us view the entire patient journey as a time series of clinical events, as shown in the figure above. For a specific data harmonization task, such as normalizing a diagnosis and related observation events, we want to extract the relevant information from the entire patient journey while filtering out the noise. This could be done by a summarization agent based on a large language model (LLM). The prompt of that summarization agent would naturally be tuned to the data harmonization task at hand. More importantly, it would also be tuned to the specific data element instance being harmonized.
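A minimal sketch of what such an agent might look like is below. Everything in it is an assumption for illustration, not our actual implementation: the Event record, the prompt wording, and the injected call_llm wrapper standing in for whatever LLM is available.

```python
# Minimal sketch of a journey-summarization agent. Hypothetical pieces:
# the Event type, the prompt wording, and call_llm(prompt) -> str, an
# injected function wrapping whatever LLM is available.

from dataclasses import dataclass

@dataclass
class Event:
    timestamp: str  # ISO date, e.g. "2023-01-15"
    kind: str       # e.g. "diagnosis", "observation", "note"
    content: str    # raw text of the clinical event

def summarize_journey_for_element(journey, raw_element, task, call_llm):
    """Extract the parts of a patient journey relevant to harmonizing
    one data element, filtering out unrelated events as noise."""
    events_text = "\n".join(
        f"[{e.timestamp}] {e.kind}: {e.content}" for e in journey
    )
    # The prompt is tuned to the harmonization task (task) and, more
    # importantly, to the specific element instance (raw_element).
    prompt = (
        f"Harmonization task: {task}\n"
        f"Data element to normalize: {raw_element!r}\n\n"
        f"Patient journey (chronological):\n{events_text}\n\n"
        "Summarize only the events relevant to normalizing this element; "
        "ignore everything unrelated."
    )
    return call_llm(prompt)
```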
Continuing with the example above, suppose that when normalizing the diagnosis, the raw string says “M.N. prst., anap” or “CA PRSTT, ANPLSTC”. This string can be passed in the context of the summarization agent, which would pull out a summary of the events in the patient journey most relevant to normalizing it. In this example, previous (as well as future) events pertinent to high levels of prostate-specific antigen (PSA) and enlargement of the prostate could be very informative, and the AI model would have more confidence in mapping the string to the concept “malignant neoplasm of prostate, anaplastic”. In contrast, the absence of such contextual information would leave it with low confidence, especially if a similar abbreviation was never seen in the training data.
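Continuing the sketch above, a hypothetical invocation for this example might look like the following; the journey events are synthetic and my_llm is a stand-in for any LLM wrapper.

```python
# Worked example using the sketch above. The journey events are
# synthetic and my_llm stands in for any LLM wrapper.
journey = [
    Event("2022-11-02", "observation", "PSA 28 ng/mL, markedly elevated"),
    Event("2022-12-10", "note", "DRE: enlarged, firm prostate"),
    Event("2023-01-15", "diagnosis", "M.N. prst., anap"),
    Event("2023-02-20", "procedure", "prostate biopsy; anaplastic cells seen"),
]
context = summarize_journey_for_element(
    journey,
    raw_element="M.N. prst., anap",
    task="normalize this diagnosis string to a standard ontology concept",
    call_llm=my_llm,
)
# With the PSA and prostate-enlargement events in the summary, a
# downstream harmonizer can map the string to "malignant neoplasm of
# prostate, anaplastic" with much higher confidence.
```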
The future is as informative as the past
Interestingly, the patient journey summary context need not be limited to the past; it can also draw on events that occur after the data element being normalized. After all, this is a data harmonization task: all the events have already happened and are recorded in the patient journey, so there is nothing wrong with looking at subsequent events to gather signals for harmonizing earlier ones. In other words, data harmonization is not a causal task. In contrast, a truly predictive task, such as predicting future disease risk from the past journey, would of course have to be constrained to causal features.
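One simple way to express this non-causal selection, reusing the hypothetical Event type from the earlier sketch, is a context window that extends symmetrically before and after the element; the 180-day width is an arbitrary illustration.

```python
# Sketch: because harmonization is retrospective, the context window can
# extend both before and after the element being normalized. Reuses the
# hypothetical Event type; the 180-day width is arbitrary.

from datetime import date, timedelta

def context_window(journey, index_date: date, days: int = 180):
    """Select events within +/- `days` of the element's date, drawing on
    the future as freely as the past (a non-causal selection)."""
    lo = index_date - timedelta(days=days)
    hi = index_date + timedelta(days=days)
    return [e for e in journey
            if lo <= date.fromisoformat(e.timestamp) <= hi]
```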
Iterative harmonization
An interesting extension to our approach is to make harmonization iterative. Instead of treating the extraction of context from the patient journey as a feed-forward operation, we could implement a feedback loop. The process would begin with an initial, context-free harmonization attempt. This preliminary output would then be used to extract a contextual summary from the broader patient journey, which in turn would be fed back into the harmonizer to generate a refined estimate. The refined output would enable gathering an even more pertinent context, which could be looped back into the system again. By repeating this cycle, we would expect harmonization accuracy to improve with each iteration, ultimately converging on a more accurate and robust result.
The success of this iterative system hinges on carefully managing the information flow to ensure stability, specifically by circulating only “extrinsic information”—that is, information that was not part of the harmonizer’s initial input or immediate outputs. This approach is analogous to iterative decoding techniques such as Turbo decoding used in error correction codes, where iterative refinement and feedback are fundamental. Analytical tools like Extrinsic Information Transfer (EXIT) charts could be employed to model and predict the behavior and stability of such a system.
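A sketch of the loop under the same assumptions as before is shown below. It assumes a hypothetical harmonize(raw, context) -> (concept, confidence) function, and it only loosely approximates the extrinsic-information constraint: the summary is conditioned on the current guess, but the evidence it circulates is drawn from the raw journey rather than from the harmonizer’s own outputs.

```python
# Sketch of the feedback loop, reusing the hypothetical helpers above.
# harmonize(raw, context) -> (concept, confidence) is assumed. The
# extrinsic-information constraint is loosely approximated: the summary
# is conditioned on the current guess, but the evidence it circulates
# comes from the raw journey, not the harmonizer's own outputs.

def iterative_harmonize(journey, raw_element, task, call_llm, harmonize,
                        max_iters: int = 3, tol: float = 0.01):
    # Initial, context-free harmonization attempt.
    concept, conf = harmonize(raw_element, context="")
    for _ in range(max_iters):
        context = summarize_journey_for_element(
            journey, raw_element,
            f"{task}; preliminary mapping: {concept}", call_llm,
        )
        new_concept, new_conf = harmonize(raw_element, context)
        if new_concept == concept and abs(new_conf - conf) < tol:
            break  # converged on a stable mapping
        concept, conf = new_concept, new_conf
    return concept, conf
```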
We also have evidence from the agentic frameworks used in assistive chatbots that this iterative approach works well; indeed, reflection and iteration with message passing across independent agents are essential aspects of such frameworks.
Such an iterative system would not merely couple the summarization and harmonization tasks for a specific data element; it would also couple all the harmonization tasks across the whole journey. As each data element in the patient journey gets adequately normalized, the summarizers of the other data elements benefit: their summaries become more pertinent and less noisy. This is illustrated by the diagram above. This cross-communication between the data harmonizers should give an additional lift in accuracy and reduce reliance on large-scale human annotation.
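Under the same assumptions as the earlier sketches, this cross-coupling could be approximated as a round-robin over all the elements, with each round seeing the other elements’ latest normalized concepts.

```python
# Sketch of coupling harmonizers across the journey: each round
# re-summarizes every element against a journey annotated with the other
# elements' latest normalized concepts, so cleaner neighbors yield less
# noisy summaries. All helpers are the hypothetical ones defined above.

def harmonize_journey(journey, tasks, call_llm, harmonize, rounds: int = 2):
    # tasks: list of (element_index, task_description) pairs
    normalized = {i: None for i, _ in tasks}
    for _ in range(rounds):
        for i, task in tasks:
            annotated = [
                Event(e.timestamp, e.kind,
                      f"{e.content} [normalized: {normalized[j]}]"
                      if j != i and normalized.get(j) else e.content)
                for j, e in enumerate(journey)
            ]
            context = summarize_journey_for_element(
                annotated, journey[i].content, task, call_llm,
            )
            normalized[i], _ = harmonize(journey[i].content, context)
    return normalized
```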
Acknowledgement: I would like to thank Alireza Ghods for his contributions to the writing and review of this article.