Buying and Selling Image Data: A Practical Solution for VNA, AI, and Clinical Research
Kyle Henson, DM
Helping CEOs & CIOs fix every IT issue in their healthcare system. If it’s broken, call me and your problem is solved.
Every now and then I am asked about, or read an article about, someone selling massive amounts of data to one of the big companies involved in AI or clinical research. And when you have a lot of data, the obvious thought is: I want some of that free money!
As a thought exercise, let's look at some of the realities of moving more than a petabyte of image data. A petabyte is 1,024 terabytes, or 1,048,576 gigabytes. Many, but not all, VNAs store data in a proprietary format that is like DICOM but is not a straight .dcm file. This means that to get data out you can't simply copy the files; you have to perform a DICOM transaction. Whether the data is stored as a DICOM file or not, both transfer types require the additional step of de-identification, so although DICOM transactions add time, they are not the end of the world.
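To make the de-identification step concrete, here is a minimal sketch using the open-source pydicom library, applied once a study has been pulled out of the VNA. The handful of tags blanked here is purely illustrative, not a complete de-identification profile; a real project would implement the full DICOM PS3.15 confidentiality profile and an identifier-mapping strategy agreed upon with the buyer.

    import pydicom

    def deidentify(path_in, path_out):
        """Blank a few direct identifiers in one DICOM file (illustrative subset only)."""
        ds = pydicom.dcmread(path_in)
        for keyword in ("PatientName", "PatientID", "PatientBirthDate",
                        "ReferringPhysicianName", "InstitutionName"):
            if keyword in ds:
                setattr(ds, keyword, "")   # blank the identifier
        ds.remove_private_tags()           # private vendor tags often hide PHI
        ds.save_as(path_out)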
In my experience a single server tops out at somewhere around 15,000 studies per day, which is approximately 500 GB. That is less than five hundredths of one percent of our petabyte moved per day. Doing the simple math, ten servers dedicated to nothing but copying this data, ignoring any penalty for de-identification or additional compression, would still need roughly 210 days to move 1 PB. I submit that this is not practical and that there is a better way.
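For the skeptical reader, the arithmetic is easy to check in a few lines of plain Python (the 500 GB per server per day figure is my estimate from experience, not a benchmark):

    PB_IN_GB = 1024 * 1024               # 1 PB = 1,048,576 GB
    gb_per_server_per_day = 500          # ~15,000 studies/day per server
    servers = 10

    daily_fraction = gb_per_server_per_day / PB_IN_GB
    days_for_farm = PB_IN_GB / (gb_per_server_per_day * servers)
    print(f"{daily_fraction:.4%} of 1 PB moved per server per day")  # 0.0477%
    print(f"{days_for_farm:.1f} days with {servers} servers")        # 209.7 days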
First, we are looking at the problem from the wrong end. Whether for clinical research or for training an artificial intelligence (AI) engine, the buyer likely does not want ALL the data; they are looking for very specific use cases. In particular, what diagnosis are they trying to research or train for? Instead of dumping billions of images on the buyer and letting them sort through it all, I propose preparing a system that can provide a targeted data set rather than answer a generic query like "send me all chest x-rays"; that capability will go a long way toward cultivating a long-term relationship. To achieve this targeted approach, we must begin at the report level, not with the images.
To initiate this targeted method, we would build a database that holds all reports (not images) for the enterprise. Start by pulling an extract from the EMR for all existing reports, then add an HL7 or FHIR connection to capture all new reports as they are created. With the reports parsed into the database, any future question or required data parameter can be answered with relative ease. The database can then be queried for the specific data set desired; the output of that query would be accession number, patient ID, date of service, and procedure description. There should, of course, be a 1:1 relationship between the accession number on the report and the images in the VNA, but the additional output fields will help if there is an accession number mismatch. A minimal sketch of such a report index follows.
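Here is what that report index might look like, using SQLite for brevity. The column names, and the idea of storing parsed demographics such as smoking status alongside the report text, are my assumptions for illustration; a production system would sit on an enterprise database fed by the EMR extract and an HL7/FHIR interface engine.

    import sqlite3

    conn = sqlite3.connect("reports.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reports (
            accession_number TEXT PRIMARY KEY,  -- joins back to the study in the VNA
            patient_id       TEXT,
            date_of_service  TEXT,
            procedure_desc   TEXT,
            patient_sex      TEXT,              -- parsed from the HL7/FHIR feed
            patient_age      INTEGER,
            smoking_status   TEXT,              -- e.g. 'never', 'former', 'current'
            report_text      TEXT
        )
    """)
    conn.commit()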
Now, armed with this export, a savvy VNA team can, instead of drowning the buyer in millions of "chest x-rays", provide the several thousand "non-smoking males between the ages of 15 and 30 with a lung cancer diagnosis" studies they actually want; a sample query in that spirit is sketched below. I am not a researcher, but I suspect that this type of fine-tuned data capture would be more beneficial to them, as well as much easier to service from the VNA; in effect, a win-win for all involved.
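Continuing the SQLite sketch above, the targeted pull might look like this. The LIKE clause on the report text is a stand-in for a real NLP or diagnosis-coding step, and the columns are the hypothetical ones from the previous sketch; the point is simply that the output is a short worklist of accession numbers to feed the VNA export, not a petabyte.

    import sqlite3

    conn = sqlite3.connect("reports.db")
    rows = conn.execute("""
        SELECT accession_number, patient_id, date_of_service, procedure_desc
        FROM reports
        WHERE patient_sex = 'M'
          AND patient_age BETWEEN 15 AND 30
          AND smoking_status = 'never'
          AND report_text LIKE '%lung cancer%'
    """).fetchall()

    for accession, patient_id, dos, procedure in rows:
        print(accession, patient_id, dos, procedure)  # the worklist for the VNA export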
To view all of my articles and posts, visit me at kylehenson.org.
AI Architect (Model & ML Eng.) with depth and breadth for real world ML solution in healthcare and life science, strong science and engineering discipline with creative and curious mind
6y Thanks Kyle Henson, this is actually pretty common, and it's good you brought it up. Data scientists (researchers) have a very good reason to say "I need all the data": it is about data discovery, a kind of "chicken and egg" thing. Data scientists often do not have enterprise experience; data complexity (schema, dependencies, quality issues, and historical embedding) and cost (money and time) are not part of the "need all data" thinking. I used to build "data scientist friendly" pipelines, or layers, to support their work before I became one of them. At the same time, I have some trouble buying into the notion that one can handle 1,000 PB of data daily, or that one knows how to handle the data. Volume is not necessarily the key; enterprise complexity is 1,000 PB more difficult than web traffic or user logs… one should learn from vertical search.
Healthcare IT Analyst and Solutions Specialist
6y Hi Kyle, I deleted that earlier comment because I believe I may have framed it incorrectly, and thinking out loud on any social platform just sometimes shows ignorance. So let me back up and first say thank you for writing the article, as predictive analytics and ML with respect to DICOM imaging is not something I come across very often. I'm curious to understand it better, as I have seen what you've described before, and I expect we will be seeing a lot more of it as we start to collect more and more healthcare image data. What I am aiming to understand, related to your article, is: do we get more accurate predictions if we standardize image training data sets before applying a statistical model, taking the type of image data (US, XA, CT, MR, etc.) completely out of the equation, or does it not matter? For example, let's say we wanted to try to predict whether an acre of land would have more evergreens vs. hardwoods (and vice versa) in a particular region of a particular forest. Would our resultant predictive model be more or less accurate if we... threw in some images of cats and birds (something other than images of evergreen or hardwood trees)? Please forgive my newbie understanding, but as I stated, I want to understand it better.
Instructor at MTMI specializing in PACS Administration and AI
6y Great article! Thanks Kyle
Team Lead, Product Management - Cloud and Technical Platforms
6y The other big challenge, outside of the EMR/report data, is the segmentation of a positive finding to teach the AI the imaging characteristics!
The AI-guy, Assisting in AI technology deployment, entrepreneur, expert trainer/consultant on PACS, interoperability, standards.
6y Yes, I agree. A PACS archive or VNA is optimized to support the clinical workflow of diagnosing and reviewing medical images, and it is definitely not optimal as a source for data analytics, decision support, or deep learning (AI). You'll need to copy the relevant information into a data warehouse that can serve that purpose, joined, as you mention, with HL7 and EMR info. This might change when FHIR gets widely deployed, as it would allow resources to be queried for that purpose.