Analysing non-tabular healthcare data with REFORM and Snowflake
Introduction
Just under a year ago Keshav Murthy, Vice President at CouchBase R&D, published this article detailing how to derive insights from a JSON dataset from CDC.gov.
11 months is a long time in the data world and a lot has changed between then and now. Snowflake has grown tremendouslyand REFORM was released, setting a new standard for scalable self-service JSON, MongoDB, data lake and Web API connectivity.
This article follows in Keshav's footsteps, showing how to derive insights from JSON healthcare data using Snowflake and REFORM.
Snowflake is an incredibly powerful and cost effective data warehouse. However, working with non-tabular data using Snowflake alone is an involved task that is time consuming, prone to error and not suited for self service.
REFORM empowers even non-technical users to easily browse and curate tables from tabular and non-tabular data. They can then load these tables directly into software like Power BI and Tableau, data lakes, databases such as Postgres and warehouses such as Snowflake.
REFORM is a universal self-service connector for non-tabular data such as MongoDB, JSON data in S3, data lakes, Azure Blob Storage, SaaS applications and web APIs such as Airship, Benchling, WooCommerce and HubSpot. REFORM reduces development, integration, testing and compute times whilst improving accuracy. It empowers analysts with self-service access to tabular and non-tabular data to fulfil their own data requests.
Data
JSON data comes in many forms. Some of these are essentially tables and others are non-tabular data. Keshav Murthy's article uses the Causes of death dataset which is small and tabular. It is also available from CDC.gov as CSV.
In lieu of this, dataset we'll use the similarly grim adverse events related to drugs indicated for osteoarthritis dataset from the FDA. This data is non-tabular and more representative of data from web apis, internet of things devices, data lakes and applications. It is paginated and over a gigabyte in size. This dataset is made up of adverse reaction reports which exhibit differences in structure. Each report includes a list of medicines and a list of the adverse reactions.
Important disclaimer
This article uses data from openFDA which issues the following disclaimer.
Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products.
As such you should not rely on this document to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of products. Neither myself nor SlamData are health providers or researchers.
Methodology
The first step to getting insight from non-tabular data, such as this dataset, is to tabulate it. This allows us to make charts, train and predict using machine learning models and visualise the data using software such as Tableau, Looker, ThoughtSpot and Power BI.
In this article, we'll explore three approaches to tabulating this JSON healthcare data; using REFORM alone, using REFORM with Snowflake and using Snowflake alone. These approaches can be applied to any non-tabular data such as apps backed by MongoDB, JSON data in S3, Azure Blob Storage, SaaS applications and web APIs such as Airship, Benchling and HubSpot.
Once the data is tabulated it can be easily:
- Analysed with applications such as Tableau, Power BI, Looker and ThoughtSpot
- Used to train and predict with machine learning models using software such as TensorFlow and DataRobot
- Manipulated using standard SQL and loaded into databases or warehouses such as Postgres, Snowflake and Redshift
How to follow along
REFORM Desktop and Snowflake are available as a free trials. PowerBI is available free as well.
If you get REFORM Desktop and Snowflake you can follow along with the "Tabulating non-tabular data with REFORM", "Tabulating non-tabular data without REFORM" and "Analysing the tabulated data with SQL" sections.
If you get REFORM Desktop and Power BI you can follow along with the "Tabulating non-tabular data with REFORM" and "Analysing the tabulated data with Power BI" sections.
If you prefer a server or private cloud based solution, Docker, AMI, CloudFormation, JAR and Google Cloud based trials of REFORM are also available. Please contact [email protected].
Tabulating non-tabular data with REFORM
With REFORM, we connect directly to the source of the data. In this case the source is the FDA.gov Web API. This ensures that we are always working with the latest data including new records and corrections. There is no need to download, stage or import the data. Self service users can add their own sources and commonly used sources can be added in advance so that self-service users don't need to do this themselves.
Like many web data sources, FDA.gov provides data in a paginated manner. This means that the dataset is not available as a single document, rather it is made up of multiple documents. REFORM handles this pagination for us.
We'll use the "Adverse events relating to NSAID drugs indicated for Osteoarthritis" dataset for our analysis: https://api.fda.gov/drug/event.json?limit=100&search=patient.drug.drugindication.exact:OSTEOARTHRITIS
In order to retrieve all the pages we'll need to get an API key from FDA.gov, provide this as a basic authentication username then select "Paginated via header" as the request type. REFORM also supports offset and token based pagination.
REFORM instantly shows us the available data and allows us to browse it. To tabulate the data we simply pick what we are interested in and it appears as columns in the table below. Here we are interested in the id of the reports, as well as reactions and information about drugs.
Now that we've picked out our table we can analyse the data directly, using software such as Power BI, DataRobot, Alteryx and ThoughtSpot, or push it into warehouses and databases such as Snowflake and Postgres. Every time we access the results REFORM provides the latest data from the source.
We can think of using REFORM like online shopping, except instead of a cart we have a table. We can browse, pick what we want and then check out. Now that we've "checked out" we can use software such as Power BI, ThoughtSpot, Looker, Tableau or SQL to analyse and manipulate the data.
To summarize, we have:
- Added the FDA Web API as a data source, letting REFORM automatically deal with the paginated data
- Browsed the available data fields from the non-tabular data
- Picked the data were are interested in, letting REFORM automatically tabulate the data for us
- Loaded the tabulated data into Snowflake or Power BI
By doing this we have now successfully paged, downloaded, imported, explored and tabulated the non-tabular data.
Tabulating non-tabular data without REFORM
When using Snowflake without REFORM, the data must first be downloaded from the source. This immediately makes stale data more likely and introduces risks associated with incorrect or outdated results.
Dealing with pagination
Like many web data sources, FDA.gov provides data in a paginated manner. This means that the dataset is not available as a single document but is available as multiple documents. Without REFORM, we must first download all the pages with a bash command such as the following:
next=$({ curl -v -g 'https://api.fda.gov/drug/event.json?limit=100&search=patient.drug.drugindication.exact:OSTEOARTHRITIS&api_key=<REDACTED>' 2>&1 1>event.json | perl -0777 -ne 'print "$&\n" if /(?:(?!<).)*?(?=>; rel="next")/'; }) && while [ "$next" != "" ]; do next=$({ curl -v -g "$next&api_key=<REDACTED>" 2>&1 1>>event.json | perl -0777 -ne 'print "$&\n" if /(?:(?!<).)*?(?=>; rel="next")/'; }); done
You will need to replace <REDACTED> with an OpenFDA API key.
Loading the data into Snowflake
Snowflake allows us to load the data into a table that has a column with the type variant.
Unfortunately, Snowflake's UI prevents us from loading this dataset.
Instead, we can download and install SnowSQL, then load our data as follows (making sure to set our warehouse to XSMALL before beginning so that we don't spend all of our credits):
snowsql -a slamdata -u becky create or replace warehouse example with warehouse_size = xsmall auto_suspend = 300; use warehouse EXAMPLE; use database TEST_DB; create or replace file format json_format type = 'JSON' strip_outer_array = false; create or replace stage raw_json file_format = json_format; put file:///Users/beckyconning/event.json @raw_json; create or replace table event_raw(data variant); copy into event_raw from @raw_json;
As shown in this error, Snowflake has a limit on the size of JSON documents. REFORM does not have any limit on the size of JSON documents and is tested on datasets containing thousands of 30GB documents.
100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes File 'stages/e15cb2fa-ac92-4a2d-95af-fee5665a904a/event.json.gz', line 7881707, character 47 Row 40 starts at line 7447303, column DATA If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
When we don't have control over the size of our JSON documents, we cannot use Snowflake alone to analyse them. Luckily, in this case, we can reduce the number of records in each page retrieved by our script. This will take about four times longer to retrieve.
next=$({ curl -v -g 'https://api.fda.gov/drug/event.json?limit=25&search=patient.drug.drugindication.exact:OSTEOARTHRITIS&api_key=<REDACTED>' 2>&1 1>event.json | perl -0777 -ne 'print "$&\n" if /(?:(?!<).)*?(?=>; rel="next")/'; }) && while [ "$next" != "" ]; do next=$({ curl -v -g "$next&api_key=<REDACTED>" 2>&1 1>>event.json | perl -0777 -ne 'print "$&\n" if /(?:(?!<).)*?(?=>; rel="next")/'; }); done snowsql -a slamdata -u becky create or replace warehouse example with warehouse_size = xsmall auto_suspend = 300; use warehouse EXAMPLE; use database TEST_DB; create or replace file format json_format type = 'JSON' strip_outer_array = false; create or replace stage raw_json file_format = json_format; put file:///Users/beckyconning/event.json @raw_json; create or replace table event_raw(data variant); copy into event_raw from @raw_json;
Exploring the data in Snowflake
We can now start to explore the non-tabular data by dissecting a small sample of the raw data.
select * from event_raw limit 1; select results.value from event_raw, lateral flatten (input => data, path => 'results') results limit 1;
Which returns the following:
+-------------------------------------------------------------------+ | VALUE | |-------------------------------------------------------------------| | { | | "companynumb": "US-PFIZER INC-2015467764", | | "duplicate": "1", | | "fulfillexpeditecriteria": "2", | | "occurcountry": "US", | | "patient": { | | "drug": [ | | { | | "activesubstance": { | | "activesubstancename": "INSULIN GLARGINE" | | }, | | "drugcharacterization": "2", | | "drugdosagetext": "46 UNIT, UNK", | | "drugintervaldosagedefinition": "804", | | "drugintervaldosageunitnumb": "1", | | "drugseparatedosagenumb": "1", | | "drugstartdate": "20141120", | | "drugstartdateformat": "102", | | "drugstructuredosagenumb": "46", | | "drugstructuredosageunit": "032", | | "medicinalproduct": "LANTUS", | | "openfda": {} | | }, | | { | | "activesubstance": { | | "activesubstancename": "INSULIN GLARGINE" | | }, | | "drugcharacterization": "2", | | "drugdosagetext": "50 UNIT, QD", | | "drugintervaldosagedefinition": "804", | | "drugintervaldosageunitnumb": "1", | | "drugseparatedosagenumb": "1", | | "drugstartdate": "20150323", | | "drugstartdateformat": "102", | | "drugstructuredosagenumb": "50", | | "drugstructuredosageunit": "032", | | "medicinalproduct": "LANTUS", | | "openfda": {} | | }, | | { | | "activesubstance": { | | "activesubstancename": "SULFAMETHOXAZOLE\\TRIMETHOPRIM" | | }, | | "drugadministrationroute": "065", | | "drugcharacterization": "2", | | "drugdosagetext": "1 TAB, BID", | | "drugindication": "CHRONIC SINUSITIS", | | "drugintervaldosagedefinition": "804", | | "drugintervaldosageunitnumb": "1", | | "drugseparatedosagenumb": "2", | | "drugstructuredosagenumb": "1", | | "drugstructuredosageunit": "032", | | "medicinalproduct": "BACTRIM DS", | | "openfda": { | | "application_number": [ | | "NDA017377" | | ], | | "brand_name": [ | | "BACTRIM DS" | | ], | | "generic_name": [ | | "SULFAMETHOXAZOLE AND TRIMETHOPRIM" | | ], | | "manufacturer_name": [ | | "Sun Pharmaceutical Industries, Inc" | | ], | | "package_ndc": [ | | "49708-146-01", | | "49708-145-01" | | ], | | "product_ndc": [ | | "49708-146", | | "49708-145" | | ], | | "product_type": [ | | "HUMAN PRESCRIPTION DRUG" | | ], | | "route": [ | | "ORAL" | | ], | | "rxcui": [ | | "198334", | | "198335", | | "849580", | | "208416" | | ], | | "spl_id": [ | | "72efc062-e8e3-bc81-e053-2991aa0ac4d8" | | ], | | "spl_set_id": [ | | "f59d0c04-9c66-4d53-a0e1-cb55570deb62" | | ], | | "substance_name": [ | | "TRIMETHOPRIM", | | "SULFAMETHOXAZOLE" | | ], | | "unii": [ | | "AN164J8Y0X", | | "JE42381TNV" | | ] | | } | | }, |
... the sample represents a single report and is just under 3000 lines long. As such I've only included a few lines from the start and end here. To view the full sample click here...
| "primarysource": { | | "qualification": "1", | | "reportercountry": "US" | | }, | | "primarysourcecountry": "US", | | "receiptdate": "20160830", | | "receiptdateformat": "102", | | "receivedate": "20160830", | | "receivedateformat": "102", | | "receiver": { | | "receiverorganization": "FDA", | | "receivertype": "6" | | }, | | "reportduplicate": { | | "duplicatenumb": "US-PFIZER INC-2015467764", | | "duplicatesource": "PFIZER" | | }, | | "reporttype": "1", | | "safetyreportid": "12698542", | | "safetyreportversion": "1", | | "sender": { | | "senderorganization": "FDA-Public Use", | | "sendertype": "2" | | }, | | "serious": "2", | | "transmissiondate": "20161109", | | "transmissiondateformat": "102" | | } | +-------------------------------------------------------------------+
This query is not standard SQL and only works in Snowflake.
From carefully examining this sample, we can determine the parts we are interested in, such as the report id, the drug, pharmaceutical class, inidication, characterization and the reactions.
Tabulating the non-tabular data in Snowflake
By taking note of the structure of the sample, we can adjust our query by adding flatten statements for the drug, reaction arrays and selecting the projections we are interested in from those flattens.
select results.value:safetyreportid as report_id, reactions.value:reactionmeddrapt as reaction, drugs.value:medicinalproduct as drug, drugs.value:drugindication as drug_indication, drugs.value:drugcharacterization as characterization from event_raw, lateral flatten (input => data, path => 'results') results, lateral flatten (input => results.value, path => 'patient:reaction') reactions, lateral flatten (results.value, path => 'patient:drug') drugs limit 10;
Dealing with the varying structure of non-tabular records
The structure of non-tabular data can change from record to record. This is the case with this FDA data. One example of this is the pharmaceutical classes of drugs. Some records contain this information and others don't. The sample we examined above doesn't contain this data.
REFORM shows us all the available fields in the dataset. We can pick these and REFORM will automatically take into account differences in structure, giving us access to the all the data.
If the same information is stored in different formats and structures, we can easily convert these and merge them together into a single column in the REFORM user interface.
Without REFORM, we have to examine samples of raw data. The sample we examined above does not contain the pharmaceutical classes of drugs, so without REFORM we may be unaware of this information.
Even if we were aware of this information, we would need to factor the differences in structure into our query. We might think that the following query would be a sensible way of doing this.
select results.value:safetyreportid as report_id, reactions.value:reactionmeddrapt as reaction, drugs.value:medicinalproduct as drug, drugs.value:drugindication as drug_indication, drugs.value:drugcharacterization as characterization, classes.value as drug_class from event_raw, lateral flatten (input => data, path => 'results') results, lateral flatten (input => results.value, path => 'patient:reaction') reactions, lateral flatten (results.value, path => 'patient:drug') drugs, lateral flatten (input => drugs.value, path => 'openfda:pharm_class_epc') classes limit 10;
However this silently excludes all of the records which do not include drug class information.
In order to select all the data, we need to provide an additional outer => true parameter to the flatten statement.
select results.value:safetyreportid as report_id, reactions.value:reactionmeddrapt as reaction, drugs.value:medicinalproduct as drug, drugs.value:drugindication as drug_indication, drugs.value:drugcharacterization as characterization, classes.value as drug_class from event_raw, lateral flatten (input => data, path => 'results') results, lateral flatten (input => results.value, path => 'patient:reaction') reactions, lateral flatten (results.value, path => 'patient:drug') drugs, lateral flatten (input => drugs.value, path => 'openfda:pharm_class_epc', outer => true) classes limit 10;
We can now save these results for future use.
create table drug_reactions as select results.value:safetyreportid as report_id, reactions.value:reactionmeddrapt as reaction, drugs.value:medicinalproduct as drug, drugs.value:drugindication as drug_indication, drugs.value:drugcharacterization as characterization, classes.value as drug_class from event_raw, lateral flatten (input => data, path => 'results') results, lateral flatten (input => results.value, path => 'patient:reaction') reactions, lateral flatten (results.value, path => 'patient:drug') drugs, lateral flatten (input => drugs.value, path => 'openfda:pharm_class_epc', outer => true) classes;
Summary
To summarize, we have:
- Written a program to download paginated data from the FDA API
- Installed snowsql and run it at the command line
- Created a stage, data format and raw table
- Modified our program to reduce the document size
- Staged and loaded the data into the raw table
- Used a flatten statement to examine a sample of the raw non-tabular data
- Examined the raw source data to discover the data we're interested in
- Used special knowledge to be aware of data that didn't appear in the sample
- Written a query with multiple flatten statements and parameters to tabulate the data
By doing all of this now successfully paged, downloaded, imported, explored and tabulated the non-tabular data without using REFORM.
Conclusion
Snowflake is an incredibly powerful and cost effective data warehouse. However, working with non-tabular data using Snowflake alone is an involved task that is time consuming, prone to error and not suited for self service.
REFORM empowers even non-technical users to easily browse and curate tables from non-tabular data and load these tables directly into software such as Power BI and Tableau, databases such as Postgres and warehouses such as Snowflake.
REFORM is a universal connector for non-tabular data such as MongoDB, JSON data in S3, data lakes, Azure Blob Storage, SaaS applications and web APIs such as Airship, Benchling, WooCommerce and HubSpot. REFORM reduces development, integration, testing and compute times whilst improving accuracy and empowering analysts with self-service access to tabular and non-tabular data to fulfil their own data requests.
Analysing the tabulated data in Power BI
Now that the data has been tabulated, we can analyse it using software such as Microsoft Power BI.
PowerBI can either connect directly to REFORM, or to a database or warehouse such as Snowflake or Postgres where REFORM has pushed the data.
To connect Power BI directly to REFORM, go to the results page for the adverse drug reactions table in REFORM, then create and copy a reusable access link.
Next open Power BI, click "Get Data" then click "Web".
Then paste the access link and click "OK".
REFORM pages and tabulates the latest data from the OpenFDA API and provides the latest results to Power BI.
Click "Transform Data".
Set "Name" to "DRUG_REACTIONS" and remove the "Changed Type" step. Then click "Close & Apply".
The data is now available in PowerBI for analysis and visualisation.
The FDA states that
When a report lists multiple drugs and multiple reactions, there is no way to conclude from the data therein that a given drug is responsible for a given reaction.
As such let's add a column which denotes whether each reaction is conclusive or not.
Click "Modeling", click "Add Table", then paste the following
STANDARD_REPORTS = ADDCOLUMNS(SUMMARIZE(DRUG_REACTIONS, DRUG_REACTIONS[REPORT_ID],"DRUGS",DISTINCTCOUNT(DRUG_REACTIONS[DRUG]),"REACTIONS",DISTINCTCOUNT(DRUG_REACTIONS[REACTION])),"STANDARD",[DRUGS]=1||[REACTIONS]=1)
Then click "Manage relationships", click "New", select "DRUG_REACTIONS", set "Cross filter direction" to "Both", click "OK" then click "Close".
Next insert a "Donut chart". Drag CONCLUSIVE from CONCLUSIVE_REPORTS to "Legend" and drag REPORT_ID from CONCLUSIVE_REPORTS to "Values".
This chart shows that the majority of reports are not conclusive. If reporting was changed such that reactions were listed per drug, then all new reports would be. Of course such a change is quite a large endeavour.
The FDA provides access to the following chart, but the usefulness of this is limited. It includes non-conclusive reports, drugs which weren't indicated for osteoarthritis and drugs which weren't suspected to have caused the reported adverse reactions.
We can produce our own chart without these limitations.
Insert a "Stacked bar chart", drag DRUG_CLASS to "Axis" and drag REPORT_ID from CONCLUSIVE_REPORTS to values.
Currently this bar chart is the same as the one provided by the FDA with one key difference. Our chart shows that the vast majority of records don't contain drug class information. As such we'll need to drill down into the individual drugs themselves. Before we do this, we will remove the limitations of the FDA chart.
Drag CONCLUSIVE, DRUG_INDICATION and CHARACTERIZATION to "Filters on this vidual". Select conclusive as "True", drug indication as "Osteoarthritis" and characterization as "1" (this means the drug was suspected as the cause of the reaction).
Only two drug classes are listed. There are 9136 reports containing drugs with no drug classification information. Out of the drugs whose pharmaceutical classes are included, 68 are NSAIDs and 1 is a Platelet Aggregation Inhibitor.
Next let's drill into the drugs themselves. Copy and paste the bar chart and replace DRUG_CLASS with DRUG. This shows some drug names containing full stops which we should remove.
Right click on DRUG_REACTIONS and click "New column", paste the following then replace DRUG with SANITIZED_DRUG in our newest bar chart.
SANITIZED_DRUG = SUBSTITUTE([DRUG],".","")
These values do not mean that drugs are more or less dangerous than each other, as there are many factors that affect reporting of adverse events. Please refer to the important disclaimer at the start of this article.
This chart shows that as expected the majority of reported adverse reactions involving drugs indicated for osteoarthritis are NSAIDs. NSAIDs are often prescribed for osteoarthritis. Also included in this list are a TNF inhibitor, a monoclonal antibody and a gabapentinoid.
Other drugs which are often prescribed for osteoarthritis include acetaminophen and opioid analgesics however these don't feature prominently in our chart.
Vioxx is an NSAID which was approved by the FDA in 1999 and voluntarily removed from the US market in 2004 due to concerns about risks of heart attack and stroke. Despite only being on the market for a short period, it is prominently featured on this list. Bextra (also known as Meloxicam) is an NSAID which was approved by the FDA in 2001 and removed from the US market by the FDA in 2005. The other drugs in this list are approved at the time of writing.
Celebrex (celecoxib), Voltaren and Ibuprofen are NSAIDs. Celebrex is reported as the suspected cause of adverse reaction more than twice as often than Ibuprofen. Some organisations, such as Best Practice Advocacy Centre New Zealand state that "A significant benefit of celecoxib is that it is associated with less risk of gastrointestinal bleeding compared to non-selective NSAIDs". Let us see if the BPACNZ statement is reflected in this dataset.
Copy and paste the last bar chart and replace SANITIZED_DRUG with REACTION. Then drag SANITIZED_DRUG to "Filters on this visual". Search for "Celebrex" and select all drugs that only contain Celebrex. Search for "Celecoxib" and select all drugs that only contain Celecoxib. Avoid values marked as "Blinded" and values that contain drugs other than Celebrex and Celecoxib.
Copy and paste the Celebrex reactions bar chart, click "Select all" then select all drugs that only contain Ibuprofen. Avoid values marked as "Blinded" and values that contain drugs other than Ibuprofen.
Then for each of the two newest bar charts select them and click "Table".
Gastrointestinal bleeding is the 15th most reported adverse event where Celebrex is suspected and the 1st most reported adverse event where Ibuprofen is suspected. This seems to reflect BPACNZ's statement, however, there are many possible causes for these results. These results should not be used to make decisions about medical care.
Let's summarize our insights into adverse reaction reporting of drugs indicated for osteoarthritis.
- Out of 23217 reports, only 10457 could be used to conclude that a given drug is suspected to be responsible for a given reaction.
- Out of the 10457 resulting reports only 246 contained drug classification information.
- There are more than twice as many adverse reaction reports where Celebrex is suspected as responsible than there are where Ibuprofen is suspected as responsible.
- Gastrointestinal bleeding is the 15th most reported adverse event where Celebrex is suspected and the 1st most reported adverse event where Ibuprofen is suspected.
- Despite being on the market for a short time Vioxx and Bextra are listed in the top 10 most reported drugs.
Research into the causes of these insights could lead to preferable patient outcomes and more valuable adverse event reporting.
In just a few minutes we've connected, tabulated and gained insights from paginated non-tabular Web API data with REFORM and Power BI. Whenever we refresh the report REFORM retrieves the latest data so that Power BI can update our charts.
Joining datasets with REFORM
REFORM shows us that in the adverse events dataset, drug characterizations are denoted as numbers. These numbers represent whether or not the drug was suspected as causing the adverse reaction or interacting with a suspected drug.
The meanings of these numbers are described in this table: https://reform-fda.s3.amazonaws.com/drug_characterizations.csv
We can combine this with the adverse events dataset by using REFORM to load both as tables into our analysis software, database or warehouse then joining the two tables together. This will give us the drug characterizations in english, rather than as numbers.
To do this add the dataset as a source, pick the columns then push the results into your favourite analysis software, database or warehouse.
When using SQL, Tableau, Looker or ThoughtSpot to analyse the data, we use "Join". When using Power BI we use "Manage relationships".
This straightforward process can be applied to any dataset, allowing us to join data from both tabular and non-tabular datasets as we please. This can be applied to datasources such as relational databases, CSV files, web APIs, NoSQL databases and others.
Analysing the tabulated data with SQL
Sometimes we would prefer to use normal SQL instead of software like Power BI. Now that the data has been tabulated, we can analyse it using straightforward SQL.
Using REFORM we can push the tabulated data into SQL databases and warehouses such as Snowflake and Postgres.
The FDA states that
When a report lists multiple drugs and multiple reactions, there is no way to conclude from the data therein that a given drug is responsible for a given reaction.
To reflect this, let's add a column which denotes whether each reaction is conclusive or not. We will also add a column showing drug characterization in english, rather than as a number.
create table drug_reactions as select * from adverse_drug_reactions left join characterization on drug_characterization_code = characterization_code right join ( select report_id as conclusive_report_id, count(distinct drug) = 1 or count(distinct reaction) = 1 as conclusive from adverse_drug_reactions group by report_id ) on report_id = conclusive_report_id;
Here we count the number of reports that do and don't meet this standard.
select count(distinct report_id), conclusive from drug_reactions group by conclusive;
As shown here, the majority of reports are not conclusive. If reporting was changed such that reactions were listed per drug, then all new reports would be. Of course, such a change is quite a large endeavour.
The FDA provides access to the following chart but the usefulness of it is limited. This is due to it including non-conclusive reports, drugs which weren't indicated for osteoarthritis and drugs which weren't suspected to have caused the reported adverse reactions.
Instead we'll do our own analysis with the following query.
select drug_class, count(distinct report_id) as frequency from drug_reactions where drug_indication = 'OSTEOARTHRITIS' and characterization = 'Suspect' and conclusive group by drug_class order by frequency desc limit 10;
Our results are much simpler. It is clear that the dataset does not include the pharmaceutical class of the majority of the suspected drugs indicated for osteoarthritis. Those drugs whose pharmaceutical classes are included are primarily NSAIDs.
Given the high number of missing pharmaceutical classes, we will need to drill down into the drugs themselves.
select regexp_replace(upper(drug), '\\W+', '') as sanitized_drug, count(distinct report_id) as frequency from drug_reactions where drug_indication = 'OSTEOARTHRITIS' and characterization = 'Suspect' and conclusive group by sanitized_drug order by frequency desc limit 10;
These values do not mean that drugs are more or less dangerous than each other, as there are many factors that affect reporting of adverse events. Please refer to the important disclaimer.
This table shows that, as expected, the majority of reported adverse reactions involving drugs indicated for osteoarthritis are NSAIDs. NSAIDs are often prescribed for osteoarthritis. Also included in this list are a TNF inhibitor, a monoclonal antibody and a gabapentinoid.
Other drugs which are often prescribed for osteoarthritis include acetaminophen and opioid analgesics, however, these don't feature prominently in this table. This may be the result of reporting practices.
Vioxx is an NSAID which was approved by the FDA in 1999 and voluntarily removed from the US market in 2004 due to concerns about risks of heart attack and stroke. Despite only being on the market for a short period, it is prominently featured on this list. Bextra (also known as Meloxicam) is an NSAID which was approved by the FDA in 2001 and removed from the US market by the FDA in 2005. The other drugs in this list are approved at the time of writing.
Celebrex (celecoxib), Voltaren and Ibuprofen are NSAIDs. Celebrex is reported as the suspected cause of adverse reactions more than twice as often than Ibuprofen. Some organisations, such as Best Practice Advocacy Centre New Zealand, state that "A significant benefit of celecoxib is that it is associated with less risk of gastrointestinal bleeding compared to non-selective NSAIDs". Let us see if the BPACNZ statement is reflected in this dataset.
select regexp_replace(upper(drug), '\\W+') as sanitized_drug, count(distinct report_id) as gi_bleed_frequency from drug_reactions where drug_indication = 'OSTEOARTHRITIS' and characterization = 'Suspect' and conclusive and upper(reaction) like '%GASTROINTESTINAL HAEMORRHAGE%' group by sanitized_drug order by gi_bleed_frequency desc limit 10;
The frequency of reported gastrointentinal bleeding as a suspected reaction to Celebrex and Ibuprofen are similar. This is not proportional to the overall number of reports for each drug, so let us inspect the top reported reactions for Celebrex and Ibuprofen.
Celebrex:
select upper(reaction) as sanitized_reaction, count(distinct report_id) as frequency from drug_reactions where drug_indication = 'OSTEOARTHRITIS' and characterization = 'Suspect' and conclusive and ( regexp_replace(upper(drug), '\\W+') = 'CELECOXIB' or regexp_replace(upper(drug), '\\W+') = 'CELEBREX' ) group by sanitized_reaction order by frequency desc limit 20;
Ibuprofen:
select upper(reaction) as sanitized_reaction, count(distinct report_id) as frequency from drug_reactions where drug_indication = 'OSTEOARTHRITIS' and characterization = 'Suspect' and conclusive and regexp_replace(upper(drug), '\\W+') = 'IBUPROFEN' group by sanitized_reaction order by frequency desc limit 5;
Gastrointestinal bleeding is the 15th most reported adverse event where Celebrex is suspected and the 3rd most reported adverse event where Ibuprofen is suspected. This seems to reflect BPACNZ's statement, however, there are many possible causes for these results. These results should not be used to make decisions about medical care.
Let's summarize our insights into adverse reaction reporting of drugs indicated for osteoarthritis.
- Out of 23217 reports only 10457 could be used to conclude that a given drug is suspected as responsible for a given reaction.
- Out of the 10457 resulting reports only 246 contained drug classification information.
- There are more than twice as many adverse reaction reports where Celebrex is suspected as responsible than there are where Ibuprofen is suspected as responsible.
- Gastrointestinal bleeding is the 15th most reported adverse event where Celebrex is suspected and the 3rd most reported adverse event where Ibuprofen is suspected.
- Despite being on the market for a short time, Vioxx and Bextra are listed in the top 10 most reported drugs.
Research into the causes of these insights could lead to preferable patient outcomes and more valuable adverse event reporting.
In just a few minutes, we have connected to and tabulated the data with REFORM, then used straightforward SQL to gain a number of insights into the data.
Self-service and pre-preparation
Traditionally data is pre-prepared into a normalised schema in a warehouse for analysts. REFORM speeds up both the compute and labour of this process as well as ensuring accuracy.
There is an alternative approach. Providing analysts with direct access to REFORM enables them to fulfil their own data requests by picking the exact data they need for their analysis. This enables analysts to find the needle in the haystack, improves performance and eliminates the time and cost involved in preparing, updating and storing un-needed data.
Both approaches have benefits and you should choose the one that is right for your team. REFORM will provide improved performance and ease of use, whether you choose to prepare the data in advance for your analysts or whether you give your analysts the self-service power to fulfil their own data requests.
Self-service and pre-preparation
Each report contains a list of drugs and a list of reactions. We picked data about reactions and drugs in the same table. As such, REFORM provides us with a table where the number of rows is the number of reactions multiplied by the number of drugs.
Instead, we could pick one table for reactions and another for drugs. This alternative approach provides a normalised schema over the data which can then be joined together as needed. This is useful, however it pushes the responsibility for joins onto downstream software such as Power BI or Snowflake.
REFORM is powered by proprietary technologies that make such joins orders of magnitude faster. Picking a table containing the information needed for your analysis can substantially improve query times and reduce compute and storage costs compared to the traditional approach.
Streaming, virtualisation and scale
Tables in REFORM are streaming and virtualised. We can think of this as though the REFORM table were a grocery list rather than the groceries themselves. We can use the same list in different stores the same way we can use the same REFORM table with different datasets. We can also export tables and import them into different copies of REFORM.
REFORM does not load the whole dataset into memory, nor does it store the data at rest. REFORM performs streaming selects, filters and tabulation at the byte level and streams the data to the destination. This approach reduces memory and storage requirements. Combined with REFORM's built in clustering, this makes scaling straightforward and efficient as well as enabling more efficient cloud solutions.