In clinical research, efficient and standardized data exchange is essential. Regulatory agencies, sponsors, and CROs depend on robust data formats to facilitate seamless submissions, sharing, and interoperability. Traditionally, the industry has used SAS Version 5 Transport (SAS XPT V5) files for regulatory submissions, with Dataset-XML later introduced as an alternative. However, these formats have notable limitations that impede efficiency. To overcome these challenges, CDISC has introduced Dataset-JSON, a modern data exchange format designed to serve as a more effective replacement for SAS XPT.
This article does not cover other data exchange formats, such as HL7 Version 3, Resource Description Framework (RDF), Web Ontology Language (OWL), and Analytic Information Markup Language (AnIML), which were considered potential candidates for replacing SAS XPT V5.
A Brief History of Data Exchange Formats in Clinical Trials
For decades, the pharmaceutical industry has relied on SAS XPT V5 as the standard data exchange format for regulatory submissions. Below is a timeline of key developments:
- SAS XPT V5 (1989): SAS XPT V5, or SAS Version 5 Transport format, was introduced by SAS Institute in 1989. The format was designed to facilitate data transfer between different SAS environments and quickly became the industry standard for regulatory submissions to agencies like the FDA. Despite being widely used, it imposed strict limitations on variable name length (8 characters) and dataset name length (8 characters), making it cumbersome for modern data processing.
- SAS XPT V8 (2012): Recognizing the limitations of XPT V5, SAS introduced XPT V8 in 2012 as an enhanced version of the transport format. XPT V8 removed the 8-character variable name limitation, allowing longer, more meaningful names, and increased compatibility with modern data structures. However, despite these improvements, XPT V8 has not yet been widely adopted by regulatory agencies, meaning XPT V5 remains the standard for submissions.
- Dataset-XML (2014): In an effort to modernize data exchange, CDISC introduced Dataset-XML in 2014. This format was based on ODM-XML and aimed to provide a more structured and readable alternative to SAS XPT V5. However, despite its technical advantages, Dataset-XML was never widely adopted due to its inefficiencies in handling large datasets, lack of broad software support, and challenges in practical implementation.
- Dataset-JSON (2024): Recognizing the need for a more efficient and interoperable data exchange format, CDISC released Dataset-JSON v1.1 in 2024. Dataset-JSON addresses the limitations of previous formats by allowing long variable names, supporting rich metadata, improving interoperability with modern programming languages, and handling large datasets efficiently with Dataset-NDJSON, its newline-delimited variant.
Limitations of SAS XPT V5 and Dataset-XML
SAS XPT V5:
- Name Length Restrictions: SAS XPT V5 limits both variable names and dataset names to 8 characters, making it difficult to use meaningful, descriptive names.
- Limited Data Types: The format is limited to two data types: numeric and character.
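- Inefficient Handling of Large Datasets: XPT V5 stores fixed-width records without compression, so character values are padded to their full declared length and large datasets take up more space than the data itself requires.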
- Limited Interoperability: SAS XPT V5 was designed for SAS users, making it difficult to use natively in other programming languages like R or Python without conversion tools.
- Impact on CDISC Standards: SAS XPT V5 limits the advancement of CDISC data standards, including SDTM and ADaM. The primary reason SDTM and ADaM variable names are restricted to 8 characters, variable labels to 40 characters, character variable lengths to 200 bytes, and dataset names to 8 characters is to maintain compatibility with SAS XPT V5.
- Character Encoding and Transcoding Issues: The XPT specification states: "The SAS transport file should be read in a SAS session encoding that is compatible with the encoding used to create the file. There is no method of conveying encoding information other than documenting it with the delivery of the transport file". Because the file itself carries no encoding information, a recipient who reads it with a different encoding can end up with corrupted non-ASCII characters.
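The limits quoted above (8-character names, 40-character labels, 200-byte character values) can be checked mechanically. The following is a minimal sketch, not an official validator: the `Variable` class and the `xpt_v5_violations` function are hypothetical names introduced for illustration. It also shows how the encoding issue interacts with the 200-byte limit, since the byte length of a value depends on the encoding used.

```python
from dataclasses import dataclass, field

# Limits quoted in the list above for XPT V5 compatible SDTM/ADaM data.
MAX_NAME_CHARS = 8
MAX_LABEL_CHARS = 40
MAX_CHAR_VALUE_BYTES = 200


@dataclass
class Variable:
    """Hypothetical description of one dataset variable (illustration only)."""
    name: str
    label: str
    values: list = field(default_factory=list)  # sample of character values


def xpt_v5_violations(dataset_name, variables, encoding="utf-8"):
    """Return a list of human-readable messages, one per limit that is exceeded."""
    problems = []
    if len(dataset_name) > MAX_NAME_CHARS:
        problems.append(f"dataset name {dataset_name!r} exceeds {MAX_NAME_CHARS} characters")
    for var in variables:
        if len(var.name) > MAX_NAME_CHARS:
            problems.append(f"variable name {var.name!r} exceeds {MAX_NAME_CHARS} characters")
        if len(var.label) > MAX_LABEL_CHARS:
            problems.append(f"label of {var.name!r} exceeds {MAX_LABEL_CHARS} characters")
        for value in var.values:
            # The limit is in bytes, so the result depends on the encoding.
            if len(value.encode(encoding)) > MAX_CHAR_VALUE_BYTES:
                problems.append(f"a value of {var.name!r} exceeds {MAX_CHAR_VALUE_BYTES} bytes in {encoding}")
                break  # one message per variable is enough
    return problems


example_vars = [
    Variable("USUBJID", "Unique Subject Identifier", ["STUDY1-001"]),
    # Hypothetical long name and a value of 150 accented characters.
    Variable("TREATMENTARM", "Planned Arm", ["é" * 150]),
]
for message in xpt_v5_violations("DM", example_vars):
    print(message)
```

With UTF-8, the 150-character value occupies 300 bytes and is flagged; with a single-byte encoding such as Latin-1 the same value occupies 150 bytes and passes, which is one reason undocumented encodings are risky.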
Dataset-XML:
- Lack of Metadata Support: The Dataset-XML format does not include dataset or variable metadata; instead, it relies on Define-XML for metadata, which adds complexity to its implementation. Importing a Dataset-XML file requires both the Define-XML and some XSLT programming.
- Performance Issues: Parsing and processing Dataset-XML files can be slow due to the verbosity of XML, especially for large datasets. XML-based files are often significantly larger than their XPT counterparts, leading to storage and transmission inefficiencies.
- Limited Industry Adoption: Many existing tools and workflows were not optimized for XML, leading to resistance from industry stakeholders.
- Complexity in Implementation: Dataset-XML introduced technical and structural complexities that made adoption challenging for sponsors, CROs, and regulatory agencies.
The Benefits of Dataset-JSON
Dataset-JSON represents a significant leap forward by addressing the limitations of SAS XPT V5 and Dataset-XML while embracing modern data exchange standards. JSON has become the de facto data exchange format for APIs across industries, making Dataset-JSON a natural choice for modernizing clinical data exchange. Key benefits include:
- Human-Readable and Machine-Friendly Format: JSON is widely used across industries for data exchange, making it easier to integrate with modern programming languages. The format is lightweight and less verbose compared to XML, improving processing efficiency.
- No Length Restrictions on Variable and Dataset Names: Unlike SAS XPT V5, Dataset-JSON allows long and meaningful variable and dataset names, improving data clarity and usability.
- Rich Metadata Support: Dataset-JSON natively supports metadata, allowing better documentation and integration with CDISC standards like Define-XML.
- Enhanced Interoperability: Dataset-JSON is platform-independent and can be easily used across R, Python, SAS, and other programming environments. This is particularly important as the pharmaceutical industry is embracing open-source languages like R for clinical data analysis.
- Efficient Handling of Large Datasets: JSON-based formats support efficient parsing and handling of large datasets, reducing storage and processing burdens.
- Future-Proof and Scalable: With JSON being a widely adopted standard in modern software development, Dataset-JSON ensures longevity and adaptability as technology evolves.
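To make the metadata and long-name benefits concrete, here is a minimal sketch of what a small Dataset-JSON document might look like and how it can be read with a standard JSON parser. The attribute names (`columns`, `rows`, `itemOID`, and so on) follow my reading of Dataset-JSON v1.1 and should be verified against the published CDISC specification; the variable `PLANNEDTREATMENTARM` and the helper `rows_as_dicts` are hypothetical and used only for illustration.

```python
import json

# Illustrative document: attribute names are assumed from Dataset-JSON v1.1
# and are not a complete or validated instance of the specification.
dataset = {
    "datasetJSONCreationDateTime": "2025-01-15T10:00:00",
    "datasetJSONVersion": "1.1.0",
    "itemGroupOID": "IG.DM",
    "name": "DM",
    "label": "Demographics",
    "records": 2,
    "columns": [
        {"itemOID": "IT.DM.USUBJID", "name": "USUBJID",
         "label": "Unique Subject Identifier", "dataType": "string", "length": 10},
        {"itemOID": "IT.DM.AGE", "name": "AGE",
         "label": "Age", "dataType": "integer"},
        # Hypothetical 19-character name that XPT V5 could not hold.
        {"itemOID": "IT.DM.PLANNEDTREATMENTARM", "name": "PLANNEDTREATMENTARM",
         "label": "Planned Treatment Arm", "dataType": "string", "length": 12},
    ],
    "rows": [
        ["STUDY1-001", 54, "Placebo"],
        ["STUDY1-002", 61, "Active 10 mg"],
    ],
}


def rows_as_dicts(doc):
    """Pair each row's values with the column names from the metadata."""
    names = [column["name"] for column in doc["columns"]]
    return [dict(zip(names, row)) for row in doc["rows"]]


# Round-trip through text, as would happen when exchanging a file.
text = json.dumps(dataset, ensure_ascii=False)
records = rows_as_dicts(json.loads(text))
print(records[0]["PLANNEDTREATMENTARM"])  # prints: Placebo
```

Because column metadata travels in the same file as the data, the reader needs no separate metadata document just to name and type the columns.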
The Importance of a Platform-Independent Data Exchange Format
The pharmaceutical industry is witnessing a shift towards open-source programming languages, with R gaining traction for clinical trial analyses and reporting. Legacy formats like SAS XPT V5 create barriers for non-SAS users, requiring cumbersome conversion processes. Dataset-JSON eliminates these barriers by providing a truly platform-agnostic format that facilitates seamless data exchange across diverse programming environments.
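As an illustration of the platform-agnostic point, the sketch below streams a newline-delimited (Dataset-NDJSON style) file using only Python's standard library, with no conversion tool. It assumes the first line is a metadata object containing a `columns` list and that each later line is a JSON array holding one row; that layout is my understanding of Dataset-NDJSON and should be checked against the specification. The function name `iter_ndjson_rows` and the sample data are hypothetical.

```python
import io
import json


def iter_ndjson_rows(lines):
    """Yield one dict per data row from an iterable of NDJSON text lines.

    Assumption: the first non-empty line is a metadata object with a
    "columns" list; every later line is a JSON array with one row.
    """
    names = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        parsed = json.loads(line)
        if names is None:
            names = [column["name"] for column in parsed["columns"]]
            continue
        yield dict(zip(names, parsed))


# In-memory stand-in for a file; made-up vital signs data.
sample = io.StringIO(
    '{"name": "VS", "columns": [{"name": "USUBJID"}, {"name": "VSTESTCD"}, {"name": "VSSTRESN"}]}\n'
    '["STUDY1-001", "SYSBP", 120]\n'
    '["STUDY1-002", "SYSBP", 135]\n'
)

# Rows are processed one at a time, so the whole dataset never has to fit in memory.
high = [row["USUBJID"] for row in iter_ndjson_rows(sample) if row["VSSTRESN"] > 130]
print(high)
```

With a real file, the same function can be fed an open file handle, for example `with open(path, encoding="utf-8") as handle:` followed by a loop over `iter_ndjson_rows(handle)`, where `path` is whatever file is being read.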
By adopting Dataset-JSON, sponsors, CROs, regulatory agencies, and data vendors can improve efficiency, enhance data transparency, and future-proof their data exchange processes. Dataset-JSON is intended to replace SAS XPT as the primary regulatory submission format, offering a more efficient and interoperable solution for clinical trial data.
Conclusion
Dataset-JSON marks a transformative step in clinical research data exchange, addressing the long-standing limitations of SAS XPT V5 and Dataset-XML. With its modern, metadata-rich, and platform-independent design, Dataset-JSON paves the way for enhanced interoperability and efficiency in clinical trials. As the industry continues to evolve, adopting Dataset-JSON will be crucial for ensuring seamless data sharing and integration across programming environments and systems.
Senior Vice President, Product Strategy at EDETEK, Inc.
3 weeks ago: Important development! API-based communication and system-to-system data exchange are better than XPT and ODM. We fully endorse it and will add support for Dataset-JSON across our data and metadata applications in 2025.
Research Data Engineer | Open Source | Data Exchange Standards
3 weeks ago: Nice article. We will soon be sharing the results of our recent Dataset-JSON Viewer Hackathon. It's great to see the growing set of open-source Dataset-JSON tools. We're also starting work on a standard Dataset-JSON API and have a draft spec ready for discussion.
Chief Technology Officer at KCR
3 weeks ago: Nice clear article. I too would not describe JSON as transformative over XML, but I do believe it is a further reason to embrace the standard over SAS V5 XPT. JSON handling in most languages, including Python, is incredibly easy, which reduces any resistance to change. Thx for sharing.
Passionate about standards in clinical research and healthcare, and their implementation in IT systems.
3 weeks ago: Good analysis, although I do not agree with everything, e.g. regarding storage and processing efficiency of XPT and Dataset-XML. Although JSON also has its disadvantages (e.g. it is harder to transform than XML, which has XSLT and XQuery), it surely is the most suitable format for use with APIs. And that is exactly the direction regulatory submission should go in: away from "files" and into machine-to-machine communication. That would, e.g., allow regulatory authorities to start review even when only 10-20% of the data has been collected. That of course also requires a mentality change at the regulatory authorities.
Manager at Parexel
3 weeks ago: Great article. You already wrote that JSON is a common format for API data exchange, and I think that this is one of the biggest Dataset-JSON advantages. Formally you can implement exchange of XPT data via API as well, but with Dataset-JSON it is done natively, so you do not have to make any transformations of the data for that. Hopefully some years from now we will not upload files anymore, but just exchange data via API.