The PDB file format must DIE! Time for a Protein Revolution!

Are you tired of wrestling with the outdated, clunky PDB format for storing protein structural information? Are you fed up with its limitations and inconsistencies?

If you work in computational biology, protein engineering or bioinformatics, you have undoubtedly had your fair share of struggles with the ancient relic known as the Protein Data Bank (PDB) file format.

I say ancient because the PDB format was invented in 1976. For context, ENIAC (the first programmable, electronic, general-purpose computer) was completed in 1945, and it was still powered by vacuum tubes! That is 31 years between ENIAC and the first PDB specification, and 48 years between PDB and now. Yet we are still stuck with PDB!

To be fair, PDB has been the workhorse for many structural biologists for many years and it is supported by numerous tools. But it is 2024 now. We are generating massive biological datasets. We're simulating and calculating properties of proteins that PDB simply can't handle. We have amazing machine learning models that are starving for data.

Here are a few reasons why PDB must kick the bucket:

Format Frustration: Dealing with a format as user-friendly as a Rubik's Cube on a Monday morning is no joke. PDB files are notorious for their lack of flexibility and compatibility. Want to add an additional property to an atom? Good luck trying to hack the fixed 80-character column width to store your property (wink, wink, PQR). Seriously, is anyone still reading a PDB file on a terminal?

Interoperability Woes: Ever tried to combine different tools and applications, only to discover that they speak entirely different PDB dialects? It's like attending a conference where everyone is speaking a different language. With PDB’s maze of columns, cryptic codes and inflexible representation, it is no wonder that different tools interpret the format slightly differently. Unfortunately, this introduces subtle incompatibilities between tools, which often go unnoticed.

Anomalies Galore: Ever opened a PDB file only to find missing atoms, wonky residues, or bizarre atomic clashes? It's like stumbling upon a hidden treasure trove (read: skeleton graveyard) of biological oddities. Who has time to play detective with protein structures in a never-ending game of “spot the anomaly”? Unfortunately, every new PhD student or researcher who downloads a PDB file from the RCSB has to analyse and deal with these oddities all over again, just like hundreds of people before them.
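To make the fixed-column complaint above concrete, here is a minimal sketch of what reading a single ATOM record involves. The column ranges follow the published PDB format specification; the function name `parse_atom_line` and the example line are illustrative only, and this is not how Protkit or any particular tool does it.

```python
# Column ranges (1-based, inclusive) for an ATOM record in the PDB format.
# The parser below is a hypothetical sketch, not taken from any library.
ATOM_FIELDS = {
    "record": (1, 6),
    "serial": (7, 11),
    "name": (13, 16),
    "res_name": (18, 20),
    "chain_id": (22, 22),
    "res_seq": (23, 26),
    "x": (31, 38),
    "y": (39, 46),
    "z": (47, 54),
    "occupancy": (55, 60),
    "temp_factor": (61, 66),
    "element": (77, 78),
}

INT_FIELDS = ("serial", "res_seq")
FLOAT_FIELDS = ("x", "y", "z", "occupancy", "temp_factor")


def parse_atom_line(line: str) -> dict:
    """Slice one ATOM line by column position and convert numeric fields."""
    padded = line.ljust(80)  # short lines are padded to the full 80 columns
    atom = {
        key: padded[start - 1:end].strip()
        for key, (start, end) in ATOM_FIELDS.items()
    }
    for key in INT_FIELDS:
        atom[key] = int(atom[key])
    for key in FLOAT_FIELDS:
        atom[key] = float(atom[key])
    return atom


# A made-up example line; every value must sit in exactly the right columns.
example_line = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
)
atom = parse_atom_line(example_line)
print(atom["name"], atom["res_name"], atom["chain_id"], atom["x"])
```

Every field is identified purely by its position, so shifting a value by one character silently corrupts the record. There is also no slot for a new per-atom property: it has to be squeezed into, or take over, an existing column range, which is the trick PQR plays with the occupancy and B-factor columns.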


OK, but what is the alternative?

Some attempts have been made to improve upon the PDB file format. MMTF (Macromolecular Transmission Format) and mmCIF (macromolecular Crystallographic Information File) are two examples. In a later article I will discuss their pros and cons.

Here, I want to look at some of the properties that a competitor to PDB should possess:

Flexibility and Extensibility: We need a format that's as flexible as a yoga instructor and as adaptable as a chameleon at a rainbow convention. Adding new properties should be as straightforward as adding avocado to toast – no complicated rituals or arcane workarounds required! Calculating surface areas of residues or atoms? We should be able to easily store them as properties of those entities. Adding annotations to a chain in a protein? The format should allow you to do so easily and efficiently.

Robustness and Consistency: An ideal format should minimize the occurrence of anomalies and errors commonly encountered in PDB files. Consistent data representation and validation mechanisms are crucial to maintain data integrity and reliability.

Interoperability: Imagine a world where different computational tools hold hands and sing Kumbaya instead of the Tower of Babel scenario we are currently facing. This would allow us to build interoperable pipelines quickly and efficiently. This is facilitated by an easy-to-understand, yet clearly defined format based on already-existing technologies and best practices.

Efficiency and Scalability: We need a format that’s as lean and mean as a marathon runner on a low-carb diet. It should be optimized for efficient storage, transmission, and processing of large-scale structural data. It should leverage modern compression techniques and data structures to minimize storage overheads and computational burdens.


To solve some of these problems, we have developed Protkit - a unified toolkit for protein engineering.

Protkit is an open-source Python library that can be used for a variety of tasks in computational biology and bioinformatics, focusing on structural bioinformatics, protein engineering and machine learning. You can read about our aims with Protkit at https://protkit.silicogenesis.com/.

As part of this initiative, we have developed the Prot file format to address the shortcomings of PDB files. Prot files aren’t just another data format - they're the superhero capes of the protein engineering world, swooping in to rescue researchers from the clutches of PDB-induced despair. It's like upgrading from a flip phone to the latest smartphone.

Prot files have the following advantages over PDB files:

Multi-protein: A single Prot file can store multiple proteins. You can pass around an entire database of structural protein data in a single file (a little like the FASTA file format for sequences).

Hierarchical: Prot files maintain the hierarchical properties that are inherent in proteins. Proteins contain chains, which contain residues, which contain atoms. This makes working with protein structures much easier.

Extensible: You can add any new property at any level in the hierarchy. Besides the core properties, the format is not prescriptive - you can add different property types (from simple numbers and text to arrays or dictionaries) using any key you desire.

Open: The file format is open and based on JSON. Both compressed and uncompressed versions are supported. JSON is easy to understand and many software libraries exist to work with it. Compression using the zlib library ensures that file sizes are small. In our experiments, compressed JSON files are significantly smaller than corresponding PDB files.
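To give a feel for the ideas in the list above, here is a minimal sketch of a hierarchical, extensible, zlib-compressed JSON structure using only the Python standard library. Please note that the key names (`chains`, `residues`, `atoms`, `surface_area` and so on), the helper functions and all the values are made up for illustration; this is not the actual Prot schema or the Protkit API.

```python
import json
import zlib

# Hypothetical layout: a list of proteins, each containing chains,
# which contain residues, which contain atoms. Key names are illustrative.
proteins = [
    {
        "id": "example_protein",
        "chains": [
            {
                "chain_id": "A",
                "residues": [
                    {
                        "residue_type": "MET",
                        "sequence_no": 1,
                        "atoms": [
                            {"atom_type": "N", "x": 27.340, "y": 24.430, "z": 2.614},
                            {"atom_type": "CA", "x": 26.266, "y": 25.413, "z": 2.842},
                        ],
                    }
                ],
            }
        ],
    }
]

# Extensibility: attach new properties at any level, under any key.
chain = proteins[0]["chains"][0]
chain["annotation"] = "example annotation"        # chain-level text
residue = chain["residues"][0]
residue["surface_area"] = 42.0                    # residue-level number (made up)
residue["atoms"][0]["tags"] = ["backbone"]        # atom-level array


def dumps_compressed(data) -> bytes:
    """Serialise to JSON text, then compress the UTF-8 bytes with zlib."""
    return zlib.compress(json.dumps(data).encode("utf-8"))


def loads_compressed(blob: bytes):
    """Reverse of dumps_compressed: decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


blob = dumps_compressed(proteins)
restored = loads_compressed(blob)
print(restored[0]["chains"][0]["residues"][0]["surface_area"])
```

Because the added properties are ordinary JSON values, they survive the round trip without any change to the reader or writer, which is the point of keeping the format non-prescriptive.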

In time, we will improve or extend the format to support various machine learning applications and publish datasets with sanitised protein structural data.

In summary, Prot files keep it simple, like a kindergarten finger-painting, minus the mess.


I realise that a lot of what I have written is a bit tongue-in-cheek. But I really do invite you to explore Protkit and the Prot file format. It may pleasantly surprise you. Follow the links available at https://protkit.silicogenesis.com/ for more information.

P.S. Have a PDB horror story to share? Drop it in the comments below! Let's commiserate over our collective struggles and celebrate the dawn of a new era in protein engineering.

Fred Senekal


Nikhil Haas

Co-Founder/CEO @ BioLM.ai | Better Bio-AI Tools | Ex Twist Bio

7 months ago

Half the time the PDBs aren’t formatted correctly to run any structural comparison. We always have to figure out which PDB generator will work with a tool…

Danny Diaz

ML Protein Engineer @ IFML | Entrepreneur

7 months ago

I am a big fan of moving everything over to the cif format. I agree the pdb format must die. I hesitate to use tools that only have a pdb parser.

Pedri Claassens

Scientist, Developer (full-stack exposure) and fascinated by the Human condition.

7 months ago

As someone who had to work with legacy file systems, for example .sam and .bam files plus good old samtools, I actually agree. Those tools did a great job and got us where we needed to be. But it is time to harness new developments and technologies.

Mechiel Nieuwoudt

BioAI Research Engineer at InstaDeep

7 months ago

I've faced challenges with the PDB format in large simulations, especially when numbering water molecules exceeded 9,999, leading to non-standard residue ID solutions like base 36 encoding. However, compatibility issues arose with OpenMM, which does not support this method when writing files, resulting in duplicate residue IDs and unusable output files. This underscores the limitations of PDB for large models and highlights the need for transitioning to formats like the Prot files.

Alexander Kislukhin, PhD

Solving pain with data | I build life sci data supply chains

7 months ago

I'm on board with whatever, as long as it's not XML.
