Recognizing Fake FAIR
[Image: The real or the fake Venice?]

FAIR is a recurring buzzword in modern enterprise-level data management. As a consequence, almost any new data initiative in various enterprises sets out by stating "everything we do is FAIR". What follows are various levels of understanding, ranging from fair as in "fair play", through extremely shallow interpretations of Findable, Accessible, Interoperable, and Reusable, to the more sophisticated linkage to the respective parts of the 15 FAIR Guiding Principles. The latter is rarely achieved, and it often comes as a surprise that there are indeed 15 principles, not four.

It doesn't matter whether we are talking about in-house developments or licensed vendor solutions: we need some simple rules to distinguish the "real thing" from other software and storage solutions that may still have value for the scenarios they were originally designed for. In the end, we should not forget that FAIR is about both the mindset of good data citizenship (R1) and the technical tools that help facilitate it.

The value of FAIR data

"To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards"
The FAIR guiding principles

First things first: why is the move towards FAIR data important? Why should we do "FAIR by the letter"? This is a million-dollar question. I believe the best answer is to quote the amount of value lost by not doing it. Consider, for example, the energy that is wasted by "FAIRifying" existing datasets while fully acknowledging that this is only possible up to a certain degree: metadata and the descriptive parameters around the context of data creation cannot be imputed. They are lost from the moment they are not captured. This issue typically surfaces in meetings when the decision maker asks the data scientist "what is your exact definition of 'severe'?". Clearly, this information was present in the minds of many when the dataset was originally created at company A (initial $$$). However, nobody really thought about that specific field when it was acquired ($$$) by company B. And the team of six highly skilled (and highly paid $$$) data scientists has massaged the data and trained their predictive models for the last five months ($$$) to come up with the presented breakthrough finding (future $$$). The answer then is "we need to go back to company A and see if we find someone who can get this information for us". All of a sudden, the "most innovative thing that company B has seen in the past five years" depends on good contacts and the goodwill of folks in company A as well as, potentially, simple luck.

The critical reader will say: "This can also happen to a FAIR dataset". The answer is no, not really. In the FAIR world, the word 'severe' only exists as a label on a metadata field identified by a persistent and global identifier such as https://purl.example.org/a2m34 (see F1). This metadata field, next to a label, also has a description (which answers the question of the executive in company B), a creator, provenance (R1.2), same-as links to other standards that use it in the same way (I3, R1.3), and many other descriptive links that help to get a clear picture of what https://purl.example.org/a2m34 is really about (see F2 and R1.x). The pointer to the creator in particular can help to identify the individual who really should know; without it, in many firms, reconstructing this field alone can turn into a multi-week rabbit hole. Last but not least, the URL can be resolved, so all this information is just one click away for the overwhelmed data scientists (see A1 and A2).
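To make this concrete, here is a minimal sketch (Python with rdflib) of what such a metadata record could look like. Apart from https://purl.example.org/a2m34, all URLs, property choices, and the description text are illustrative assumptions:

```python
# A minimal sketch of a FAIR metadata record for the hypothetical field
# https://purl.example.org/a2m34; all auxiliary URLs are made-up placeholders.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import RDFS, OWL, DCTERMS

field = URIRef("https://purl.example.org/a2m34")  # F1: globally unique, persistent
g = Graph()

g.add((field, RDFS.label, Literal("severe", lang="en")))
g.add((field, RDFS.comment, Literal(  # the description the executive asked for
    "Assumed example definition: an incident requiring hospitalization.", lang="en")))
g.add((field, DCTERMS.creator, URIRef("https://example.org/staff/jane-doe")))
g.add((field, DCTERMS.provenance, URIRef("https://example.org/prov/a2m34")))  # R1.2
g.add((field, OWL.sameAs, URIRef("https://example.org/other-standard/severity-3")))  # I3, R1.3

print(g.serialize(format="turtle"))  # one click (or one GET) away, see A1/A2
```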

We need to consider all of the above when we talk about respecting the FAIR principles. Unfortunately, this call for "proper data documentation at the point of creation" is often misinterpreted in various ways.

Note: When the last part of the URL is opaque (i.e., "appears random"), the dataset cannot be interpreted even partially without all the descriptive parameters. As such, company B would not even consider acquiring a dataset in which incidents or diseases are rated with "random" URLs that have no further meaning attached to them. In addition, the adjective "severe" may trigger assumptions and misinterpretations. Just check the comment on FHIR's Patient.gender field to get an idea of the nuances that can come with the use of natural language. The proverb "half knowledge is dangerous" applies very often when important metadata is non-existent, not rich enough, or inaccessible.

Fake FAIR

When you ask folks about FAIR, they often make four claims that interpret the words behind the acronym. In many cases, these interpretations are extremely shallow and miss the actual point. Here is a fictitious example:

  1. Findable: "Our platform supports free-text search and can index strings" - no mention of identifiers, their global uniqueness, and metadata (see F1-F4)
  2. Accessible: "Our platform runs in the cloud and can be accessed from anywhere on this planet with low latency. We solve this with GSLB" - no mention of identifiers, standardized open communication protocols, and metadata (A1.x, A2)
  3. Interoperable: "Our platform is compatible with Kafka, can import CSV files of any sort, and can also make ODBC connections" - no mention of knowledge representation, vocabularies, and references to other (meta)data (I1-I3)
  4. Reusable: "Our platform was specifically designed for various use cases and offers full flexibility. Talk to our sales engineers if you are missing anything" - no mention of licensing, provenance, and community standards (R1.x).

Obviously, what we see here is a classic "playing chess with a pigeon" situation. All of these claims are only loosely related to the FAIR concept and almost offensive to it.

If you have made it this far, you should have a coarse understanding of the value of FAIR data and of why completely free interpretations of the four words should not convince us. Each word of the acronym does have a clear definition, but the FAIR concept is also not prescriptive about specific technology choices. We will now present a set of questions that should help the reader detect more subtle misinterpretations.

FINDABLE

"Each data point gets a unique identifier and we store metadata in an indexed resource."

Questions to ask:

  1. How do you ensure global uniqueness of the identifiers?

The answer should include one of two concepts: URIs or UUIDs (they can also be combined). Considering also A1, it should boil down to URLs (a subclass of URIs).
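A minimal sketch of the combined approach, assuming a hypothetical PURL-style resolver as the base URL:

```python
# Minting globally unique, resolvable identifiers: an HTTPS base URL (the
# URI/URL part, assumed to resolve) combined with a UUID (global uniqueness).
import uuid

BASE = "https://purl.example.org/"  # hypothetical persistent resolver

def mint_identifier():
    return BASE + str(uuid.uuid4())

print(mint_identifier())  # e.g. https://purl.example.org/0f8c4e9a-...
```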

  2. How do you add metadata to metadata?

What we would expect here is something along the lines of "there is no clear distinction between data and metadata". This seems counterintuitive, but from a practical perspective it makes sense. For example, in many applications it is relevant which fields of the actual data require a certain level of privacy protection. This "privacy level" field is metadata about a metadata field that says "customer name" (and others). Note that the actual "data" here are still values of the "customer name" attribute, for example "John Doe". Therefore, both fields, "privacy level" and "customer name", need their own metadata definitions, while the former is part of the metadata of the latter. This can get arbitrarily nested (even with loops), and it is therefore better not to distinguish between data and metadata. See also Kees' Rant about 'metadata'.
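A minimal rdflib sketch of this nesting; the namespace and field names are illustrative assumptions:

```python
# Metadata about metadata: "privacy level" describes the "customer name"
# field, while both are first-class, described resources in the same graph.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("https://purl.example.org/")  # illustrative namespace
g = Graph()

g.add((EX.customerName, RDFS.label, Literal("customer name")))
g.add((EX.privacyLevel, RDFS.label, Literal("privacy level")))

# The metadata field EX.customerName is itself annotated with metadata:
g.add((EX.customerName, EX.privacyLevel, Literal("confidential")))

# The actual data lives in the same graph; there is no hard boundary:
g.add((EX.record42, EX.customerName, Literal("John Doe")))

print(g.serialize(format="turtle"))
```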

ACCESSIBLE

"We can provide a Web link for each resource and offer different authentication + authorization methods."

Questions to ask:

  1. Is the Web link the identifier?

The answer should elaborate that the metadata indeed uses stable URLs to refer to the data items it describes (F3). Otherwise, you find yourself in a traditional database "my primary key is your foreign key" situation.

  2. What happens when the data is deleted?

The metadata should remain available and include an update informing us that the "original" resource has been deleted (A2). How does this relate to the previous point on "metadata about metadata"? In essence, A2 also means that metadata should generally not be deleted, no matter at which level of the nested metadata structure you are.
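One way such a "tombstone" could be expressed, sketched with rdflib; EX.status is an assumed, non-standard property, not a fixed convention:

```python
# A tombstone sketch: the data is gone, but its metadata record remains
# and is updated to say so (A2). EX.status is an assumed property.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("https://purl.example.org/")
g = Graph()

g.add((EX.a2m34, EX.status, Literal("deleted")))
g.add((EX.a2m34, DCTERMS.modified, Literal("2023-05-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```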

  3. Can it happen that a new (meta)data item receives the identifier of a deleted (meta)data item?

When we work with traditional databases (which often use counters with auto-increment), this is basically impossible. However, there are cases where identifiers are randomly generated and the database is probed for an existing record. If an existing record is not invalidated but actually removed by the "delete" action, you may find yourself in a situation where the probe succeeds even though the identifier has existed before. In that scenario, other resources (such as the metadata of the removed item, which was not deleted, see A2) may still point to the identifier but now describe a different resource.
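A minimal, database-free sketch of the pitfall: with hard deletes, a probe-based generator can re-issue an old identifier, while a tombstone set keeps retired identifiers off-limits (all names are illustrative):

```python
# Probe-based ID minting: hard deletes allow identifier reuse; a tombstone
# set of retired IDs prevents it.
import random

live_ids = set()
retired_ids = set()  # tombstones: identifiers that must never be reused

def mint(use_tombstones):
    while True:
        candidate = "id-%03d" % random.randrange(1000)  # deliberately tiny ID space
        if candidate in live_ids:
            continue  # probe hit: identifier is in use
        if use_tombstones and candidate in retired_ids:
            continue  # probe hit: identifier existed before and stays blocked
        live_ids.add(candidate)
        return candidate

def delete(identifier):
    live_ids.discard(identifier)
    retired_ids.add(identifier)  # with a hard delete, this line is missing

old = mint(use_tombstones=True)
delete(old)
# Without the tombstone set, `old` could eventually be minted again, and any
# metadata still pointing to it would silently describe a different resource.
```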

INTEROPERABLE

"We offer REST APIs with a Swagger description."

Question to ask:

  1. What is your "formal, accessible, shared, and broadly applicable language for knowledge representation" (I1)?

The answer should mention an ontology or an RDF vocabulary/schema (or some type of frame-based knowledge representation). Note that JSON, XML, and CSV are data interchange and file format languages and are therefore only compatible with I1 when they are RDF serializations of an ontology (in the form of RDF/XML, JSON-LD, or N-Triples). So what is the difference?

Non-technically: JSON (steel), XML (wood), and CSV (plastic) are materials with which you can do pretty much anything, but RDF is a chair that can be made of any of these materials. It is on a different abstraction level. When we ask for a "knowledge representation language", we are looking for a comfortable place to sit. Why is this important? We are having a meeting (of datasets) and everyone brings their own chair. We don't care about the material any brought chair is made of - we can still hold the meeting - but if anyone brings a flag pole, we are going to have a hard time.

Technically: JSON, XML, and CSV are extremely versatile and can be used in many forms and shapes. The RDF/XML, JSON-LD, and N-Triples (most similar to CSV) serializations of RDF restrict this flexibility and add rules that make them suitable for knowledge representation. The most prominent and, in my opinion, most important rule is that relations between any two data points have, by default, a clearly defined meaning attached to them. Why is this important? Other FAIR data providers also use RDF for knowledge representation. In the very best case, two or more RDF graphs can be joined into a bigger graph with an operation as simple as concatenating two or more files - without any ETL operations.
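A minimal sketch of this merge-by-concatenation property; the two N-Triples snippets stand in for files from two hypothetical providers:

```python
# Two N-Triples "files" from different providers form one valid RDF graph
# by plain string concatenation; no ETL step is required.
from rdflib import Graph

provider_a = """\
<https://purl.example.org/a2m34> <http://www.w3.org/2000/01/rdf-schema#label> "severe"@en .
"""
provider_b = """\
<https://purl.example.org/a2m34> <http://purl.org/dc/terms/creator> <https://example.org/staff/jane-doe> .
"""

g = Graph()
g.parse(data=provider_a + provider_b, format="nt")  # concatenation is the whole merge
print(len(g))  # 2 triples about the same subject, one joined graph
```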

All of this comes down to the self-descriptiveness of RDF. In a JSON attribute that is named "gender" there are just six letters and lots of wrong assumptions. Behind "https://hl7.org/fhir/Patient.gender" there is a whole universe.
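A minimal sketch of how JSON-LD bridges the two worlds: an @context maps the bare attribute name to the self-descriptive URL (assuming rdflib 6+, which ships a JSON-LD parser; the patient URI is illustrative):

```python
# JSON-LD: the @context turns the six-letter key "gender" into the
# self-descriptive identifier https://hl7.org/fhir/Patient.gender.
import json
from rdflib import Graph

doc = {
    "@context": {"gender": "https://hl7.org/fhir/Patient.gender"},
    "@id": "https://example.org/patients/123",  # illustrative patient URI
    "gender": "female",
}

g = Graph()
g.parse(data=json.dumps(doc), format="json-ld")
print(g.serialize(format="nt"))  # one unambiguous triple instead of six letters
```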

REUSABLE

"We use our custom vocabulary but you can provide extensions."

Questions to ask:

  1. Do we have access to information about when the (extension) metadata fields were created and by whom?

This adds to the above point on "metadata for metadata". We need to be able to track decision processes on shared vocabularies (see also this cookbook: schema.org - A Blueprint for Enterprise Ontology Engineering).
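A minimal sketch of such per-term provenance with Dublin Core terms; the URIs and dates are illustrative:

```python
# Tracking who created an extension field and when, using Dublin Core terms.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("https://purl.example.org/")
g = Graph()

g.add((EX.privacyLevel, DCTERMS.created, Literal("2022-11-03", datatype=XSD.date)))
g.add((EX.privacyLevel, DCTERMS.creator, URIRef("https://example.org/staff/jane-doe")))

print(g.serialize(format="turtle"))
```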

  2. What is the licensing scheme for the part that is already available?

This should permit you to share or even sell your datasets without any legal risk. In the very best case, the license is machine-readable.
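In RDF, a machine-readable license can be as little as a single dcterms:license triple; the dataset URI below is an illustrative assumption:

```python
# R1.1: attaching a machine-readable license to a dataset.
from rdflib import Graph, URIRef, Namespace
from rdflib.namespace import DCTERMS

EX = Namespace("https://purl.example.org/")
g = Graph()
g.add((EX.dataset42, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))

print(g.serialize(format="turtle"))
```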

  3. How is your custom schema compatible with community standard X?

This is quite interesting. In the case of taxonomies, for example, you could ask about compatibility with SKOS. In the domain of healthcare there is no single standard, but key players such as SNOMED, FHIR, CDISC, and LOINC should be part of the discussion.
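A minimal SKOS sketch linking a custom concept to a community standard; the custom URI is illustrative and the SNOMED code shown is only a placeholder:

```python
# Mapping a custom taxonomy concept to a community standard via SKOS.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://purl.example.org/")
g = Graph()

g.add((EX.severe, RDF.type, SKOS.Concept))
g.add((EX.severe, SKOS.prefLabel, Literal("severe", lang="en")))
g.add((EX.severe, SKOS.exactMatch, URIRef("http://snomed.info/id/24484000")))  # placeholder code

print(g.serialize(format="turtle"))
```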

Summary

The set of questions for each of the FAIR principles above is non-exhaustive. Please feel free to comment and suggest additional questions. It also has to be noted that there are several articles and initiatives that "measure" FAIRness with metrics. These evaluators are often of a very technical, quantitative nature. This article tries to complement them with a set of qualitative questions.

It is almost tautological that FAIR comes with a clear definition. After all, this is what FAIR is about: structured information creation and sharing with clear definitions. Its rapid adoption was most likely also due to the fact that the authors were not prescriptive about any specific technology (next to the catchy acronym). However, we should be aware of situations where the acronym is misused, often to the point of absurdity. We should also be aware that the full value only materializes if we respect all 15 principles.

I would like to thank Javier D. Fernández and Nelia Lasierra for their useful comments on a draft of this article. The article itself was inspired by "Detecting Agile BS" and "Understanding Fake Agile".

Comments

Hi Friends, I cannot believe I only got this post now, through Erik Schultes. I see that most of my direct colleagues have already reacted some time ago. Myself, I mostly use the term 'pseudo-FAIR' instead of fake; it sounds less 'intentional' to me. But indeed, this trend of re-interpreting FAIR to pretend your data is FAIR (or 'always was', as Rob noted) is a downside of the compelling acronym. The goal is R, but indeed there is also the logical chronology. But now comes the good news: FAIR is now too strong as a concept to be nullified by people (even some co-authors of the original paper, I regret to say) and reduced to some vague properties. As Mark said very eloquently in a recent seminar: 'if you want to know how FAIR you are, ask a machine'. Basta. As he said, we are only recently approaching the situation where some evaluators truly measure consensus features of the principles, but we never said FAIR would be easy. The 'machine actionable' aspect (clearly spelled out in the seminal paper) mostly scares people in traditional data management. However, my counter-argument is always: because machines are so stupid, they force you to be much more precise, and thus reduce ambiguity, and thus FAIR also helps people. I love that this comes from 'industry'.

Dana Vanderwall

Business Consultant at Digital Lab Consulting - Accelerating science through digital transformation

1y

#FAIRwashing

Mario Moser

Enabling data-driven capabilities in engineering | Wissenschaftlicher Mitarbeiter (Research Associate) am Werkzeugmaschinenlabor WZL der RWTH Aachen University

1y

Great article to explain that there is more behind it than just these four letters! And indeed, sometimes I get the impression that one might think 'my data can (somehow) be found, so it is findable according to FAIR'. Even the 15 guiding principles are by design high-level. If you are looking for concrete implementations, you might check out FAIR metrics that are based on these principles, e.g. the Maturity Indicators (MI) or the FAIR Metrics by the Research Data Alliance, FAIRsFAIR, or the European Open Science Cloud, respectively. There are even more, e.g. specifically for software. In addition, several online assessment tools to evaluate FAIRness have been published.

Erik Flikkenschild

Deputy director Go FAIR Foundation

1y

There are of course different community starting points. The next milestone is always more structure and definition, which can be demonstrated: data has now become machine-actionable and is therefore faster to reuse.

Matt Harrison

Head of the UKRI Environmental Data Service

2y

This is a really thought-provoking article and very true. In fact, since I read it, more and more examples keep popping into my head.
