Data Interactions, FAIR Data and Digital Twins
Or - "The if statement that changed the world"
In plays, films, books and music there is often a key “moment” where everything in the story comes together. Think of the “I am your father” moment from Star Wars or the Odette/Odile moment in Tchaikovsky’s Swan Lake, with its change from major to minor. Software and data engineers know these moments, too: the moment when, after days of work, you get everything together in your code and data so you can, *finally*, write and run “The if statement that changed the world”.
A version of the “if” statement might be:
if river.level > X and rainfall.forecast > Y
I’m sure you can write the “then” part of this “if” yourself, and it’s likely to involve millions of pounds of damage, weeks of transport disruption and possible loss of life.
This “if” statement is our first kind of data interaction. A computer algorithm (important emphasis) has brought two pieces of data together so they can be compared and some insight gleaned. The story of what those pieces of data are and how they get to the “if” statement is more complex than you might think.
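As a minimal sketch, the comparison above could be written in Python as follows. The article leaves X and Y unspecified, so the threshold values, their units, and the names `Reading` and `flood_risk` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """A single value from a data source (hypothetical structure)."""
    value: float


# Hypothetical thresholds standing in for X and Y in the article.
RIVER_LEVEL_THRESHOLD = 3.0   # assumed to be metres
RAINFALL_THRESHOLD = 25.0     # assumed to be millimetres


def flood_risk(river_level: Reading, rainfall_forecast: Reading) -> bool:
    """Bring the two pieces of data together and compare them."""
    return (river_level.value > RIVER_LEVEL_THRESHOLD
            and rainfall_forecast.value > RAINFALL_THRESHOLD)
```

The function itself is trivial; the hard part, as the rest of the story shows, is getting the two readings to it.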
The first chapter of the story of how the river levels and the rainfall get together is about finding data. There’s no search engine for data: not publicly and rarely within enterprises. There are attempts at searchability such as data.gov.uk, but they are intended for people rather than for algorithms. It’s said that data scientists spend at least 50% of their time looking for data: not looking at data, looking for it. This epic waste of time is because data is hidden away, deliberately or unintentionally, in silos, in datasets, behind APIs or in program-unfriendly formats, such as PDF. Data like this is definitely not findable by machines, but what if it were?
The next chapter of the story is about access and interoperability. The two are linked. Our imaginary computer might be able to find some data but it might not be able to understand it. It would definitely help interoperability if the data came with some metadata to indicate that the river level was measured in metres and the rainfall in millimetres. What if it did? We now have Find, Access and Interoperate, and the data interaction in the “if” statement is Re-using that data for our new purpose. There’s more to the story of FAIR data, but that’s in another post.
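To make the units point concrete, here is a small sketch of a value that carries its unit as metadata, so that an algorithm can normalise it before comparing. The `Measurement` class, the `to_metres` helper and the unit codes are hypothetical; a real system would more likely rely on an agreed vocabulary of units than on a hand-written table.

```python
from dataclasses import dataclass

# Conversion factors to metres for the few units used in this sketch.
TO_METRES = {"m": 1.0, "cm": 0.01, "mm": 0.001}


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str  # metadata that travels with the value


def to_metres(measurement: Measurement) -> float:
    """Normalise a measurement to metres, refusing units we do not know."""
    if measurement.unit not in TO_METRES:
        raise ValueError(f"unknown unit: {measurement.unit!r}")
    return measurement.value * TO_METRES[measurement.unit]
```

Without the `unit` field, a river level of 2.5 and a rainfall of 30 are just numbers, and the algorithm has no way of knowing what it is comparing.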
If we dare to imagine the “What if”: a world where data is FAIR, and our computer can find this data and understand it, we still can’t program it with that “if” statement when the data is in large datasets. What our algorithm also needs is the river level at this location and the rainfall forecast at a different location, probably well upstream from the place where the flood is likely to occur. So, even if our algorithm can find the right dataset, it still needs to know how to run a query against the dataset to find the data it wants. There’s an element of granularity of the data that is important - and that’s where the digital twins come in.
Digital twins are a virtualisation of an asset’s data. The asset is a useful level of granularity here. Our algorithm needs to be able to choose these rainfall forecasts but not all of them, and those river levels but not others. Metadata about the assets beyond their location might also be useful. Knowing who operated them would help our algorithm to assign weight to the readings if some operators’ data proved more reliable and accurate. Having some provable provenance of the data as actually coming from that twin and the twin really being the one operated by the Environment Agency, for example, would build trust in the output of our imagined algorithm. The exchange of metadata between twins to establish trust and access is our second data interaction.
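The selection and weighting described above might be sketched like this. The `Twin` fields, the `select_twins` and `operator_weight` helpers and the weight values are all assumptions for illustration, and the sketch does not attempt the provable provenance part, which would need some form of signing or verification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Twin:
    """Metadata describing one asset's twin (hypothetical fields)."""
    twin_id: str
    kind: str       # e.g. "river_level" or "rainfall_forecast"
    location: str
    operator: str


def select_twins(twins: Iterable[Twin], kind: str, locations: Set[str]) -> List[Twin]:
    """Pick only the twins of the given kind at the locations we care about."""
    return [t for t in twins if t.kind == kind and t.location in locations]


def operator_weight(twin: Twin, weights: Dict[str, float], default: float = 0.5) -> float:
    """Weight a twin's readings by how reliable its operator has proved to be."""
    return weights.get(twin.operator, default)
```

Because each twin represents one asset, the algorithm chooses by metadata instead of querying one large dataset for the rows it needs.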
The final chapter of the data’s story to get to the “if” statement is about timeliness. Homeowners won’t appreciate being told on Wednesday that a flood would occur on Tuesday when their houses are already knee-deep in muddy water. The data needs to flow between the twins and the algorithm as close to real time as possible so that the predictions are available in a timely way. This is not just important in our imaginary flooding scenario; it’s important in business, where latency between something happening and the business reacting to it can cost millions.
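One simple way to respect timeliness is to refuse to act on readings that are too old. The sketch below assumes each reading carries a timezone-aware observation time; the `is_fresh` name and the 15-minute default are illustrative choices, not something taken from the article.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(observed_at: datetime, now: datetime,
             max_age: timedelta = timedelta(minutes=15)) -> bool:
    """Return True if the reading is recent enough to act on.

    Readings stamped in the future are rejected as well, since they
    usually indicate a clock or data problem.
    """
    age = now - observed_at
    return timedelta(0) <= age <= max_age


# Usage: a reading taken five minutes ago passes the default check.
now = datetime.now(timezone.utc)
recent_ok = is_fresh(now - timedelta(minutes=5), now)
```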
A couple of questions to answer before we conclude. We imagined an algorithm running, exchanging data with digital twins. First question: What does the algorithm do in the “then” part of the “if”? Change a dashboard? Update a database? Send an email? What if it could share the data back with other digital twins, or create new twins of the likely flood locations and have them share into a growing ecosystem of cooperative twins? That would be more FAIR. Second question: In what context does the algorithm run? I don’t think the answer will surprise you... In its own digital twin. Doing it this way simplifies the model (everything is a twin) and creates a nice symmetry in the problem. The twin of the algorithm interacts with the twins of the data sources. Data interactions? Data interactions are twin interactions. Twin interactions are the exchange of data and metadata between twins.
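The “everything is a twin” idea can be sketched with a toy, in-memory model in which the algorithm is itself a twin that follows its data sources and shares its result onward. The class names, the `follow` and `share` methods and the message shapes are hypothetical and stand in for whatever a real twin platform provides.

```python
class Twin:
    """A toy twin that shares data with whoever follows it."""

    def __init__(self, twin_id):
        self.twin_id = twin_id
        self._followers = []

    def follow(self, callback):
        """Register a callback to receive this twin's shared data."""
        self._followers.append(callback)

    def share(self, data):
        for callback in self._followers:
            callback(self.twin_id, data)


class FloodAlgorithmTwin(Twin):
    """The algorithm runs in its own twin and interacts with source twins."""

    def __init__(self, twin_id, river, rainfall, level_limit, rain_limit):
        super().__init__(twin_id)
        self.level_limit = level_limit
        self.rain_limit = rain_limit
        self.latest = {}
        # Remember which source twin supplies which input.
        self._sources = {river.twin_id: "level", rainfall.twin_id: "rain"}
        river.follow(self._on_data)
        rainfall.follow(self._on_data)

    def _on_data(self, source_id, data):
        self.latest[self._sources[source_id]] = data["value"]
        level = self.latest.get("level")
        rain = self.latest.get("rain")
        if level is not None and rain is not None:
            # The "if" statement; the "then" part shares data back as a twin.
            if level > self.level_limit and rain > self.rain_limit:
                self.share({"warning": "flood likely", "level": level, "rain": rain})
```

Any other twin can follow the algorithm twin in exactly the same way, which is the symmetry the paragraph above describes.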
There are so many key words in this post. I’m going to concentrate on two: “interact” and “if”. Interact implies “between” - that there are at least two parties involved. A consumer interacts with a producer; a supplier with a customer. Both have agency in the interaction. A customer doesn’t have to accept the supplier’s goods or service. The producer can cut off a consumer if they don’t like their behaviour. The second word is “if”. We started with a programmatic sense of the word as “if - then”. We moved on to an imaginary world of “what - if” and finally I’ll leave you with an “if - only”. “If only” data and twins could cooperate. What transformation could we achieve then?
Helping organizations successfully navigate their information technology initiatives
3 years ago: I find that one of the contextual elements missing from datasets and the data constructs that compose them is the behavior (i.e. algorithms) that operates on them. We treat these as distinctly separate when we publish datasets, and so 'interacting' with the data and its relationships is largely void of any behavior unless you provide it from scratch yourself. When you do, it's not easy for anyone else to leverage it in the future. Making data constructs, their relationships, and available behavior FAIR would be ideal.
Philip Morris
3 years ago: Really good article. Finding data is challenging, especially if you need it to be FAIR. FYI, Google's Dataset Search is a search engine for datasets. Europe also has an initiative called National Access Points covering mobility data, which might be of use.