How to generate complex test data
Andrew Magerman
Requirements Engineer | Software Developer | Event Organizer | Python | CI/CD. I help companies kill complexity.
In 2017 I had the pleasure of working closely with Josef Bösze and Ralph Schibli, both partners at the boutique consultancy itopia.ch. The project I was on involved the production and configuration of synthetic data.
In this case, the customer, a bank, was rolling out a new version of its core software and needed a reliable source of test data.
The rapid advances in machine learning are making true data anonymisation almost impossible
Cloning and anonymising production data was not an option. First and foremost, banks are understandably very protective of their customers' data, and the rapid advances in machine learning are making true anonymisation almost impossible. Secondly, because we were testing a new system with a different data model than the previous one, some historical data simply did not exist.
The solution is synthetic data, i.e. data created from scratch. itopia created a tooling suite, iSynth, which lets you synthesise your test data at scale.
iSynth creates several abstraction layers above the actual data structure by creating a model of your system. Generating test data for the system then becomes an exercise in manipulating the model.
Practically, this means that once the system has been set up and configured, one can generate a wide variety of use cases by manipulating the model objects instead of dealing directly with data. Once you have done the intellectually challenging work of modelling the data structure, you can generate any sort of weird combination accurately and very quickly.
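To make the idea concrete, here is a minimal sketch of model-driven test data generation. iSynth is proprietary and its API is not public, so all names here (`Customer`, `Account`, `to_rows`) are my own illustrative assumptions, not iSynth's actual interface; the point is only that you manipulate model objects and the mapping to flat tables happens once, behind the scenes.

```python
# Hypothetical sketch of model-driven test data generation.
# None of these names come from iSynth; they only illustrate the principle.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Account:
    iban: str
    currency: str = "CHF"
    balance: float = 0.0

@dataclass
class Customer:
    name: str
    accounts: List[Account] = field(default_factory=list)

def to_rows(customer: Customer, customer_id: int) -> dict:
    """Map one model object onto the flat table rows the target system expects.

    This mapping is written once by people who know the schema; after that,
    test designers only ever touch the model objects above.
    """
    return {
        "customers": [(customer_id, customer.name)],
        "accounts": [
            (customer_id, a.iban, a.currency, a.balance)
            for a in customer.accounts
        ],
    }

# Manipulating the model, not the database: a new edge case is one line.
alice = Customer("Alice", [Account("CH93 0076 2011 6238 5295 7")])
rows = to_rows(alice, customer_id=1)
```

The design point is the separation of concerns: the schema knowledge lives in `to_rows`, so adding a customer with, say, ten accounts in three currencies never requires touching SQL.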
The somewhat cumbersome work of mapping the model to actual database tables is the time-consuming part of the endeavour. You need local experts who know the underlying databases and interfaces by heart: what does that particular flag in that column do, exactly? How is data consistency enforced?
More often than not, the availability of these experts is the bottleneck for generating correct test data.
Once the iSynth system is set up, however, you no longer need that expert knowledge on a continuous basis, whereas traditional approaches to test data generation must consult these experts whenever a new requirement appears.
any new requirement is usually met in a matter of hours
The upside to all this is that any new requirement is usually met in a matter of hours, not days, and without any expert knowledge of the underlying systems.
iSynth also lends itself to the automatic generation of test data for training purposes. The use cases we actively implemented generated datasets for dedicated training systems. The beauty of this is that the result is deterministic: once you have defined the use cases, generating all the required data is just one click away.
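The determinism property is easy to illustrate with a seeded generator. This is not iSynth's mechanism (which is not public), just a minimal sketch of the general idea: if the generation process is a pure function of a fixed seed, the same dataset can be reproduced on demand on any training system.

```python
# Minimal sketch of deterministic synthetic data via a seeded RNG.
# The function name and record shape are illustrative assumptions.
import random

def training_dataset(seed: int, size: int) -> list:
    """Regenerate the same training dataset from the same seed, every time."""
    rng = random.Random(seed)  # local RNG: no dependence on global state
    return [
        {"id": i, "balance": round(rng.uniform(0, 10_000), 2)}
        for i in range(size)
    ]

# Two independent runs with the same seed produce identical data.
assert training_dataset(seed=7, size=100) == training_dataset(seed=7, size=100)
```

Because nothing depends on global state or wall-clock time, "one click away" really can mean byte-identical data on every regeneration.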
The other immediately recognisable use case is the generation of test data for load testing (five million records, anybody?). But where iSynth really shines is in generating rare data constellations: perhaps your biggest customer has a really complex setup, with multiple exceptions? No problem for iSynth.
The next stage of development is integrating synthetic data generation into a continuous integration application lifecycle, with unit tests that run against a known set of test data. iSynth can be distributed as a separate Docker instance.
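A CI setup along these lines can be sketched as follows. Again, the generator and assertions are hypothetical stand-ins of my own, not iSynth code: the idea is simply that because the reference dataset is synthetic and seeded, every CI run can regenerate it from scratch and assert against it, with no shared database fixture to drift out of date.

```python
# Hypothetical sketch: unit tests running against regenerated, known test data.
# Function names and the "limit" field are illustrative assumptions.
import random

def known_test_data(seed: int = 1234) -> list:
    """Regenerate the reference dataset fresh for every CI run."""
    rng = random.Random(seed)
    return [
        {"id": i, "limit": rng.randrange(1_000, 50_000, 1_000)}
        for i in range(10)
    ]

def test_limits_within_policy():
    # Stable across CI runs and machines, because the data is seeded:
    # no snapshot files, no shared test database to keep in sync.
    assert all(0 < row["limit"] <= 50_000 for row in known_test_data())

test_limits_within_policy()
```

In a real pipeline the generation step would run inside the Docker instance mentioned above, with the test suite consuming its output.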
I had a most enjoyable experience working with the itopia team; to my knowledge this approach is unparalleled and very promising.
If you have complex requirements for synthetic test data, have a look at iSynth's fact sheet at www.itopia.ch/synthetic-data. I'd be glad to introduce you to either Ralph Schibli or Josef Bösze at itopia.