Using AI to Generate Test Data for Software Application Testing
Bryan Thorell
Talented Healthcare Executive / Technical & Business / Humble / Team Player
Introduction
In software development, one of the critical challenges is ensuring that applications function correctly under various conditions. This requires thorough testing, which in turn depends on the availability of high-quality test data. Traditionally, generating test data has been a time-consuming, manual process, often leading to incomplete or unrealistic datasets. However, the advent of artificial intelligence (AI) has revolutionized the approach to test data generation, making it faster, more accurate, and scalable.
My Scenario
The other day I needed to create lots of test data for a medical application. I wanted to really exercise the APIs and the database so that we ended up with data behind every API and in every DB table for the application. I needed this data to be random, but appropriate for the fields. Several of the fields had to contain values from a discrete list of items.
Generating Diverse Data
I have used AI for many things and mainly use ChatGPT, but I had noticed that Microsoft Copilot is now included with Windows and I hadn’t used it much yet, so I decided to put it to the test. I started by giving it a general description of the data that I wanted. Where a field had a discrete list of values, I included the list by putting “or” between the items. Something like the below:
Please create me 1000 test data items containing unique Last Name, First Name, DOB (between 16 and 80 years old), Sex, and Storage Class (Large or Medium or Small), Weight in standard range for humans, and a Note on 10% of the items. (Working in healthcare, I asked for a bunch of other medical-related statistics, but can’t go into the specifics.)
I was pleased with the results: it created a CSV file with 1000 rows, each containing the data I asked for. I did ask it to refine some of the items, as I was not expecting lab values to the 8th decimal place; I just wanted integers. I guess that I was not descriptive enough in my initial prompt, but that’s the great thing about AI: you can continue refining your request and it maintains the context. I was really pleased that Copilot correctly understood what I wanted and that my discrete lists were random and contained only the values specified.
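For readers who want a repeatable baseline to compare an AI-generated file against, here is a minimal Python sketch that produces the same kind of CSV using only the standard library. It is not what Copilot did internally. The name pools, the "F"/"M" coding for Sex, the 45 to 120 kg weight range, the column names, and the fixed reference date are all assumptions made for illustration.

```python
import csv
import io
import random
from datetime import date, timedelta
from itertools import product

# Hypothetical, deliberately tiny name pools; a real run of 1000 unique
# names would need much larger lists.
FIRST_NAMES = ["Alex", "Jordan", "Maria", "Chen", "Priya", "Sam", "Omar", "Lena"]
LAST_NAMES = ["Smith", "Garcia", "Nguyen", "Patel", "Kim", "Okafor", "Rossi", "Novak"]
STORAGE_CLASSES = ("Large", "Medium", "Small")  # the discrete list from the prompt
SEXES = ("F", "M")  # assumed coding
FIELDS = ["last_name", "first_name", "dob", "sex",
          "storage_class", "weight_kg", "note"]


def generate_rows(n, seed=0, today=date(2024, 1, 15)):
    """Build n rows with unique (last, first) name pairs and a note on 10%."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    all_pairs = list(product(LAST_NAMES, FIRST_NAMES))
    if n > len(all_pairs):
        raise ValueError("name pools are too small for n unique names")
    pairs = rng.sample(all_pairs, n)          # sampling without replacement
    noted = set(rng.sample(range(n), n // 10))  # exactly 10% of rows get a note
    rows = []
    for i, (last, first) in enumerate(pairs):
        # 16*366 days is always at least 16 years; 80*365 days is always
        # less than 80 years, so every age lands in the requested range.
        age_days = rng.randint(16 * 366, 80 * 365)
        rows.append({
            "last_name": last,
            "first_name": first,
            "dob": (today - timedelta(days=age_days)).isoformat(),
            "sex": rng.choice(SEXES),
            "storage_class": rng.choice(STORAGE_CLASSES),
            "weight_kg": rng.randint(45, 120),  # integers, assumed adult range
            "note": f"Synthetic note {i}" if i in noted else "",
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling to_csv(generate_rows(50)) returns CSV text that can be written to a file or posted to an API. Because the generator is seeded, the same seed always yields the same rows, which makes failing tests easier to reproduce.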
Natural Language Processing for Text Data Generation
I have also had to test applications that deal with textual data, such as EHRs, document management systems, and customer support applications. NLP models can generate realistic text data for testing. These models can create synthetic documents or user queries that closely resemble real-world inputs.
Recently I asked an AI to generate ten 3-page PDF files that I could attach to a document management system. It asked me what topics I would like the PDFs to be about and then proposed 10 titles on those subjects for the documents it would create. Because the PDFs were on topic, this was much more realistic test data than if I had grabbed some lorem ipsum documents from the internet.
Addressing Data Privacy and Security with AI
One of the challenges in generating test data, especially when using real production data, is ensuring that sensitive information is not exposed. AI can help mitigate this risk through data anonymization techniques. These techniques involve transforming the original data in a way that removes or obfuscates personally identifiable information (PII) or protected health information (PHI) while retaining the data’s overall structure and usefulness for testing.
For example, an AI model can automatically detect and anonymize data such as names, addresses, or credit card numbers in a dataset. This allows testers to use realistic data without compromising user privacy or violating data protection regulations.
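To make the idea concrete, here is a minimal Python sketch of the masking step. It swaps in simple rules (a regular expression and a salted hash) for the AI-based detection described above, so it only catches what the rules describe. The field names, the "ID-" prefix, the salt value, and the "[CARD]" placeholder are assumptions for illustration. Note also that salted hashing is pseudonymization rather than full anonymization, so it should not be treated as sufficient for regulatory compliance on its own.

```python
import hashlib
import re

# Matches 13 to 16 digits, optionally separated by single spaces or hyphens.
# This is a rough pattern for card-like numbers, not a validator.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def pseudonym(value, salt="test-salt"):
    """Replace a value with a stable token so joins across tables still work."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "ID-" + digest[:10]


def mask_cards(text):
    """Replace anything that looks like a card number with a placeholder."""
    return CARD_RE.sub("[CARD]", text)


def scrub_record(record,
                 name_fields=("first_name", "last_name"),
                 text_fields=("note",)):
    """Return a copy of the record with names tokenized and free text masked."""
    clean = dict(record)  # leave the caller's record untouched
    for field in name_fields:
        if clean.get(field):
            clean[field] = pseudonym(clean[field])
    for field in text_fields:
        if clean.get(field):
            clean[field] = mask_cards(clean[field])
    return clean
```

The same input name always maps to the same token, so the structure of the data (for example, one patient appearing in several tables) survives the scrub while the original value does not.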
Benefits of AI-Driven Test Data Generation
Efficiency and Scalability
AI significantly reduces the time required to generate test data, especially for large-scale or complex applications. What once took days or weeks can now be accomplished in a matter of hours. Additionally, AI can generate vast amounts of data, making it ideal for testing applications that require large datasets, such as performance testing or load testing.
Realism and Diversity
AI-generated test data is often more realistic and diverse than manually created data. Machine learning models can capture and replicate the subtle nuances and variations in real-world data, leading to more accurate testing outcomes. This diversity also helps in uncovering hidden bugs or vulnerabilities that might not be detected with less varied test data.
Focused Edge Case Testing
AI can be particularly effective in generating test data for edge cases: those rare but critical scenarios that can cause an application to fail. By focusing on these scenarios, AI-generated data helps verify that the application is robust and can handle unexpected or extreme conditions without breaking.
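Whatever tool produces the bulk data, it helps to keep a small hand-specified set of edge cases alongside it. The Python sketch below uses classic boundary-value analysis rather than AI. The 16 to 80 age range and the Storage Class list come from my earlier prompt. The edge strings, the row layout, and the expect_valid flag are assumptions for illustration, including the assumption that the only rule for a name is that it is not blank.

```python
# Assumed examples of awkward but plausible name inputs.
EDGE_STRINGS = ["", " ", "O'Brien", "Zoë", "x" * 256]


def boundary_values(low, high):
    """Values just outside, on, and just inside an inclusive integer range."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def edge_case_rows(age_range=(16, 80),
                   storage_classes=("Large", "Medium", "Small")):
    """Build rows that probe range limits, odd strings, and list membership."""
    low, high = age_range
    rows = []
    # Ages around both ends of the allowed range.
    for age in boundary_values(low, high):
        rows.append({"last_name": "Boundary", "age": age,
                     "storage_class": storage_classes[0],
                     "expect_valid": low <= age <= high})
    # Unusual name strings; assumes a name is valid when it is not blank.
    for name in EDGE_STRINGS:
        rows.append({"last_name": name, "age": low,
                     "storage_class": storage_classes[0],
                     "expect_valid": bool(name.strip())})
    # A value that is not in the discrete list should be rejected.
    rows.append({"last_name": "Boundary", "age": low,
                 "storage_class": "Extra Large",
                 "expect_valid": False})
    return rows
```

Each row carries the outcome the application is expected to produce, so a test can post the row to the API and compare the response against expect_valid.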
Continuous Improvement
AI models can learn and improve over time. As new data becomes available or as testing uncovers new bugs or vulnerabilities, the AI can adjust its data generation process to focus on these areas. This continuous learning capability ensures that the test data remains relevant and effective throughout the development lifecycle.
Conclusion
AI is transforming the way test data is generated for software application testing. By leveraging machine learning, natural language processing, and other AI techniques, organizations can create test data that is more realistic, diverse, and scalable than ever before. While challenges remain, particularly in the areas of data quality and ethical considerations, the benefits of AI-driven test data generation are clear. As AI technology continues to evolve, it will play an increasingly important role in ensuring the quality and reliability of software applications in a rapidly changing digital landscape.