Synthetic data generation for ML models with LLMs - An application of ChatGPT to automation
Trung Nguyen
Head of ML | Data Science | AI Advisor | MLOps | NLP | LLMs | GenAI | EdTech
Machine-learning-powered applications are data-hungry
Content generation is perhaps one of the most popular applications of generative AI, as it assists writers, marketers, and creators in generating ideas, producing content, and crafting compelling narratives.
Though less widely known, data science and machine learning (ML) can also benefit from the ability of Large Language Models (LLMs) to generate data. In many ML-powered applications, the main bottleneck to achieving high performance is obtaining high-quality data, which is expensive, especially when subject matter experts (SMEs) are involved.
As an example, during my time at Suncorp, I worked on a project to automate quality assurance (QA). Our goal was to automatically verify compliance with certain regulatory and process requirements for all claims. Manual quality assurance could only be performed on fewer than 2% of claims because of its high cost. We formulated certain checks as a text classification problem, which required a dataset of call transcripts and the corresponding QA questions, with labels provided by the claim SMEs.
We estimated the cost to be in the millions, with the majority of the expenses attributed to the involvement of SMEs. However, the cost was not our only challenge. We also faced limitations in accessing the SMEs due to their commitments to their primary work duties.
How can LLMs help?
In the QA task mentioned above, the data annotation or labelling process would typically require SMEs to listen to the call recording and/or read the call transcript and indicate the outcome of each question. These questions could include whether the customer's identity was verified or if they were satisfied with the response. Interestingly, LLMs have the capability to generate not just the call transcript but also the answers to specific QA questions.
Below are the prompts I used with ChatGPT, where the context is a phone call between a banking customer and a customer support specialist.
Prompt 1: Generate the call transcript with the required details.
Generate a call transcript between a bank customer and a customer service specialist/agent
Customer details:
- name: Trung Nguyen
- DOB: 11/11/1991
- Account #: 12345678
- The agent first greets the customer and asks 'How can I help'.
- The customer briefly describes his issue: bank account was overcharged
- The agent asks the customer to verify his identity for his name, DOB, account #.
- The customer confirms the details.
- The agent looks into the issue.
- The agent explains the issue and gives his resolution.
- The customer confirms he is happy with the resolution.
Here, I provided my details directly in the prompt. However, this can easily be transformed into a template where the details of different customers and their inquiries can be used to generate a dataset with a larger number of examples.
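As a rough illustration of that templating idea, the sketch below turns Prompt 1 into a reusable template using Python's standard `string.Template`. The placeholder names (`name`, `dob`, `account`, `issue`), the `build_prompts` helper, and the second customer record are assumptions made up for this example, not part of the original project.

```python
from string import Template

# Prompt 1 with the customer-specific details replaced by placeholders.
# The placeholder names are hypothetical; rename them to suit your data.
PROMPT_TEMPLATE = Template(
    "Generate a call transcript between a bank customer and a customer service specialist/agent\n"
    "Customer details:\n"
    "- name: $name\n"
    "- DOB: $dob\n"
    "- Account #: $account\n"
    "- The agent first greets the customer and asks 'How can I help'.\n"
    "- The customer briefly describes his issue: $issue\n"
    "- The agent asks the customer to verify his identity for his name, DOB, account #.\n"
    "- The customer confirms the details.\n"
    "- The agent looks into the issue.\n"
    "- The agent explains the issue and gives his resolution.\n"
    "- The customer confirms he is happy with the resolution.\n"
)

# Illustrative records; the second customer is invented for this sketch.
customers = [
    {"name": "Trung Nguyen", "dob": "11/11/1991", "account": "12345678",
     "issue": "bank account was overcharged"},
    {"name": "Alex Tran", "dob": "02/03/1985", "account": "87654321",
     "issue": "debit card was declined"},
]


def build_prompts(records):
    """Render one transcript-generation prompt per customer record."""
    # substitute() raises KeyError if a record is missing a placeholder value.
    return [PROMPT_TEMPLATE.substitute(record) for record in records]


prompts = build_prompts(customers)
print(prompts[1])
```

Each rendered prompt can then be sent to the model in the same way as the hand-written one.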
And here is the response from ChatGPT.
Agent: Thank you for calling our bank's customer service. My name is Lisa. How can I help you today?
Customer: Hi Lisa, my name is Trung Nguyen. I have an issue with my bank account. It seems like I've been overcharged.
Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?
Customer: Sure, my date of birth is November 11, 1991, and my account number is 12345678.
Agent: Thank you for confirming, Mr. Nguyen. Let me check your account and investigate the overcharge issue.
[Agent puts the customer on hold for a brief moment]
Agent: Mr. Nguyen, thank you for your patience. I have looked into your account, and I can see the issue with the overcharge. It appears that there was an error in processing a transaction, resulting in the incorrect amount being deducted from your account.
Customer: I'm glad you found the issue. What can be done to resolve it?
Agent: I apologize for the inconvenience caused. I have already reversed the incorrect charge, and the amount will be credited back to your account within 24 hours. Additionally, to make up for the inconvenience, we will also be refunding any associated fees that were charged due to this error.
Customer: That sounds fair to me. Thank you for taking care of it, Lisa. I appreciate your help.
Agent: You're welcome, Mr. Nguyen. I'm glad I could assist you. Is there anything else I can help you with today?
Customer: No, that will be all. I'm satisfied with the resolution. Thank you once again, Lisa.
Agent: It was my pleasure assisting you, Mr. Nguyen. If you have any further questions or concerns, please don't hesitate to reach out to us. Have a great day!
Customer: Thank you, Lisa. You too. Goodbye.
Agent: Goodbye, Mr. Nguyen. Take care.
Prompt 2: Applying ChatGPT to extract details
Script:
```
Agent: Thank you for calling our bank's customer service. My name is Lisa. How can I help you today?
...
```
Answer the following questions based on the above script enclosed in ```.
- 'Did the agent verify the customer's DOB?'
- 'Did the agent verify the customer's account number?'
- 'Was the customer happy with the outcome?'
- 'Did the agent inform the customer of the exact amount involved?'
For each question, answer as a JSON object with the following format:
{
'question': the question,
'answer': Yes or No,
'evidence': the text in the script where this happened
}
With the script generated by the first prompt, I asked ChatGPT to answer each of the QA questions, giving a Yes/No outcome as well as the text in the transcript that supports the answer.
Here is the response.
[
{
"question": "Did the agent verify the customer's DOB?",
"answer": "Yes",
"evidence": "Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?\n\nCustomer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."
},
{
"question": "Did the agent verify the customer's account number?",
"answer": "Yes",
"evidence": "Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?\n\nCustomer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."
},
{
"question": "Was the customer happy with the outcome?",
"answer": "Yes",
"evidence": "Customer: That sounds fair to me. Thank you for taking care of it, Lisa. I appreciate your help."
},
{
"question": "Did the agent inform the customer of the exact amount involved?",
"answer": "No",
"evidence": "None"
}
]
We see that ChatGPT gave the correct answers for all four questions.
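When this is done at scale, the answers cannot all be checked by eye. One simple sanity check, sketched below, is to parse the response and confirm that the quoted evidence for every "Yes" answer really appears in the transcript. This assumes the model returns a valid JSON array of objects with the `question`, `answer` and `evidence` keys requested in the prompt; the `check_labels` function, the `evidence_found` field, and the third record in the example are hypothetical and only serve to illustrate the idea.

```python
import json


def check_labels(transcript: str, response_text: str) -> list[dict]:
    """Parse the model's JSON answers and flag evidence not found in the transcript."""
    records = json.loads(response_text)
    checked = []
    for record in records:
        answer = record["answer"].strip().lower()
        if answer not in {"yes", "no"}:
            raise ValueError(f"Unexpected answer: {record['answer']!r}")
        # Evidence may quote several turns separated by newlines; check each turn on its own.
        turns = [t.strip() for t in record["evidence"].splitlines() if t.strip()]
        if answer == "yes":
            grounded = bool(turns) and all(turn in transcript for turn in turns)
        else:
            grounded = None  # a "No" answer has no quote to verify
        checked.append({**record, "answer": answer, "evidence_found": grounded})
    return checked


# Two turns taken from the transcript above.
transcript = (
    "Customer: Sure, my date of birth is November 11, 1991, and my account number is 12345678.\n"
    "Customer: That sounds fair to me. Thank you for taking care of it, Lisa. I appreciate your help.\n"
)

# The third record is deliberately wrong (invented for this sketch) to show how it gets flagged.
response_text = json.dumps([
    {"question": "Did the agent verify the customer's DOB?", "answer": "Yes",
     "evidence": "Customer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."},
    {"question": "Did the agent inform the customer of the exact amount involved?", "answer": "No",
     "evidence": "None"},
    {"question": "Was the customer happy with the outcome?", "answer": "Yes",
     "evidence": "Customer: I am very unhappy."},
])

results = check_labels(transcript, response_text)
for row in results:
    print(row["question"], row["answer"], row["evidence_found"])
```

Rows whose evidence cannot be found are natural candidates for manual review by an SME.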
What are the implications?
We can observe that ChatGPT answered the second prompt correctly without any task-specific training. This suggests that it can already be used as a model for QA automation. So, why do we need to generate data as described in the first prompt?
Recall that we provided ChatGPT with specific details to generate as part of its script. By configuring these details for different scenarios that we anticipate, we can generate a synthetic dataset that emulates the patterns we expect to encounter in our actual data.
The workflow can be summarized as follows:
By following this workflow, we could 1) significantly increase the amount of data for training and reduce the time and cost for manual data annotation and 2) enhance the model's ability to handle various scenarios and improve its performance on real-world data.
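One possible reading of this workflow, in code, is sketched below: generate a transcript per scenario with the first prompt, label it with the second prompt, and collect (transcript, question, label) rows for training a classifier. Everything here is an assumption for illustration. `complete` stands for whatever function sends a prompt to an LLM and returns its text, `fake_complete` is an offline stand-in so the sketch runs without an API, and the prompt wording is a shortened paraphrase of the prompts above.

```python
import json
from typing import Callable

QA_QUESTIONS = [
    "Did the agent verify the customer's DOB?",
    "Did the agent verify the customer's account number?",
    "Was the customer happy with the outcome?",
]


def transcript_prompt(scenario: dict) -> str:
    """Shortened paraphrase of Prompt 1 for a given scenario."""
    return (
        "Generate a call transcript between a bank customer and a customer service agent\n"
        f"- name: {scenario['name']}\n"
        f"- The customer briefly describes his issue: {scenario['issue']}\n"
    )


def labelling_prompt(transcript: str, questions: list[str]) -> str:
    """Shortened paraphrase of Prompt 2, asking for a JSON array of answers."""
    fence = "`" * 3  # delimiter around the script, as in the original prompt
    bullets = "\n".join(f"- '{q}'" for q in questions)
    return (
        f"Script:\n{fence}\n{transcript}\n{fence}\n"
        f"Answer the following questions based on the above script enclosed in {fence}.\n"
        f"{bullets}\n"
        "Return a JSON array with one object per question, each with the keys "
        "'question', 'answer' (Yes or No) and 'evidence'."
    )


def build_dataset(scenarios: list[dict], complete: Callable[[str], str]) -> list[dict]:
    """Generate one transcript per scenario, label it, and flatten into training rows."""
    rows = []
    for scenario in scenarios:
        transcript = complete(transcript_prompt(scenario))
        answers = json.loads(complete(labelling_prompt(transcript, QA_QUESTIONS)))
        for item in answers:
            rows.append({
                "transcript": transcript,
                "question": item["question"],
                "label": item["answer"],
            })
    return rows


def fake_complete(prompt: str) -> str:
    """Offline stand-in for a real LLM call; returns canned text."""
    if prompt.startswith("Script:"):
        return json.dumps(
            [{"question": q, "answer": "Yes", "evidence": ""} for q in QA_QUESTIONS]
        )
    return "Agent: How can I help?\nCustomer: My account was overcharged."


# Hypothetical scenarios; in practice these would cover the cases expected in real calls.
scenarios = [
    {"name": "Trung Nguyen", "issue": "bank account was overcharged"},
    {"name": "Alex Tran", "issue": "debit card was declined"},
]
rows = build_dataset(scenarios, fake_complete)
print(len(rows), rows[0])
```

In a real setup, `fake_complete` would be replaced by a call to the chosen LLM, and the resulting rows would be used to train the text classification model.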