
Synthetic data generation for ML models with LLMs - An application of ChatGPT to automation

Trung Nguyen

Head of ML | Data Science | AI Advisor | MLOps | NLP | LLMs | GenAI | EdTech

Published: May 27, 2023

Machine-learning-powered applications are data-hungry

Content generation is perhaps one of the most popular applications of generative AI, as it assists writers, marketers, and creators in generating ideas and content and in crafting compelling narratives.

Though not as widely known, data science and machine learning (ML) are other domains that can benefit from the ability of Large Language Models (LLMs) to generate data. In many ML-powered applications, the main bottleneck to achieving high performance is obtaining high-quality data, which is expensive, especially when subject matter experts (SMEs) are involved.

As an example, during my time at Suncorp, I worked on a project to automate quality assurance (QA). Our goal was to automatically verify compliance with certain regulatory and process requirements for all claims. Manual quality assurance could only be performed on less than 2% of the claims due to its high cost. We formulated certain checks as a text classification problem, which required a dataset of call transcripts and the corresponding QA questions with labels provided by the claim SMEs.

We estimated the cost to be in the millions, with the majority of the expenses attributed to the involvement of SMEs. However, the cost was not our only challenge. We also faced limitations in accessing the SMEs due to their commitments to their primary work duties.

How can LLMs help?

In the QA task mentioned above, the data annotation or labelling process would typically require SMEs to listen to the call recording and/or read the call transcript and indicate the outcome of each question. These questions could include whether the customer's identity was verified or if they were satisfied with the response. Interestingly, LLMs have the capability to generate not just the call transcript but also the answers to specific QA questions.

Below are the prompts I used with ChatGPT, where the context is a phone call between a banking customer and a customer support specialist.

Prompt 1: Generate the call transcript with the required details.

Generate a call transcript between a bank customer and a customer service specialist/agent
Customer details:
- name: Trung Nguyen
- DOB: 11/11/1991
- Account #: 12345678

- The agent first greets the customer and asks 'How can I help'.
- The customer briefly describes his issue: bank account was overcharged
- The agent asks the customer to verify his identity: his name, DOB, account #.
- The customer confirms the details.
- The agent looks into the issue.
- The agent explains the issue and gives his resolution.
- The customer confirms he is happy with the resolution.

Here, I provided my details directly in the prompt. However, this can easily be transformed into a template where the details of different customers and their inquiries can be used to generate a dataset with a larger number of examples.
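A minimal sketch of that templating idea, assuming Python and only the standard library. The `Customer` class, `PROMPT_TEMPLATE`, and `build_prompt` are hypothetical names introduced for illustration, and the second customer is made up. Sending each prompt to ChatGPT is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Hypothetical container for the per-customer details in the prompt."""
    name: str
    dob: str
    account_number: str
    issue: str


# Prompt 1 with the hard-coded details replaced by placeholders.
PROMPT_TEMPLATE = """\
Generate a call transcript between a bank customer and a customer service specialist/agent
Customer details:
- name: {name}
- DOB: {dob}
- Account #: {account_number}

- The agent first greets the customer and asks 'How can I help'.
- The customer briefly describes the issue: {issue}
- The agent asks the customer to verify their identity: name, DOB, account #.
- The customer confirms the details.
- The agent looks into the issue.
- The agent explains the issue and gives a resolution.
- The customer confirms they are happy with the resolution.
"""


def build_prompt(customer: Customer) -> str:
    """Fill the template with one customer's details."""
    return PROMPT_TEMPLATE.format(
        name=customer.name,
        dob=customer.dob,
        account_number=customer.account_number,
        issue=customer.issue,
    )


customers = [
    Customer("Trung Nguyen", "11/11/1991", "12345678", "bank account was overcharged"),
    # Made-up second customer, purely for illustration.
    Customer("Jane Doe", "02/03/1985", "87654321", "debit card was declined"),
]
prompts = [build_prompt(c) for c in customers]
print(prompts[1])
```

Each entry in `prompts` can then be submitted separately, so one template yields as many transcripts as there are customer records.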

And here is the response from ChatGPT.

Agent: Thank you for calling our bank's customer service. My name is Lisa. How can I help you today?

Customer: Hi Lisa, my name is Trung Nguyen. I have an issue with my bank account. It seems like I've been overcharged.

Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?

Customer: Sure, my date of birth is November 11, 1991, and my account number is 12345678.

Agent: Thank you for confirming, Mr. Nguyen. Let me check your account and investigate the overcharge issue.

[Agent puts the customer on hold for a brief moment]

Agent: Mr. Nguyen, thank you for your patience. I have looked into your account, and I can see the issue with the overcharge. It appears that there was an error in processing a transaction, resulting in the incorrect amount being deducted from your account.

Customer: I'm glad you found the issue. What can be done to resolve it?

Agent: I apologize for the inconvenience caused. I have already reversed the incorrect charge, and the amount will be credited back to your account within 24 hours. Additionally, to make up for the inconvenience, we will also be refunding any associated fees that were charged due to this error.

Customer: That sounds fair to me. Thank you for taking care of it, Lisa. I appreciate your help.

Agent: You're welcome, Mr. Nguyen. I'm glad I could assist you. Is there anything else I can help you with today?

Customer: No, that will be all. I'm satisfied with the resolution. Thank you once again, Lisa.

Agent: It was my pleasure assisting you, Mr. Nguyen. If you have any further questions or concerns, please don't hesitate to reach out to us. Have a great day!

Customer: Thank you, Lisa. You too. Goodbye.

Agent: Goodbye, Mr. Nguyen. Take care.

Prompt 2: Applying ChatGPT to extract details

Script:
```
Agent: Thank you for calling our bank's customer service. My name is Lisa. How can I help you today?
...
```

Answer the following questions based on the above script enclosed in ```.  

- 'Did the agent verify the customer's DOB?'
- 'Did the agent verify the customer's account number?'
- 'Was the customer happy with the outcome?'
- 'Did the agent inform the customer of the exact amount involved?'

For each question, answer as a JSON object with the following format:
{
   'question': the question,
   'answer': Yes or No,
   'evidence': the text in the script where this happened
}

With the generated script from the 1st prompt, I asked ChatGPT to generate the answer for each of the QA questions, which includes a Yes/No outcome as well as the text evidence in the transcript supporting the answer.

Here is the response.

[
{
"question": "Did the agent verify the customer's DOB?",
"answer": "Yes",
"evidence": "Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?\n\nCustomer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."
},
{
"question": "Did the agent verify the customer's account number?",
"answer": "Yes",
"evidence": "Agent: I'm sorry to hear that, Mr. Nguyen. In order to assist you further, I'll need to verify your identity. Can you please provide me with your date of birth and account number?\n\nCustomer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."
},
{
"question": "Was the customer happy with the outcome?",
"answer": "Yes",
"evidence": "Customer: That sounds fair to me. Thank you for taking care of it, Lisa. I appreciate your help."
},
{
"question": "Did the agent inform the customer of the exact amount involved?",
"answer": "No",
"evidence": "None"
}
]

We see that ChatGPT gave the correct answers for all four questions.
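To use such a response as labels, it first has to be parsed and sanity-checked. The following is a rough sketch in Python, assuming the reply is either a JSON array or a comma-separated run of JSON objects. The helper names `parse_qa_response` and `evidence_found` are hypothetical, and the transcript and reply below are shortened excerpts of the ones above.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = {"question", "answer", "evidence"}


def parse_qa_response(raw: str) -> List[Dict[str, str]]:
    """Parse the model's reply into a list of question/answer/evidence records."""
    text = raw.strip().rstrip(",")
    # Assumption: a reply without brackets is a comma-separated run of objects.
    if not text.startswith("["):
        text = "[" + text + "]"
    records = json.loads(text)
    for record in records:
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if record["answer"] not in ("Yes", "No"):
            raise ValueError(f"unexpected answer: {record['answer']!r}")
    return records


def evidence_found(record: Dict[str, str], transcript: str) -> bool:
    """Check that the quoted evidence really appears in the transcript."""
    if record["evidence"] == "None":
        return True  # nothing was quoted, so there is nothing to check

    def squash(s: str) -> str:
        # Collapse whitespace so line breaks between turns do not matter.
        return " ".join(s.split())

    return squash(record["evidence"]) in squash(transcript)


transcript = (
    "Agent: Can you please provide me with your date of birth and account number?\n\n"
    "Customer: Sure, my date of birth is November 11, 1991, "
    "and my account number is 12345678."
)

raw_reply = r'''
{
"question": "Did the agent verify the customer's DOB?",
"answer": "Yes",
"evidence": "Agent: Can you please provide me with your date of birth and account number?\n\nCustomer: Sure, my date of birth is November 11, 1991, and my account number is 12345678."
},
{
"question": "Did the agent inform the customer of the exact amount involved?",
"answer": "No",
"evidence": "None"
}
'''

records = parse_qa_response(raw_reply)
for record in records:
    print(record["question"], record["answer"], evidence_found(record, transcript))
```

Checking the quoted evidence against the transcript is a cheap guard against answers that cite text the script never contained.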

What are the implications?

We can observe that ChatGPT provided the correct answers for the second prompt without any task-specific training. This suggests that it can already be used as a model for QA automation. So, why do we need to generate data as described in the first prompt?

Recall that we provided ChatGPT with specific details to generate as part of its script. By configuring these details for different scenarios that we anticipate, we can generate a synthetic dataset that emulates the patterns we expect to encounter in our actual data.
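One way to picture this configuration step, as a sketch rather than the method used in the project: switch individual script steps on or off and record the expected answer to each QA question at the same time. The `CHECKS` mapping and `build_scenarios` function are hypothetical, and the step wording is adapted from Prompt 1.

```python
from itertools import product

# Hypothetical mapping from a QA question to the script step that satisfies it.
CHECKS = {
    "Did the agent verify the customer's DOB?":
        "The agent asks the customer to verify his DOB.",
    "Did the agent verify the customer's account number?":
        "The agent asks the customer to verify his account #.",
}


def build_scenarios():
    """Enumerate every on/off combination of the steps with its expected labels."""
    scenarios = []
    for flags in product([True, False], repeat=len(CHECKS)):
        # Keep only the steps that are switched on in this scenario.
        steps = [step for step, on in zip(CHECKS.values(), flags) if on]
        # The label for each question follows directly from the flag.
        labels = {q: ("Yes" if on else "No") for q, on in zip(CHECKS, flags)}
        scenarios.append({"steps": steps, "labels": labels})
    return scenarios


scenarios = build_scenarios()
for scenario in scenarios:
    print(scenario["steps"], scenario["labels"])
```

Because the labels come from the configuration rather than from a second model call, they are known before the transcript is generated. The generated transcript may still deviate from the requested steps, so spot-checking a sample remains sensible.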

The workflow can be summarized as follows:

1. Create a template.
2. Generate data and its labels based on the actual values used with the template.
3. Train the model using a combination of the generated data and real (historical) data.
4. Evaluate the performance of the model on real data.

By following this workflow, we could 1) significantly increase the amount of data for training and reduce the time and cost for manual data annotation and 2) enhance the model's ability to handle various scenarios and improve its performance on real-world data.
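The last two steps of the workflow imply a particular data split: synthetic examples go into training only, while evaluation uses held-out real examples. A minimal sketch of that split, assuming each example is a dict with a `source` field; the `build_splits` function, the field names, and the toy data are all hypothetical. Model training itself is not shown.

```python
import random


def build_splits(real, synthetic, test_fraction=0.2, seed=0):
    """Hold out part of the real data for testing; train on the rest plus synthetic."""
    rng = random.Random(seed)
    shuffled = list(real)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test = shuffled[:n_test]                  # real examples only
    train = shuffled[n_test:] + list(synthetic)
    rng.shuffle(train)                        # mix real and synthetic examples
    return train, test


# Toy placeholder data, not taken from any real dataset.
real = [{"id": i, "source": "real"} for i in range(10)]
synthetic = [{"id": i, "source": "synthetic"} for i in range(5)]

train, test = build_splits(real, synthetic)
print(len(train), len(test))
```

Keeping synthetic examples out of the test set means the reported performance reflects real transcripts rather than the template's own patterns.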
