Generative AI for tabular data explanation: prompt limit is not a limit
INTRODUCTION
Generative AI, and in particular large language models (LLMs), has been used experimentally to summarise texts, extract keywords from text, navigate text in natural language, and more.
Another capability of LLMs is the ability to navigate complex data, extract useful information from it, and arrange that information in a specific required shape.
For example, consider the following scenario: you are a business analyst working for a retail company and you need to segment your customers based on annual purchase data. In the traditional world (what happened until the beginning of 2023, before the generative AI revolution exploded), the standard approach to this business question looked similar to the following:
For complex cases, this can be a project lasting several months, involving experienced data scientists, business analysts, data architects, data engineers, testers, and UI experts.
LLMs open the door to a parallel path: they can segment customer data in a simple way, provided we define a governed approach and smartly overcome some limitations (for example, prompt token size).
LARGE TABULAR DATA NAVIGATION: APPROACHES
Token limits in model implementations restrict the number of tokens processed in a single interaction to ensure efficient performance. The relationship between a token and a word is the following:
1 token ≈ ¾ of a word, so 100 tokens ≈ 75 words.
For example, GPT-3 has a 4,096-token limit, GPT-4 (8K) has an 8,192-token limit and GPT-4 (32K) has a 32,768-token limit.
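As a quick sketch, the rule of thumb above can be turned into a rough token estimator (real tokenizers vary by model, so this is only for sizing, not for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~75-words-per-100-tokens rule of thumb.

    A model's real tokenizer should be used for exact counts; this heuristic
    is only for quickly judging whether data fits in a prompt.
    """
    words = len(text.split())
    return round(words / 0.75)  # ~1.33 tokens per word

print(estimate_tokens("segment customers by annual purchases"))  # → 7
```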
That said, going back to our customer segmentation use case, we have two scenarios:
For case 1, the approach is simpler and can be summarised in the following steps:
The image below shows a simple workflow for scenario 1.
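The scenario 1 workflow boils down to assembling one prompt that carries context, task, output format and the full dataset. A minimal sketch (file contents and column names here are hypothetical, not the article's actual data):

```python
# Hypothetical sample of annual purchase data (in practice, read from a file).
RAW_CSV = """customer_id,annual_spend,orders
C001,1200,15
C002,300,4
C003,5400,60
"""

def build_segmentation_prompt(raw_csv: str) -> str:
    """Assemble a scenario-1 prompt: context + task + output format + data."""
    return (
        "You are a retail analytics expert.\n"
        "Segment the customers below into groups, assigning every record to "
        "exactly one group, and explain each group.\n"
        "Return the result in CSV format with columns: customer_id,group.\n\n"
        "Data:\n" + raw_csv
    )

prompt = build_segmentation_prompt(RAW_CSV)
print(prompt)
```

The prompt string is then sent to the model as-is; pinning the output columns up front keeps the answer machine-readable.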
For case 2, the scenario is more complex: we need a way to go beyond the prompt size limit. Since one capability of LLMs is generating programming code, what we can do is ask the model to generate Python code that segments our data and writes the evidence to a CSV file.
Once we have this outcome, we can run the Python code, get the output and analyse it in an Excel file: job done!
After a test phase, this code could be injected into a production workload and used in day-to-day analyses.
The image below shows a simple workflow for scenario 2.
Demo
Consider the following example of simulated customer purchase data, generated by an LLM running on the IBM watsonx product:
Let me simulate the two scenarios described above (data small enough to inject into a single prompt, and data too large for a single prompt).
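Choosing between the two scenarios can itself be automated from a rough token estimate. A sketch, assuming the 4,096-token limit mentioned earlier and a hypothetical overhead budget for the instructions and the model's answer:

```python
def choose_scenario(raw_data: str, token_limit: int = 4096,
                    prompt_overhead: int = 500) -> int:
    """Pick scenario 1 (inline the data) or 2 (generate code) by rough size.

    Uses the ~1.33 tokens-per-word heuristic; prompt_overhead reserves room
    for instructions and the model's answer (the value is an assumption).
    """
    estimated = round(len(raw_data.split()) / 0.75)
    return 1 if estimated + prompt_overhead <= token_limit else 2

print(choose_scenario("small dataset " * 10))  # fits in one prompt → 1
print(choose_scenario("row of data " * 5000))  # too large → 2
```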
SCENARIO 1
The idea of scenario 1 is to ask the LLM to generate a segmentation of my customers, based on purchase data, injecting all the data into my prompt. As per best practices, I will organise the prompt with a context, asking the model to act as a retail expert and to segment the data given as input. Another best practice is to ask the model to arrange the output in a standard format: this is a crucial step if you need to integrate the model's output into a production workload, avoiding data management exceptions (for instance, data parsing errors).
Below is a prompt that answers this need:
We are asking the model to organise the data into groups, with the constraint that each record must belong to exactly one group, and we are asking for an explanation of each group. We are also asking for CSV output format.
Below is the output of the model:
As you can see, the model identified 4 groups, with the following auto-generated explanation:
This is a simple approach to consider if you have a small dataset that fits in a single prompt.
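Because the prompt pinned the output to CSV, the downstream parsing stays trivial. A sketch of consuming the model's answer (the group names here are illustrative placeholders, not the model's actual output):

```python
import csv
import io
from collections import defaultdict

# Illustrative model answer in the requested CSV shape (not actual results).
MODEL_OUTPUT = """customer_id,group
C001,regular
C002,occasional
C003,high_value
"""

def groups_from_csv(text: str) -> dict:
    """Parse the model's CSV answer into {group: [customer_ids]}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["group"]].append(row["customer_id"])
    return dict(groups)

print(groups_from_csv(MODEL_OUTPUT))
```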
SCENARIO 2
In this scenario, the objective is to analyse data whose size is not compatible with the prompt token limit.
The approach here changes: we will ask the model to generate Python code that applies a traditional unsupervised AI approach (a clustering algorithm) to cluster our data and writes the output to a CSV file.
Below is the prompt I defined to generate the Python code that groups customers:
As you can see, the prompt just provides the structure of a record and a single example record. Also, the request is to load the data from a file and generate a new output file, named "clustering_output.csv".
Below is the output from the model:
As you can see, the model includes a preprocessing step, where categorical columns (like "Type of appliance") are converted into numerical values, to match the clustering algorithm's expectations. The KMeans algorithm is then used to produce 3 clusters, and a new file ("clustering_output.csv") is generated. All you need is a Python environment to run this code and get the output. Naturally, I did exactly that; below are the results:
As you can see from the picture (just a portion of the results), each record has been assigned to a cluster ID (from 0 to 2). Your first customer segmentation is now done!
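The generated code typically follows the shape sketched below: encode categorical columns, cluster, write "clustering_output.csv". Column names and sample rows are hypothetical, and a minimal stand-alone KMeans replaces scikit-learn's implementation here only to keep the sketch self-contained:

```python
import csv
import random

def encode_categories(values):
    """Map categorical labels to integer codes (the preprocessing step)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def kmeans(points, k=3, iters=20, seed=0):
    """Minimal KMeans on numeric feature vectors (stand-in for sklearn's)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical input rows: (appliance type, annual spend).
rows = [("fridge", 1200), ("tv", 300), ("fridge", 5400),
        ("oven", 250), ("tv", 5100), ("oven", 1100)]
types = encode_categories([r[0] for r in rows])
points = [[t, spend] for t, (_, spend) in zip(types, rows)]
labels = kmeans(points, k=3)

# Write the clustered records out, as the generated code does.
with open("clustering_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["appliance", "annual_spend", "cluster"])
    for (appliance, spend), label in zip(rows, labels):
        writer.writerow([appliance, spend, label])
```

In production you would keep the model-generated scikit-learn version; the point of the sketch is only the pipeline shape (encode, cluster, export).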
CONCLUSIONS
In this article I explained how generative AI can be used to interpret tabular data, also sharing an approach to handle large datasets using the code generation capability of large language models.
This example can be "well-architected" and injected into a production workload, opening the door to a truly new data exploration era.
#generativeAI #segmentation #IBM #watsonx #tabulardataanalysis #beyondPromptLimit #syntheticdata