Generative AI for tabular data explanation: prompt limit is not a limit
INTRODUCTION
Generative AI, and in particular large language models (LLMs), has been used experimentally to summarise texts, extract keywords from text, navigate text in natural language, and more.
Another capability of LLMs is the ability to navigate complex data, extract useful information from it, and arrange that information in a specific required shape.
For example, consider the following scenario: you are a business analyst working for a retail company and you need to segment your customers based on annual purchase data. In the traditional world (what happened until the beginning of 2023, before the generative AI revolution exploded), the standard approach to this business question looked similar to the following:
For complex cases, this can be a project lasting several months, involving experienced data scientists, business analysts, data architects, data engineers, testers, and UI experts.
LLMs open the door to a parallel path: they can segment customer data in a simple way, provided we define a governed approach and smartly overcome some limitations (for example, prompt token size).
LARGE TABULAR DATA NAVIGATION: APPROACHES
Token limits in model implementations restrict the number of tokens processed in a single interaction to ensure efficient performance. The relationship between a token and a word is the following:
1 token ≈ ¾ of a word, so 100 tokens ≈ 75 words.
For example, GPT-3 has a 4,096-token limit, GPT-4 (8K) has an 8,192-token limit and GPT-4 (32K) has a 32,768-token limit.
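As a quick sketch, the rule of thumb above can be turned into a rough token estimator (real tokenizers vary by model, so this is only for sizing, not for exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~75-words-per-100-tokens rule of thumb.

    A model's real tokenizer should be used for exact counts; this heuristic
    is only for quickly judging whether data fits in a prompt.
    """
    words = len(text.split())
    return round(words / 0.75)  # ~1.33 tokens per word

print(estimate_tokens("segment customers by annual purchases"))  # → 7
```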
That said, going back to our customer segmentation use case, we have two scenarios:
For case 1, the approach is simpler and can be summarised in the following steps:
The image below shows a simple workflow for scenario 1.
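The scenario 1 workflow boils down to assembling one prompt that carries context, task, output format and the full dataset. A minimal sketch (file contents and column names here are hypothetical, not the article's actual data):

```python
# Hypothetical sample of annual purchase data (in practice, read from a file).
RAW_CSV = """customer_id,annual_spend,orders
C001,1200,15
C002,300,4
C003,5400,60
"""

def build_segmentation_prompt(raw_csv: str) -> str:
    """Assemble a scenario-1 prompt: context + task + output format + data."""
    return (
        "You are a retail analytics expert.\n"
        "Segment the customers below into groups, assigning every record to "
        "exactly one group, and explain each group.\n"
        "Return the result in CSV format with columns: customer_id,group.\n\n"
        "Data:\n" + raw_csv
    )

prompt = build_segmentation_prompt(RAW_CSV)
print(prompt)
```

The prompt string is then sent to the model as-is; pinning the output columns up front keeps the answer machine-readable.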
For case 2, the scenario is more complex: we need a way to go beyond the prompt size limit. Since one capability of LLMs is generating programming code, what we can do is ask the model to generate Python code that segments our data and writes the evidence to a CSV file.
Once we have this outcome, we can run the Python code, get the output and analyse it in an Excel file: job done!
After a test phase, this code could be injected into a production workload and used in day-to-day analyses.
The image below shows a simple workflow for scenario 2.
Demo
Consider the following example of simulated customer purchase data, generated by an LLM running on the IBM watsonx product:
Let me simulate the two scenarios described above (data small enough to inject into a single prompt, and data too large for a single prompt).
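Choosing between the two scenarios can itself be automated from a rough token estimate. A sketch, assuming the 4,096-token limit mentioned earlier and a hypothetical overhead budget for the instructions and the model's answer:

```python
def choose_scenario(raw_data: str, token_limit: int = 4096,
                    prompt_overhead: int = 500) -> int:
    """Pick scenario 1 (inline the data) or 2 (generate code) by rough size.

    Uses the ~1.33 tokens-per-word heuristic; prompt_overhead reserves room
    for instructions and the model's answer (the value is an assumption).
    """
    estimated = round(len(raw_data.split()) / 0.75)
    return 1 if estimated + prompt_overhead <= token_limit else 2

print(choose_scenario("small dataset " * 10))  # fits in one prompt → 1
print(choose_scenario("row of data " * 5000))  # too large → 2
```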
SCENARIO 1
The idea of scenario 1 is to ask the LLM to generate a segmentation of my customers, based on purchase data, injecting all the data into my prompt. As per best practices, I will organise the prompt with a context, asking the model to act as a retail expert and to segment the data given as input. Another best practice is to ask the model to arrange the output in a standard format: this is a crucial step if you need to integrate the model's output into a production workload, avoiding data management exceptions (for instance, data parsing errors).
Below is a prompt that answers this need:
We are asking the model to organise the data into groups, with the constraint that each record must belong to exactly one group, and we are asking for an explanation of each group. We are also asking for CSV output format.
Below is the output of the model:
As you can see, the model identified 4 groups, with the following auto-generated explanation:
This is a simple approach to consider if you have a small dataset that fits in a single prompt.
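Because the prompt pinned the output to CSV, the downstream parsing stays trivial. A sketch of consuming the model's answer (the group names here are illustrative placeholders, not the model's actual output):

```python
import csv
import io
from collections import defaultdict

# Illustrative model answer in the requested CSV shape (not actual results).
MODEL_OUTPUT = """customer_id,group
C001,regular
C002,occasional
C003,high_value
"""

def groups_from_csv(text: str) -> dict:
    """Parse the model's CSV answer into {group: [customer_ids]}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["group"]].append(row["customer_id"])
    return dict(groups)

print(groups_from_csv(MODEL_OUTPUT))
```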
SCENARIO 2
In this scenario, the objective is to analyse data whose size is not compatible with the prompt token limit.
The approach here changes: we will ask the model to generate Python code that applies a traditional unsupervised AI approach (a clustering algorithm) to cluster our data and writes the output to a CSV file.
Below is the prompt I defined to generate the Python code that groups customers:
As you can see, the prompt just provides the structure of a record and a single example record. Also, the request is to load the data from a file and generate a new output file, named "clustering_output.csv".
Below is the output from the model:
As you can see, the model includes a preprocessing step, where categorical columns (like "Type of appliance") are converted into numerical values, to match the clustering algorithm's expectations. The KMeans algorithm is then used to produce 3 clusters, and a new file ("clustering_output.csv") is generated. All you need is a Python environment to run this code and get the output. Naturally, I did exactly that; below are the results:
As you can see from the picture (just a portion of the results), each record has been assigned to a cluster ID (from 0 to 2). Your first customer segmentation is now done!
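The generated code typically follows the shape sketched below: encode categorical columns, cluster, write "clustering_output.csv". Column names and sample rows are hypothetical, and a minimal stand-alone KMeans replaces scikit-learn's implementation here only to keep the sketch self-contained:

```python
import csv
import random

def encode_categories(values):
    """Map categorical labels to integer codes (the preprocessing step)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def kmeans(points, k=3, iters=20, seed=0):
    """Minimal KMeans on numeric feature vectors (stand-in for sklearn's)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical input rows: (appliance type, annual spend).
rows = [("fridge", 1200), ("tv", 300), ("fridge", 5400),
        ("oven", 250), ("tv", 5100), ("oven", 1100)]
types = encode_categories([r[0] for r in rows])
points = [[t, spend] for t, (_, spend) in zip(types, rows)]
labels = kmeans(points, k=3)

# Write the clustered records out, as the generated code does.
with open("clustering_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["appliance", "annual_spend", "cluster"])
    for (appliance, spend), label in zip(rows, labels):
        writer.writerow([appliance, spend, label])
```

In production you would keep the model-generated scikit-learn version; the point of the sketch is only the pipeline shape (encode, cluster, export).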
CONCLUSIONS
In this article I explained how generative AI can be used to interpret tabular data, also sharing an approach to handle large datasets using the code generation capability of large language models.
This example can be "well-architected" and injected into a production workload, opening the door to a truly new data exploration era.
#generativeAI #segmentation #IBM #watsonx #tabulardataanalysis #beyondPromptLimit #syntheticdata