Using ChatGPT for Data Analysis #2
I had written an earlier post about using ChatGPT for Data Analysis where I was trying to reconcile the probabilistic nature of ChatGPT (I get different answers when I ask the same question) with the deterministic nature of data analysis (the average of 4 and 6 is 5, no matter how many times I ask the question).
In that post I had broken down how data analysis is done in ChatGPT Data Analyst and how it leverages Python for the deterministic part. That being said, the probabilistic nature of GenAI can introduce errors, and in this post I wanted to take an example and see the results for myself so I can understand the guardrails.
I googled for sample data and got this link to some sales data. I downloaded this into Excel and asked my first question -
"Please review the attached excel and explain what this data set is about"
The first response was -
When I asked the question again the response was -
This is a good example of how ChatGPT combines deterministic responses with probabilistic responses. The column names (deterministic - Python) did not change, as I would hope, but the text around them (probabilistic - GenAI) did change, although it stayed similar.
Then I asked it a 3rd time and this was the result -
Overall, I am OK with these responses. While all 3 are different, they are accurate and I get the point. That being said, the one place where errors can creep in is the description of each field name. My data set did not have any descriptors for the field names, so ChatGPT had to infer them. For example -
Rep - "The sales representative responsible for the sale is recorded in this column. This information can be used to evaluate the performance of individual sales reps and to provide targeted feedback or incentives."
While this is probably accurate, one can see how a hallucination could be a problem here. Those of us who have worked at enterprises know how field names aren't always the most descriptive.
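One guardrail I can think of is to compare a generated description with what is actually in the column. Below is a minimal sketch of that idea, assuming pandas is available. The rows are made up for illustration (only the column names Rep, Units and Total come from my data set), and `profile` is a hypothetical helper, not something ChatGPT produced.

```python
import pandas as pd

# Hypothetical rows standing in for the sales sheet; only the column names come from the post.
df = pd.DataFrame({
    "Rep": ["Jones", "Kivell", "Jones"],
    "Units": [95, 50, 36],
    "Total": [189.05, 999.50, 179.64],
})

def profile(frame):
    """Summarize each column so a generated description can be compared against the data."""
    return {
        col: {
            "dtype": str(frame[col].dtype),
            "distinct": int(frame[col].nunique()),
            "examples": frame[col].drop_duplicates().head(3).tolist(),
        }
        for col in frame.columns
    }

column_profile = profile(df)
print(column_profile["Rep"])
```

If "Rep" turned out to hold numeric codes or thousands of distinct values, the "sales representative" description would deserve a second look.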
Then I asked it - "what is the sales per rep per year with year on columns and rep on rows"
These were the responses -
So how is it doing the deterministic part? It is running Python code -
So what assumptions did GenAI make based on my prompt? The prompt was - "what is the sales per rep per year with year on columns and rep on rows".
It had to translate what I meant by "Sales", "Rep" and "Year". It was smart enough to know that Sales = the 'Total' field, Rep = the 'Rep' field and Year = the year taken from the Order Date field.
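To make that mapping concrete, here is a rough sketch of what the pivot would look like in pandas. This is my own reconstruction, not the code ChatGPT ran. The rows are made up, and I am assuming the date column is literally named "Order Date"; the real file may spell it differently.

```python
import pandas as pd

# Hypothetical rows; only the field names (Rep, Units, Total, Order Date) come from the post.
df = pd.DataFrame({
    "Order Date": pd.to_datetime(["2021-01-06", "2021-06-25", "2022-02-09", "2022-09-01"]),
    "Rep": ["Jones", "Kivell", "Jones", "Jones"],
    "Units": [95, 50, 36, 10],
    "Total": [189.05, 999.50, 179.64, 50.00],
})

# "Year" is not a column in the sheet, so it has to be derived from the order date.
df["Year"] = df["Order Date"].dt.year

# "Sales" was interpreted as the Total field, summed per rep (rows) and year (columns).
sales_per_rep_per_year = df.pivot_table(
    index="Rep", columns="Year", values="Total", aggfunc="sum"
)
print(sales_per_rep_per_year)
```

Each of the three interpretations (Sales, Rep, Year) shows up as one argument or one derived column, which is where I would look first if the numbers seemed off.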
What if I had meant Sales as in total number of units sold? I would have to change the prompt to - "what is the sales in units per rep per year with year on columns and rep on rows"
This is no different than what I would do in Tableau. Drag and drop the Total field and if I wanted Units, I'd drag that field over.
When I asked it to generate sales in units per rep per year it generated this code -
# Calculate total units sold per rep per year
units_per_rep_per_year = df.pivot_table(index='Rep', columns='Year', values='Units', aggfunc='sum')
tools.display_dataframe_to_user(name="Units Sold Per Rep Per Year", dataframe=units_per_rep_per_year)
# Display the result
units_per_rep_per_year
Somehow it knew how to translate my query into these attributes - index='Rep', columns='Year', values='Units', aggfunc='sum' in the function df.pivot_table.
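To convince myself that the pivot step itself is deterministic, it helps to spell out what those four arguments ask for. The sketch below is a simplified stand-in in plain Python, not how pandas implements `pivot_table`. The `pivot_sum` helper and the rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; only the field names come from the post.
rows = [
    {"Rep": "Jones", "Year": 2021, "Units": 95},
    {"Rep": "Kivell", "Year": 2021, "Units": 50},
    {"Rep": "Jones", "Year": 2022, "Units": 36},
    {"Rep": "Jones", "Year": 2022, "Units": 10},
]

def pivot_sum(rows, index, columns, values):
    """Group rows by (index, columns) and sum the values field for each group."""
    table = defaultdict(dict)
    for row in rows:
        row_key, col_key = row[index], row[columns]
        table[row_key][col_key] = table[row_key].get(col_key, 0) + row[values]
    return {key: dict(cols) for key, cols in table.items()}

units = pivot_sum(rows, index="Rep", columns="Year", values="Units")
print(units)
```

Given the same rows and the same four arguments, this always produces the same table. The probabilistic part is only in choosing the arguments from my prompt.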
In my next post I will see if I can figure out what errors could creep into the code generation piece of this black box.