Analytics for Everyone | Data-driven decisions using ChatGPT

Preface: If you are non-technical but have an interest in technology or data-driven decision-making, please keep reading. While most of my blog posts are so niche that only about 0.002% of LinkedIn users can understand them, this one should be of interest to all of you.

Last week, a good friend of mine, Véronique Desjardins from Fondation Jeunesses Musicales Canada, reached out to me about calculating some metrics related to donations they received over the last few years. The task was straightforward: analyze an Excel spreadsheet with about 1,000 rows to determine the average donation, retention rates, donation trends, and other metrics. This is a task that most managers, business owners, and decision-makers frequently face in today's data-rich world.

At first, I felt a warm surge of excitement at the thought of using my trusty old friend, Pandas, to geek the heck out of this assignment. Having run so many similar analyses in the past, I figured I could spend an hour or two on this problem, extract some numbers, and be genuinely helpful. But then I started thinking...

Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.

In this era of self-service analytics and user-friendly tools, there must be a way for a non-technical person with a limited budget to pull out these numbers without hiring a data analyst. I wanted her to be able to run other analyses on her own. How about using ChatGPT? I had never used it for this kind of task before, but given its versatility, I figured it should be able to handle this as well.

The Process

Disclaimer: To prevent exposing actual donation amounts made to Jeunesses Musicales Canada, only mock data has been used in writing this analysis. Any metrics listed below are based on fictitious values.

The task at hand was simpler than what I usually tackle, but I decided to apply the same principles:

  • Data preparation and cleansing
  • Exploratory Data Analysis
  • Validation
  • Reporting

Data Engineering (aka export as CSV)

The Excel spreadsheet was already decently formatted, so it took very little effort to get it into an analytics-ready state. Firing up Excel for the first time in a while, I opened the document, deleted the non-relevant columns, and exported the data to a CSV file—pretty easy, right? Beyond that, the only extra step was to anonymize donor names using a hashing function before uploading the dataset to a pseudo-public domain.
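
For anyone wanting to reproduce the anonymization step, it only takes a few lines of Python. Here is a minimal sketch, assuming hypothetical file and column names (donor_name, fiscal_year, amount) rather than the actual spreadsheet labels:

```python
import hashlib
import pandas as pd

# Load the raw CSV export (file and column names are illustrative assumptions)
df = pd.read_csv("donations_raw.csv")

# Replace donor names with a short, irreversible hash so the file can be shared
df["donor_id"] = df["donor_name"].apply(
    lambda name: hashlib.sha256(name.encode("utf-8")).hexdigest()[:10]
)

# Keep only the columns needed for the analysis
df[["donor_id", "fiscal_year", "amount"]].to_csv("donations.csv", index=False)
```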

Data Analytics (aka ask GPT)

Now it was time for the real test. In my past experience, GPT (at least GPT-3.5) has never been particularly strong with numbers. While its statistical nature has proven to yield extraordinary results for text generation, pure arithmetic has never been one of its fortes. Still, let's give this a shot and see how it goes.

Right after uploading the CSV, it was able to understand the nature of each column:

ChatGPT prompt following CSV upload

A good start!

Next, I decided to try something a bit more interesting, like asking for the average and total donation values for each year.

Average donation value and total donations value for each year

This seemed about right. Double-checking a few numbers showed that it was actually spot on. I started to get curious about the mechanics behind this and how exactly the algorithm was computing the numbers.

Then I noticed the "please let me know if you want to know more" link, which led me to:

ChatGPT python code
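
The exact snippet isn't reproduced here, but it amounted to a plain Pandas group-by. A minimal sketch of the equivalent, using the same assumed column names as above:

```python
import pandas as pd

df = pd.read_csv("donations.csv")

# Average and total donation value for each fiscal year
yearly = df.groupby("fiscal_year")["amount"].agg(
    average_donation="mean",
    total_donations="sum",
)
print(yearly)
```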

Holy crap! GPT is pulling out Pandas and writing Python, exactly as I would have done myself. Pushing my luck, I asked:

Can you generate a plot for the average donation vs. year?

Average donation vs year
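
The code behind a chart like this is equally unremarkable; here is a sketch of what it roughly looks like with Matplotlib (not GPT's exact output, and column names still assumed):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("donations.csv")

# Average donation per fiscal year, plotted as a simple bar chart
avg_by_year = df.groupby("fiscal_year")["amount"].mean()
avg_by_year.plot(kind="bar")
plt.xlabel("Fiscal year")
plt.ylabel("Average donation ($)")
plt.title("Average donation vs fiscal year")
plt.tight_layout()
plt.show()
```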

Once this nicely formatted plot came out, I knew I was in for a treat. I went all in and asked:

Build a report presenting the following items:

  1. Generate a graph for the average donation value vs each fiscal year
  2. Generate a graph for the donation standard deviation value vs each fiscal year
  3. Generate a graph for the maximum donation value vs each fiscal year
  4. Generate a graph for the number of donors vs each fiscal year
  5. Generate a graph for how many donors have donated for x number of fiscal years
  6. List the 10 donors with the most years donated
  7. List the 10 donors with the largest amount donated
  8. Identify how many donors have increased their donations throughout the fiscal years
  9. Identify how many donors have decreased their donations throughout the fiscal years
  10. Identify how many donors have kept similar donations throughout the fiscal years
  11. Generate a graph that shows what percentage of donors who donated in a given year also donated in all subsequent years.

And I was not disappointed. GPT wrote all the code required to answer these questions and summarized the results directly in its reply.
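
To give a flavour of the code it generated, item 5 (how many donors have donated for a given number of fiscal years) boils down to a couple of lines. Here is a minimal sketch of the equivalent, again using the assumed column names rather than GPT's exact output:

```python
import pandas as pd

df = pd.read_csv("donations.csv")  # anonymized export; columns assumed: donor_id, fiscal_year, amount

# Number of distinct fiscal years in which each donor gave
years_per_donor = df.groupby("donor_id")["fiscal_year"].nunique()

# How many donors fall into each "number of years donated" bucket
donors_by_years_donated = years_per_donor.value_counts().sort_index()
print(donors_by_years_donated)
```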

Validation (aka Excel Sum)

Given that both the dataset and the requested metrics were relatively simple, I was confident that the results would be accurate. Nonetheless, a quick double-check of these data points confirmed that they were correct. Another interesting approach would be to copy and paste the code into a new prompt and ask ChatGPT to explain what the code is doing and whether it believes there are any errors in it.

Reporting (aka click share button)

I clicked on the "share chat" link at the top and sent the link to my friend.


That was it. No further context needed to be provided. No step-by-step instructions had to be written. It was self-explanatory and completely usable for a non-technical user. Mission accomplished. She could use the prompt as an example and ask for any additional metric she was interested in.

Conclusion

It's easy for data engineers like myself to get excited about infinitely scalable frameworks and highly advanced technologies, but we often lose sight of what really matters: telling a story that helps businesses (or systems) make better decisions and empowering them to autonomously produce similar results in different contexts.

Understanding and being able to articulate the process (data cleaning, preparation, analysis, reporting) is where we truly add value. The tools we use to get the job done are merely a means to an end, and they should scale down with the problem at hand.

With technologies like GPT, simple data analytics is accessible to anyone. It's just a matter of educating people on how to get there and how to critically evaluate the results. With the right training, anyone can run their own analyses.

Have you used ChatGPT or other LLMs for data analytics work? How was your experience?

And don't forget to follow me for more great content about data engineering.

Véronique Desjardins

Executive Director of the Foundation at Jeunesses Musicales Canada

10 months ago

Thank you Olivier Soucy for your help! For a non-profit organization with one full-time employee, knowing how to use tools like this well (aka more than just having ChatGPT help me speed up my communications) is amazing!

Marcelo Grebois

Infrastructure Engineer | DevOps | SRE | MLOps | AIOps | Helping companies scale their platforms to an enterprise-grade level

10 months ago

Exploring data in Excel can be a maze! Thankfully, AI like ChatGPT simplifies analysis for non-techies. Check the article for insights on data empowerment and engineering best practices. #dataanalysis Olivier Soucy
