Diary of an Architect Series - (3) ChatGPT (Advanced) Data Analysis and the Future
I started working on a larger article regarding augmented data engineering, but it is still a work in progress. To stay true to the diary format here, I should take you on the journey through ideas and influencing factors shaping my opinions and directions. I believe it will make for a much better story than putting everything together without this journey ;-).
As part of my preparations for the articles, and out of a great deal of general curiosity, I have been diving deeper into the (Advanced) Data Analysis tool available in ChatGPT. In my previous article, Data Strategies for the Gen. AI era, I described a data pattern called Query/Code processing retrieval, which was significantly influenced by the ChatGPT (Advanced) Data Analysis tool. Beyond making for a fascinating pattern, it is also a major direction for imagining what future data analysis, and potentially data engineering, work could look and feel like.
So buckle up, it is going to be a wild ride :-) with both heights that will blow your mind and lows that will demonstrate why to stay vigilant and grounded.
What is an (Advanced) Data Analysis tool (in ChatGPT)
So, first things first, let's make a quick intro to the tool and some basics around it so we start from the same baseline of understanding.
The ChatGPT (Advanced) Data Analysis tool, previously known as Code Interpreter, is only available to ChatGPT Plus users.
I really appreciated that the OpenAI team decided to change the name of this tool from Code Interpreter to Advanced Data Analysis because the original name did not express very well what this tool is capable of and therefore it did not directly catch the attention of the right user population.
(Editor note: OpenAI can't make up their mind :-) the tool is now called Data Analysis in the most recent iteration.)
In essence, the (Advanced) Data Analysis tool is a fascinating capability that is changing the way we can interact with data by giving us a direct natural language interface to analyze, transform, and visualize data. The ultimate no-code strategy, interestingly not only for no-coders but more on that later.
The most amazing feature of the tool is that it can directly interact with the data that has been provided and execute its own generated code in a sandbox environment contained within the chat session.
Where can I find it? How to enable it?
The tool requires a ChatGPT Plus subscription. To enable it for your session, select it from the GPTs "Explore" menu.
When the (Advanced) Data Analysis option is selected, not much really changes apart from a paper-clip icon added to the message bar. (After a recent update, the paper-clip icon in the left corner of the message bar is always present because of the feature that allows picture uploads.) In the case of the Data Analysis tool, though, the paper-clip icon allows you to upload CSV and other common-format data files.
However, don't be fooled by the fact that the UI is exactly the same as for any other GPT experience, because in the background the magic really happens.
This is probably the most underestimated part of the tool. The whole chat interface gives you almost no sign that it is in a different mode. In fact, the change is so small that many could miss the fact that the tool is enabled, or think that it is not working.
The tool has real potential to transcend chat and truly supercharge its value proposition. Arguably, it should be a whole separate experience from the classic chat, because right now it is like using Adobe Photoshop through chat commands alone. I'll expand on this later in the article.
What does it do? How does it work?
In the background, while the ChatGPT (A)DA tool is enabled, the user can upload files to a temporary environment that makes the data available for direct interaction inside the chat session.
What the (A)DA tool really does is translate the user's natural language instructions into Python code that is then directly executed inside the sandbox environment to manipulate or visualize the data, depending on the user's requests.
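For illustration, a request like "show me total sales per region, highest first" might be translated into pandas code along these lines. This is only a sketch of the kind of code the tool generates; the dataset and column names here are hypothetical, standing in for an uploaded CSV file:

```python
import pandas as pd

# Hypothetical dataset standing in for a file uploaded to the sandbox
data = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "APAC"],
    "sales":  [120,  80,   200,  150,  90],
})

# The kind of code the tool might generate for
# "show me total sales per region, highest first"
summary = (
    data.groupby("region")["sales"]
        .sum()
        .sort_values(ascending=False)
)
print(summary)
```

The interesting part is not the code itself, which is routine pandas, but that the user never has to see or write it: the sandbox executes it and the chat shows only the result.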
A major benefit of this approach is that it can "reliably" process large amounts of structured data.
The above is an important observation. In the current Gen. AI paradigm, processing large amounts of structured data is not as straightforward, because for structured data we don't have a direct natural language interface like we do for unstructured data (e.g. semantic/cognitive search).
Although structured data can be placed directly within the LLM context, the accuracy and reliability of the processing is highly questionable and not very robust. Even the largest 200k-token contexts today are still limiting given the amount of structured data we have in our data stores. So the remaining option is to process the data outside of the LLM. Just as RAG emerged to tame the even larger unstructured data landscape, a similar perspective is necessary for structured data.
ChatGPT (A)DA tool does exactly that and adds the usual mind-bending flair of using Generative AI to do that. It is something entirely different to see this process in action.
I am talking a lot about data processing and manipulation, but the ChatGPT (A)DA tool can of course execute not only operations on the data but can really execute anything that one could do from a typical Python environment within e.g. Jupyter Notebook.
(Image: an example of the type of actions one can execute in the environment.)
As I mentioned earlier, the tool might suggest its direction as the ultimate no-code companion, but in its current form it is more geared toward code-aware users who want to quickly experiment and save time by not needing to write all the code by hand (like me :-)).
But this experience goes beyond just helping with writing code. In this case, one does not even have to think in terms of structuring or formatting the code. The equation is completely changed because the outcome or result that one wants to achieve is always front and center.
Why do I say it is completely changed? Think about it: in traditional data processing or visualization, we don't interact by describing our needs - we "know" what we want to achieve, and so we configure tools or write code to achieve that goal. Here, on the other hand, it is quite different.
The code is written automatically, so the definition of the result is more important in the interactions, and progress is made by shaping and clarifying this outcome/goal definition.
In any case, I am not joking when I say that it helps me reduce the time for running medium-sized analyses from days to hours. That probably tells you something about my level of skill in writing code or my proficiency in using BI tools (haha), but it is sort of an incredible experience when I can think more about my goals and results than about the way I code them.
What I found by using the tool is that the augmentation allows me to choose one of two things.
Examples
The best way to explain what this tool can do is to look at a few examples. If you are interested in more detail focusing on the actual prompts and the tool's capabilities, please check out the companion article I've posted on the Medium platform, where I'll keep some of the adjacent work.
To manage the length and engagement of this article, I only summarize and make final takeaways here.
I am trying a new format to break down the work into manageable chunks so everyone can choose their experience diving deeper or getting the main thoughts.
I highly recommend checking out the Medium article as I believe it provides a good overview of the intricacies of using the tool and provides valuable insights on how to enhance the experience.
The Great, The Good, and The Ugly
(Editor note: below you will find some AI-generated content that was verified and curated by me. I used AI to help create the backbone of the content from my own article mentioned above to distill the core messages for parts of the summary section below.)
The Great
Have you ever been in a situation where technology not only meets but exceeds your expectations? That's exactly what happened with my deep dive into the ChatGPT (A)DA tool.
Where the tool truly shined was in its ability to translate natural language queries into code automatically. It's one thing to conceptualize data analysis steps, but watching those steps take form as executable code in real time was a remarkable experience.
The most striking part was the tool’s capacity for auto-correction. If it hit a bump, it didn’t just stop; it adapted, corrected itself, and moved forward.
There’s something genuinely fascinating about seeing your data being understood and manipulated by an AI in a way that feels both intelligent and intuitive. It was this blend of precision, adaptability, and efficiency that made working with the (A)DA tool a standout experience in my data analysis exploration.
The Good
Now, let's shift our focus to the versatile capabilities of the ChatGPT (A)DA tool, which stood out in several practical aspects. The tool's versatility really came to the forefront when dealing with various data analysis tasks. Whether it was data manipulation or visualization, the tool consistently proved its utility.
One aspect that particularly impressed me was its data visualization capabilities. The ease and intuitiveness with which I could generate and tailor graphs and other visual representations were remarkable. It turned the usually time-consuming task of visualizing complex data sets into a more straightforward process. A task that would typically require a deep dive into Python libraries or the configuration of a reporting tool was genuinely reduced to only imagining and asking for the outcome. (Check out the companion article for real-world examples)
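To make the "imagining and asking for the outcome" point concrete, a request like "plot monthly sales as a bar chart and highlight the best month" could plausibly produce matplotlib code of this shape. This is my own hedged sketch, not the tool's actual output, and the data here is made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly data standing in for an uploaded file
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [100, 140, 90, 180],
})

# Highlight the best month in a different color, as the request asked
colors = ["tab:orange" if s == df["sales"].max() else "tab:blue"
          for s in df["sales"]]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(df["month"], df["sales"], color=colors)
ax.set_title("Monthly Sales")
ax.set_ylabel("Sales")
fig.savefig("monthly_sales.png", dpi=150)
```

Writing this by hand is not hard, but it is exactly the kind of boilerplate that the tool removes: you describe the chart, and iterating on labels, colors, or layout becomes a matter of rephrasing rather than re-coding.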
It is also different from conversational analytics tools, which typically require considerable upfront preparation of the model and curation of the data to be able to respond in a similar conversational style. The (A)DA tool is unique in that it does not need the data to be prepared in any particular way; in fact, the tool helps with data preparation and data curation tasks. Additionally, the flexibility and customization of visualizations are significantly greater in the (A)DA tool given the direct use of advanced Python libraries like matplotlib, plotly, seaborn, pandas visualization, or bokeh.
This is not to say that conversational analytics tools are no longer necessary; in fact, I believe the audiences for (A)DA and conversational analytics tools are different.
The user persona for the (A)DA tool may not be as straightforward as it might seem. The natural language interface would suggest no-code users, but in my experience the tool caters to users who are familiar with coding but don't necessarily want to get bogged down in the minutiae of scripting every single command. It's like having a co-pilot who speaks your language, understands your objectives, and takes care of the heavy lifting, allowing you to focus more on strategy and insights rather than on the coding itself - but the code is still an important part of the equation.
The ChatGPT (A)DA tool is about enhancing the way we work. It'll not be for everyone, but given the right persona fit, it'll have a significant positive impact on the speed, comfort, and abilities of those using it.
The Ugly
While the ChatGPT ADA tool has its remarkable strengths, it's only fair to discuss some of the limitations that I encountered. These challenges, though not deal-breakers, are important to consider for anyone looking to use this tool effectively.
One significant issue was handling extremely large datasets. For instance, when I worked with the comprehensive data from my solar system, I bumped into the tool's limitations. (Check out the companion article for more details about the scope of the example I was running)
The larger the dataset, especially when reaching sizes of around 1GB, the more the tool struggled. I encountered out-of-memory errors, which were like hitting a wall in the middle of an analysis sprint. This limitation is something to be mindful of, particularly if you're planning to work with particularly large datasets.
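One mitigation worth knowing about, whether you apply it yourself or nudge the tool toward it, is chunked processing: aggregating a large file piece by piece instead of loading it all into memory at once. A minimal pandas sketch, using a small generated CSV as a stand-in for a dataset too large to load whole:

```python
import pandas as pd

# Generate a small CSV standing in for a file too large to load at once
pd.DataFrame({"power_kw": range(1, 101)}).to_csv("readings.csv", index=False)

# Stream the file in chunks and aggregate incrementally,
# keeping only running totals in memory
total = 0.0
rows = 0
for chunk in pd.read_csv("readings.csv", chunksize=25):
    total += chunk["power_kw"].sum()
    rows += len(chunk)

print(f"mean power: {total / rows:.1f} kW over {rows} rows")
```

This pattern trades convenience for memory: only one chunk lives in RAM at a time, which is why it sidesteps the out-of-memory wall, at the cost of restricting you to operations that can be computed incrementally.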
Another area that presented a challenge was the tool's chat-based interface in the context of complex data analysis. While the capabilities of the approach are innovative, it sometimes feels like trying to navigate a labyrinth without a map. Keeping track of the analysis flow within the confines of a chat interface was quite tricky, especially during necessary data explorations.
It is clear that there is a big opportunity to improve the user experience, especially by building and leveraging analysis context (model, objects, functions, ...).
Earlier in this article, I mentioned that it sometimes felt like using Photoshop to edit photos without being able to click a button or select an area to edit. The ADA experience is undoubtedly innovative, but to make it truly analytics-ready, it should offer a more tailored and enhanced experience for the target persona, with at least model context, object management, and other features helping to manage analytics activity beyond one-off processing and visualization tasks.
Switching gears: the tool's sandbox environment, while a very powerful and important capability, had its moments of instability. This manifested in occasional connection errors, timeouts, and even environment resets, which could disrupt the flow of work.
In summary, while the ChatGPT ADA tool opens up new horizons in data analysis, it's important to be aware of these quirks and limitations. Understanding these aspects helps in setting realistic expectations and planning your data analysis projects accordingly.
General Use & Larger Implications
In reflecting on the broader use and future implications of the ChatGPT ADA tool, it’s clear that its impact goes beyond just individual data projects. The tool’s capabilities and limitations offer insights into the evolving landscape of data analysis and the potential trajectory of AI-assisted tools.
Redefining Data Analysis Workflows: The ADA tool also hints at a future where traditional data analysis workflows are transformed. By automating and simplifying complex tasks, it allows users to focus more on strategic thinking and less on the technical minutiae of data manipulation. This shift could lead to more efficient and creative approaches to data analysis, where the emphasis is on insight and innovation rather than process and procedure.
Challenges and Adaptation: The tool's limitations with handling large datasets and its sometimes cumbersome chat interface highlight areas where AI-assisted tools still need to evolve. These challenges remind us that while the future of data analysis is bright, it is still a work in progress.
Implications for Professional Data Analysts: For seasoned data professionals, tools like (A)DA might not immediately impact their current toolset, as the tool lacks some basic features needed to manage a professional workflow. However, it offers a glimpse of a future in which expertise is enhanced with a supplementary layer of efficiency and creativity, enabling analysts to tackle more complex problems and deliver deeper insights.
In conclusion, the ChatGPT ADA tool is not just a standalone solution but a signpost to the future of data analysis. It underscores the potential of AI in augmenting human capabilities and the ongoing need for balance and collaboration between human intelligence and machine efficiency.
Takeaways
Reflecting on my experience with the ChatGPT (A)DA tool, several key takeaways stand out:
Links to the other articles from the series:
#ai, #artificialintelligence, #dataanalysis, #datapatterns, #genai, #datavisualization, #chatgpt, #dataengineering