PandasAI: Shaping the Future of Conversational Data Analysis
Benjamin Wolba
eurodefense.tech | Fostering Defense Innovation for European Sovereignty | Blogging at future-of-computing.com
Over the past two years, large language models have revolutionized how we answer questions posed in plain English.
They have also paved the way for AI agents: (semi-)autonomous software systems capable of making decisions based on input data and performing actions to achieve specific goals. Beyond crafting compelling marketing copy, AI agents have proven invaluable both for boosting developers’ productivity and for helping non-technical people accomplish tasks without knowing how to code.
PandasAI was founded by Gabriele Venturi in 2023 to boost data scientists’ productivity and help their business peers get valuable insights from company data autonomously. It raised a $1.1M Pre-Seed round from Runa Capital, Episode1 Ventures, and Vento in the fall of 2023 and was part of the Y Combinator Winter 2024 Batch.
Learn more about the future of conversational data analysis from our interview with co-founder and CEO Gabriele Venturi:
Why Did You Start PandasAI?
In my previous entrepreneurial endeavors, I struggled with handling and analyzing large amounts of data. Data analysis is crucial for businesses to gain insights, yet it demands considerable effort and specialist knowledge of Python libraries like pandas.
After ChatGPT came out, we witnessed the first wave of generative AI startups building products, e.g., for writing marketing copy. Many seemed to lack defensibility and a clear moat differentiating them from ChatGPT. But I clearly felt that something big was coming and that this was just the beginning.
That’s when I started exploring how large language models could power AI agents to handle specific data-analysis tasks and build data pipelines, specifically for financial institutions. While talking to customers in the financial industry, I realized that although they were interested, sales cycles for a closed product would be long and unsustainable, so I decided to open-source the core engine I had already built on GitHub.
Within days, the GitHub project gathered 1,000 stars; it then went viral and reached 5,000 stars in the first two weeks. The project clearly struck a nerve, and the branding helped a lot: the name references one of the most popular Python libraries for data analysis, and we used eye-catching generative AI art for updates on social media. It got covered by several AI newsletters, and the number of users grew exponentially in the first months. The resonance was clear, so I decided to found PandasAI as a startup and go all in.
How Does PandasAI Help Data Analysts?
PandasAI allows users to describe what they want to do with pandas in plain English. They don’t have to learn specific Python libraries like pandas, and they don’t need to be proficient in Python: we allow them to do data analysis conversationally.
We prompt large language models with the user’s input to generate the pandas code needed to achieve the user’s objective. In that regard, we’re model-agnostic and can work with any sufficiently powerful model that is strong at reasoning and writing code. We then verify that the model’s answer is accurate and that the code runs; if not, we fall back and retry so the model gets it right.
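The generate-execute-retry loop described above can be sketched in a few lines of Python. This is a minimal illustration, not PandasAI’s actual implementation: the `generate_code` function is a hypothetical stand-in for a real LLM call, and its deliberately buggy first attempt exists only to demonstrate the fallback path.

```python
import pandas as pd

def generate_code(query: str, attempt: int) -> str:
    """Stand-in for an LLM call that turns a plain-English query into
    pandas code. Hypothetical: a real system would prompt a model here."""
    if attempt == 0:
        return "result = df['revenu'].sum()"  # typo, simulates a bad generation
    return "result = df['revenue'].sum()"

def run_with_retry(df: pd.DataFrame, query: str, max_attempts: int = 3):
    """Generate pandas code for the query, execute it, and retry on failure."""
    for attempt in range(max_attempts):
        code = generate_code(query, attempt)
        scope = {"df": df}
        try:
            exec(code, scope)        # run the model-generated snippet
            return scope["result"]   # code worked, return the answer
        except Exception:
            continue                 # fallback: ask the model again
    raise RuntimeError("could not produce working code")

df = pd.DataFrame({"revenue": [100, 250, 75]})
print(run_with_retry(df, "What is the total revenue?"))  # prints 425
```

Executing model-generated code in a scoped namespace and catching exceptions is what makes the retry loop possible: a failed attempt surfaces as an exception rather than crashing the session.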
Data scientists typically work in computational notebooks, like Jupyter notebooks, and PandasAI integrates directly with their existing workflows and notebooks, so everything else stays the same for them.
How Do You Pick Large Language Models?
There are so many large language models out there that choosing one can be hard. Generally, I’d say GPT-3.5 is the threshold: any model performing below it is too simplistic and can’t handle complex queries properly.
Most state-of-the-art open-source models are good enough—they’re still far from perfect but sufficient for common use cases. Giant state-of-the-art models like GPT-4o are still slower and much more expensive than smaller models, and the improvements have been incremental. So we’re working with a range of models from OpenAI, Anthropic (Claude), and Google, and we will train our own GPT model.
As a first step, we’re fine-tuning models, which already improves their performance for specific use cases. We’re also thinking of building our own tokenizer and, down the road, our own foundation model. It won’t be large, perhaps just a billion parameters, but it will be built specifically to handle pandas queries.
When Does Fine-Tuning a Model Make Sense Over Prompt Engineering?
Generally speaking, fine-tuning makes sense because the model has learned to perform specific tasks, so you don’t have to instruct it and can fit your prompts into a smaller context window. This is fine-tuning’s biggest advantage over prompt engineering: it saves time and money.
Ultimately, a model is just one component of the product, not the product itself. Defensibility comes mainly from building a superb product around the model and having access to unique, high-quality data to fine-tune it. Data quality matters enormously—low-quality data can make the fine-tuned model even worse. Yet creating high-quality data is hard: synthetic data often carries biases and may lead the model to overfit to doing things one way rather than another, while curating data manually takes a lot of effort and doesn’t scale.
How Did You Evaluate Your Startup Idea?
One key question is how to define the threshold between what should be open-source and freely available and what should be our core, closed-source product offering. I’m generally in favor of keeping that threshold as high as possible—not only because it’s good for growth and marketing, but also because I believe in open source: you should never change licenses, and open-source software should stay open.
PandasAI's value proposition is to make data scientists and their non-technical peers more productive. Features focused on enterprise teams and multiple users don’t have to be open-source, and a single such feature can deliver value to millions of potential enterprise users. As Databricks did with Spark, the product specifically addresses enterprises' needs, including role and permission management, single sign-on, and collaborative features.
We’ve already been in touch with Fortune 500 companies, so we will focus on working with large customers and meaningful enterprise contracts. At the same time, we’re building an ecosystem with PandasAI, where we can add new building blocks and find opportunities to upsell over time.
What’s Your Product Strategy For The Future?
Going forward, we also want to empower non-technical and business people to analyze data. At the moment, there’s typically a lot of back-and-forth between business people and data analysts on a team, and many of those questions can be answered automatically with PandasAI. Data scientists can then focus on complex, difficult questions rather than being bogged down by many simple requests.
Our long-term vision is to give users insights directly from their data and answer ‘why’ questions like ‘Why is our churn increasing?’ PandasAI will be able to provide useful insights, even ones you didn’t think to ask for.
So, we have two stakeholders: the end user, who may be a businessperson, and the data scientist, who saves time and effort as we empower non-technical peers with a self-service solution. Talking to the Head of Data proved to be a good entry point for landing pilot projects and eventually closing enterprise deals.
What Advice Would You Give Fellow Deep Tech Founders?
Don’t be a solo founder. Even if you think you could move 10x faster on your own and that a co-founder would slow you down, in the medium term the entire burden of running a startup will rest on you, which may not be sustainable. Startups are a roller coaster, and your chances of success are much higher if there are two of you—especially if you’re a first-time founder.