Integrating Python Pandas with ChatGPT: A new frontier

Integrating Python Pandas with ChatGPT: A new frontier

To explore the full details and practical examples, we highly recommend reading the entire article here.

Utilizing high-caliber Python libraries like Pandas and integrating them with powerful tools such as ChatGPT can substantially enhance productivity and streamline the process of extracting valuable insights from organizational data assets.

Over the years, I have been teaching Python and specifically focusing on Pandas. Recently, I discovered an exciting new project called PandasAI. This innovative library allows you to write queries in plain English, which are then automatically translated into Pandas commands in Python. I have experimented with PandasAI and uncovered some interesting results.

PandasAI is great for data scientists, analysts, and engineers who’d like to engage with data less artificially. This is especially helpful if you are beginners and do not understand how SQL or Python works or if you are able to spend weeks doing data manipulation. The natural language querying efficiency is also added for experienced professionals can also find it useful.

What is PandasAI?

PandasAI is open source Python software for labeled data, which allows to query your data with natural language. Not only does it translate your questions into python pandas code but also it has several others features that make the data analysis more efficient and also more complete.

Key Features of PandasAI

  • Natural Language Querying: Enables you to make questions in simple words without high-level terminologies.
  • Data Visualization: Simple and effective tools that are useful in the generation of graphs and charts for the clarification of the data you are dealing with.
  • Data Cleansing: Is useful especially when dealing with datasets with missing values.
  • Feature Generation: Improves the quality of the data you enter by creating new variables.
  • Data Connectors: There is a connection to different data such as CSV, XLSX, PostgreSQL, MySQL, BigQuery, Databricks, Snowflake, etc.

How it Works

PandasAI uses a generative AI type that discerns the meaning of a user’s question and translates it into Python code and SQL queries. Your actual data then engages with this code and gives you the results that you want.


Project Work & Practice

The datasets and query examples presented here were originally featured in the article, "Working with Data in Python: From Basics to Advanced Techniques. " If you are not yet comfortable with Python Pandas, I highly recommend reviewing that article.

# Running the following line will install the 'pandasai' library
!pip install pandasai        
# Import the pandas library, commonly used for data manipulation and analysis
import pandas as pd

# Import the pyplot module from the matplotlib library, used for plotting and visualization
import matplotlib.pyplot as plt

# Import the Agent class from the pandasai module, which can interface with large language models (LLMs)
from pandasai import Agent

# Import the SmartDataframe class from the pandasai module, which adds enhanced data manipulation features
from pandasai import SmartDataframe

# Import the OpenAI class within the pandasai.llm.openai submodule, which allows interfacing with OpenAI's language models
from pandasai.llm.openai import OpenAI        

Once you’ve completed the installation, we can begin our PandasAI experiment.


?? Creating a SmartDataframe

SmartDataframe is a class from the pandasai.llm.openai library, which is part of the Pandas AI ecosystem. This particular class leverages the capabilities of Large Language Models (LLMs), like OpenAI's GPT, to perform advanced data frame operations that typically require a deep understanding of the data context or might be too complex to implement using traditional methods.This allows you to perform complex operations and queries on your data frames using natural language commands.

In case you don't have an OpenAI API key yet, in this article, we will guide you through the process of generating your OpenAI API key for ChatGPT .

Now that we have everything set up, let's begin our comparison between the traditional way of working with Python Pandas ?? and the innovative approach of Natural Language Querying with PandasAI ??.


Find the top 5 products by sales

?? Traditional method

# Group the sales data by 'Product' and sum the 'Sales' for each product.
top_products_sales = sales_df.groupby('Product')['Sales'].sum()

# Sort the summed sales in descending order to get the best selling products at the top.
top_products_sales_sorted = top_products_sales.sort_values(ascending=False)

# Select the top 5 products with the highest sales.
top_5_products_sales_sorted = top_products_sales_sorted.head(5)

# Print the result to see the names and sales of the top 5 products.
top_5_products_sales_sorted        


?? Natural Language Querying with PandasAI

sales_smart_df.chat("Find the top 5 products by sales")        


Top 5 products by sales


Calculating the sales trend on a monthly basis for each region.

?? Traditional method

import pandas as pd
import matplotlib.pyplot as plt

# Drop duplicates and forward fill missing values
sales_df.drop_duplicates(inplace=True)
sales_df.fillna(method='ffill', inplace=True)

# Define the correct chronological order for months
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September','October', 'November', 'December']

# Convert the 'Month' column to a categorical type with the specified order
sales_df['Month'] = pd.Categorical(sales_df['Month'], categories=month_order, ordered=True)

# Pivot table to calculate total sales per month for each region
monthly_sales = sales_df.pivot_table(index='Month', columns='Region', values='Sales', aggfunc='sum')

# # Filter the DataFrame to include only data from January to June
filtered_monthly_sales_df = monthly_sales.loc[ 'January': 'June']

# Plotting the monthly sales trend for each region
filtered_monthly_sales_df.plot(kind='line', marker='o', figsize=(12, 8))
plt.title('Monthly Sales Trend per Region (January to June)')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend(title='Region')
plt.xticks(rotation=45)
plt.show()        


?? Natural Language Querying with PandasAI

sales_smart_df.chat("Calculates the sales trend on a monthly basis for each region.")        
Monthly Sales Trend per Region.


Sales Share by Product

?? Traditional method

# Compute the sales share for each product
sales_share = sales_df.groupby('Product')['Sales'].sum() / sales_df['Sales'].sum() * 100

# Plotting the sales share as a pie chart
sales_share.plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8))
plt.title('Sales Share by Product')
plt.ylabel('')  # Hide the y-label
plt.show()        


?? Natural Language Querying with PandasAI

sales_smart_df.chat("Calculates the Sales Share by Product and create a pie chart")        
Sales Share by Product.


Conclusion

Combining Pandas with ChatGPT allows anyone to build efficient data processing solutions based on the language interface that simplifies machine learning computations. We have gone all the way from basic data loading with Pandas to complex data analysis with ChatGPT, thus addressing the novices as well as the experts among the learners. This guide will help you to effectively manage and get the best out of both kinds of technologies.

Integrating these tools in practice shows great promise. FAC members and developers can gain deeper insights and access more context-rich information from the work of data analysts more quickly. Continue experimenting with various datasets and prompts to unlock the full potential of these combined tools.

?? Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.

InfinitePy Newsletter - Your source for Python learning and inspiration.


To explore the full details and practical examples, we highly recommend reading the entire article here .

要查看或添加评论,请登录

社区洞察

其他会员也浏览了