Introduction To PandasAI Part 1
Introduction:
We currently spend a lot of time editing, cleaning, and analyzing data using various methods.
Pandas is a popular Python library for data manipulation.
Pandas keeps data in a structured format known as a "DataFrame". DataFrames let us alter, clean up, and analyze the data, for example by generating a bar graph, adding a new row or column, or replacing missing data.
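As a rough illustration of this kind of manual work, here is a small pandas sketch. The table, column names, and values are made up for this example and are not part of the dataset used later in the article.

```python
import pandas as pd

# Hypothetical sample data with one missing value (None becomes NaN).
sales = pd.DataFrame({
    "region": ["North", "South", "East"],
    "units": [10, None, 30],
})

# Replace the missing value with 0.
sales["units"] = sales["units"].fillna(0)

# Add a new column derived from an existing one.
sales["revenue"] = sales["units"] * 2.5

# Add a new row by concatenating a one-row DataFrame.
new_row = pd.DataFrame({"region": ["West"], "units": [20.0], "revenue": [50.0]})
sales = pd.concat([sales, new_row], ignore_index=True)

print(sales)
```

Each step is simple on its own, but every one has to be written and checked by hand.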
Performing these data manipulation operations by hand is time-consuming.
To overcome this drawback, we have the PandasAI library, an extension of pandas that aims to make data analysis and manipulation more efficient.
Advantages of PandasAI:
What is PandasAI?
How does PandasAI work?
It uses a generative AI model to understand and interpret natural language queries and translate them into Python code and SQL queries. It then runs that code against the data and returns the results to the user.
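The flow described above (question in, code generated, code executed, result returned) can be sketched with a toy example. This is not how PandasAI is implemented internally: `fake_model` and `answer` are hypothetical names, and the "model" is just a lookup table standing in for a real LLM.

```python
import pandas as pd


def fake_model(question: str) -> str:
    """Stand-in for an LLM: returns pandas code for one known question."""
    # Hypothetical lookup table; a real model would generate this text.
    canned = {
        "How many rows are there?": "result = len(df)",
    }
    return canned[question]


def answer(df: pd.DataFrame, question: str):
    code = fake_model(question)   # 1. natural language -> code
    namespace = {"df": df}
    exec(code, namespace)         # 2. run the code against the data
    return namespace["result"]    # 3. hand the result back to the user


toy_df = pd.DataFrame({"x": [1, 2, 3]})
print(answer(toy_df, "How many rows are there?"))
```

Because the generated code is executed, it is worth reviewing what was actually run before trusting a result.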
Who can use PandasAI?
It is designed for data scientists, data analysts, and data engineers who want to interact with their data in a more natural, conversational way.
It is most useful for those who are not familiar with SQL or Python or who want to save time and effort while working with data.
It is also useful for those who are familiar with Python and SQL as it allows them to ask questions about their data without having to write any complex code.
Getting started with PandasAI:
Step 1:
First, install PandasAI:
pip install pandasai
Step 2:
After installing PandasAI, we can start using it by importing the SmartDataframe class and instantiating it with the data.
from pandasai import SmartDataframe
Step 3:
Load the dataset into a DataFrame using a dictionary.
import pandas as pd
df = pd.DataFrame({
    "country": [
        "United States", "United Kingdom", "France", "Germany", "Italy",
        "Spain", "Canada", "Australia", "Japan", "China",
    ],
    "gdp": [
        19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
        1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064,
    ],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})
df.head()
df.shape
Step 4:
Initialize an OpenAI Large Language Model (LLM)
In this example PandasAI uses an OpenAI LLM, so we pass an OpenAI API key when creating the LLM object. Replace the placeholder with your own key and never publish a real key:
from pandasai.llm import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
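One way to keep the key out of the source code is to read it from an environment variable. The sketch below uses only the standard library; the helper name `get_openai_key` and the variable name `OPENAI_API_KEY` are choices made for this example.

```python
import os


def get_openai_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable, failing loudly if unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage with the OpenAI class imported above (not executed here):
# llm = OpenAI(api_token=get_openai_key())
```

With this approach the key lives in the shell or deployment configuration rather than in the notebook or script.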
Step 5:
Provide a text prompt and a DataFrame to PandasAI.
We can then use the chat() method to ask questions in natural language.
Wondering what a SmartDataframe is?
A SmartDataframe is a pandas (or polars) DataFrame that inherits all the properties and methods of pd.DataFrame and adds conversational features to it.
Now that we have instantiated the LLM, we can finally instantiate the SmartDataframe.
sdf = SmartDataframe(df, config={"llm": llm})
sdf.chat("Return the top 5 countries by GDP")
sdf.chat("What's the sum of the gdp of the 2 unhappiest countries?")
print(sdf.last_code_generated)
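To sanity-check the answers, the same two questions can be written by hand in plain pandas. This is a sketch of one equivalent query per prompt; the code PandasAI actually generates may look different.

```python
import pandas as pd

df = pd.DataFrame({
    "country": [
        "United States", "United Kingdom", "France", "Germany", "Italy",
        "Spain", "Canada", "Australia", "Japan", "China",
    ],
    "gdp": [
        19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
        1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064,
    ],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# "Return the top 5 countries by GDP"
top5 = df.nlargest(5, "gdp")["country"].tolist()
print(top5)

# "What's the sum of the gdp of the 2 unhappiest countries?"
unhappiest_gdp = int(df.nsmallest(2, "happiness_index")["gdp"].sum())
print(unhappiest_gdp)
```

Comparing these results with the chat() output is a quick way to confirm that the generated code did what the prompt asked.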
You can also use PandasAI to easily plot a chart.
sdf.chat("Plot a chart of the gdp by country")
sdf.chat("Plot a histogram of the gdp by country, using a different color for each bar")
What if we want to work with multiple DataFrames at a time?
If we want to work with multiple DataFrames at a time, we can use SmartDatalake instead of SmartDataframe.
Let us understand what a SmartDatalake is.
The concept is very similar to the SmartDataframe, but instead of accepting only one DataFrame as input, it can accept several.
Syntax:
from pandasai import SmartDatalake
Below we combine 2 different DataFrames into a SmartDatalake.
employees_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Name": ["John", "Emma", "Liam", "Olivia", "William"],
"Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}
)
salaries_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Salary": [5000, 6000, 4500, 7000, 5500],
}
)
lake = SmartDatalake(
[employees_df, salaries_df],
config={"llm": llm}
)
lake.chat("Who gets paid the most?")
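For comparison, here is how the same question could be answered by hand in plain pandas, joining the two tables on their shared EmployeeID column. This is a sketch of one equivalent query, not the code PandasAI generates.

```python
import pandas as pd

employees_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Name": ["John", "Emma", "Liam", "Olivia", "William"],
        "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
    }
)
salaries_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Salary": [5000, 6000, 4500, 7000, 5500],
    }
)

# Join the two tables on the shared key, then pick the row with the highest salary.
merged = employees_df.merge(salaries_df, on="EmployeeID")
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]
print(top_earner)
```

The SmartDatalake has to work out this join on its own from the column names, which is why clear, consistent key names help.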
Examples:
import pandas as pd
df = pd.DataFrame({
"country": [
"United States",
"United Kingdom",
"France",
"Germany",
"Italy",
"Spain",
"Canada",
"Australia",
"Japan",
"China",
],
"gdp": [
19294482071552,
2891615567872,
2411255037952,
3435817336832,
1745433788416,
1181205135360,
1607402389504,
1490967855104,
4380756541440,
14631844184064,
],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})
from pandasai import SmartDataframe
from pandasai.llm import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
sdf = SmartDataframe(df, config={"llm": llm})
sdf.chat("Return the top 5 countries by GDP")
sdf.chat("What's the sum of the gdp of the 2 unhappiest countries?")
sdf.chat("Plot a chart of the gdp by country")
sdf.chat("Plot a histogram of the gdp by country, using a different color for each bar")
from pandasai import SmartDatalake
employees_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Name": ["John", "Emma", "Liam", "Olivia", "William"],
"Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}
)
salaries_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Salary": [5000, 6000, 4500, 7000, 5500],
}
)
lake = SmartDatalake(
[employees_df, salaries_df],
config={"llm": llm}
)
lake.chat("Who gets paid the most?")
>> The employee who gets paid the most is Olivia.
users_df = pd.DataFrame(
{
"id": [1, 2, 3, 4, 5],
"name": ["John", "Emma", "Liam", "Olivia", "William"]
}
)
users = SmartDataframe(users_df, name="users")
photos_df = pd.DataFrame(
{
"id": [31, 32, 33, 34, 35],
"user_id": [1, 1, 2, 4, 5]
}
)
photos = SmartDataframe(photos_df, name="photos")
lake = SmartDatalake([users, photos], config={"llm": llm})
lake.chat("How many photos has been uploaded by John?")
>> 2
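The same count can be checked by hand in plain pandas. The sketch below assumes that photos.user_id refers to users.id, which is the relationship the SmartDatalake has to infer from the column names.

```python
import pandas as pd

users_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["John", "Emma", "Liam", "Olivia", "William"],
    }
)
photos_df = pd.DataFrame(
    {
        "id": [31, 32, 33, 34, 35],
        "user_id": [1, 1, 2, 4, 5],
    }
)

# Join photos to users; both tables have an "id" column, so add suffixes.
joined = photos_df.merge(
    users_df, left_on="user_id", right_on="id", suffixes=("_photo", "_user")
)

# Count the photos whose owner is John.
john_photos = int((joined["name"] == "John").sum())
print(john_photos)
```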
Embark on a transformative learning journey with Gamaka AI. Elevate your skills through engaging courses designed to make every concept relatable and applicable, ensuring you not only master the data but also connect with the story it tells. Join us and empower your future with the knowledge that goes beyond numbers. Check out our courses here: https://gamakaai.com/.