Introduction to Python Polars: A High-Efficiency DataFrame Library Built to Scale
Eduardo Miranda
Entrepreneur, author, and teacher. Follow along for posts about technology, AI, and my learning journey.
Polars efficiently handles millions of rows, making Python code simpler and cleaner. In terms of speed, Polars is not just quick; it's incredibly fast.
To explore the full details and practical examples, we highly recommend reading the entire article here.
This article is meant for those who are already familiar with pandas and are curious about whether Polars could be a good addition to their workflow. If you are not yet familiar with pandas, we highly recommend starting with the article Working with Data in Python: From Basics to Advanced Techniques to gain a foundational understanding of pandas.
Today, there are plenty of Python libraries for working with data, and pandas is the most commonly used one.
Over the years, pandas has established itself as the go-to tool for data analysis in Python. The project, initiated by Wes McKinney in 2008, reached its major milestone with the 1.0 release in January 2020. Since then, it has remained a staple in the data analysis community and shows no signs of fading.
Despite its popularity, pandas is not without its flaws. Wes McKinney (pandas' creator) has highlighted several of these challenges, and most online critiques focus on two main issues: high memory consumption when working with large datasets, and limited support for multi-threaded execution.
In an effort to address these shortcomings, Richie Vink developed Polars. In a detailed 2021 blog post, Vink presented metrics that substantiate his claims regarding Polars' improved performance and its more efficient design.
In this article, we will talk about what Polars is, some of its functionalities, and a practical use case where Polars performs outstandingly.
Why Polars?
As datasets grow and speed becomes a deciding factor, new libraries like Polars emerge to improve on their predecessors' performance. Polars is an exceptionally fast DataFrame library designed for handling structured data. Its core is developed in Rust, with bindings for Python, R, and NodeJS.
Polars offers several benefits that make it an attractive choice for data manipulation and analysis, two of the most important being its Apache Arrow memory model and its use of SIMD instructions.
Apache Arrow establishes a columnar memory format that is platform-agnostic, catering to both flat and hierarchical data structures. This format is optimized for efficient analytical processing on contemporary hardware, including both CPUs and GPUs. Additionally, the Arrow memory format allows for zero-copy reads, enabling extremely fast data access without the burden of serialization overhead.
Single Instruction Multiple Data (SIMD) is an advanced microarchitecture method used in processors. This technique allows one instruction to simultaneously perform an operation on multiple data points. For example, it can multiply several numbers in just one clock cycle of the processor.
Basic Usage
To begin using Polars, you'll need to install it. This can be done easily with pip:
# Running the following line will install the 'polars' library
!pip install polars
Once installed, you can start using Polars just like any other DataFrame library.
# Import the polars library as pl to handle data frames efficiently
import polars as pl
Here's a simple example to demonstrate Polars' basic functionality.
# Create a DataFrame using polars (similar to pandas, but optimized for performance)
# The DataFrame contains three columns: 'name', 'age', and 'salary'
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 70000]
})
# Print the initial DataFrame to the console for visualization
print("Initial DataFrame:")
print(df)
# Filter the DataFrame to include only rows where the 'age' column is greater than 28
# This is achieved using the filter method and the col function from polars to select the 'age' column
filtered_df = df.filter(pl.col("age") > 28)
# Print the filtered DataFrame to show only those entries where age > 28
print("\nFiltered DataFrame (age > 28):")
print(filtered_df)
# Group the DataFrame by the 'age' column and aggregate the 'salary' column
# Specifically, calculate the sum of the 'salary' for each unique age
# The agg method is used for aggregation, and alias is used to rename the resulting column to 'total_salary'
grouped_df = df.group_by("age").agg([pl.sum("salary").alias("total_salary")])
# Print the grouped DataFrame to display the total salary for each age group
print("\nGrouped DataFrame (total salary by age):")
print(grouped_df)
Deep Dive into Functions
Let’s explore some of Polars' advanced functionalities through examples:
1. Lazy Execution
Lazy execution allows you to declare a series of transformations and execute them all at once. This can significantly improve performance for complex workflows.
# Convert the DataFrame to a LazyFrame. LazyFrames allow you to build up
# a query (series of transformations) without executing them immediately.
# This can optimize performance by combining operations and reducing
# multiple scans through your data.
lf = df.lazy()
# Declare transformations on the LazyFrame.
# Transformation 1: Filter the rows where the 'age' column is greater than 28.
# Transformation 2: Group the filtered data by the 'age' column.
# Transformation 3: Aggregate the group by summing the 'salary' column and renaming the result to 'total_salary'.
lazy_result = lf.filter(pl.col("age") > 28).group_by("age").agg([pl.sum("salary").alias("total_salary")])
# Execute transformations.
# The collect() method triggers the execution of the query built so far in the LazyFrame.
# This reads the data, applies the filter, groupby, and aggregation, and returns a conventional DataFrame.
result = lazy_result.collect()
# Print the result.
print("Lazy Execution Result:")
print(result)
2. Parallel Execution
Polars can automatically parallelize operations to take full advantage of multicore processors.
# Create a large DataFrame ('df_large') with a single column named 'num'.
# The column 'num' is populated with integers ranging from 1 to 1,000,000.
# The 'list(range(1, 1000001))' generates a list starting from 1 up to and including 1,000,000.
df_large = pl.DataFrame({"num": list(range(1, 1000001))})
# Apply a transformation to the DataFrame using Polars' select method.
# The pl.col("num") references the 'num' column of the DataFrame.
# The '*' operator doubles each value in the 'num' column, effectively creating a new column with these transformed values.
# Polars executes columnar operations in its multi-threaded Rust core,
# which can significantly speed up work on large datasets.
parallel_result = df_large.select(pl.col("num") * 2)
# Print the transformed DataFrame 'parallel_result', which contains the doubled values of the 'num' column.
print("Parallel Execution:")
print(parallel_result)
Real-World Use Case: Financial Data Analysis
Let's simulate a real-world use case where we analyze a large dataset of stock prices to find trends and calculate moving averages.
# Step 1: Data Loading
# Load the stock price data from the given URL and read it into a DataFrame using Polars
df = pl.read_csv("https://infinitepy.s3.amazonaws.com/samples/stock_price.csv")
# Print the initial data to get a quick look at the first few rows
print("Initial Data:")
print(df.head()) # head() method shows the first 5 rows by default
# Step 2: Data Cleaning
# Remove rows with any null (missing) values and store the cleaned DataFrame
df_clean = df.drop_nulls()
# Print the cleaned data to inspect the first few rows after removing null values
print("\nCleaned Data:")
print(df_clean.head())
# Step 3: Calculate moving averages
# Calculate 7-day and 30-day moving averages for the 'Price' column
# The with_columns method is used to add new columns to the DataFrame
df_clean = df_clean.with_columns([
# Calculate 7-day moving average of 'Price' and name the resulting column '7_day_ma'
pl.col("Price").rolling_mean(window_size=7).alias("7_day_ma"),
# Calculate 30-day moving average of 'Price' and name the resulting column '30_day_ma'
pl.col("Price").rolling_mean(window_size=30).alias("30_day_ma")
])
# Print the first 40 rows of the data to see the moving averages
print("\nData with Moving Averages:")
print(df_clean.head(40)) # head(40) shows the first 40 rows
# Step 4: Find Crossovers
# Define the condition for crossovers: when the 7-day moving average crosses above the 30-day moving average
# Use the filter method to apply this condition
crossovers = df_clean.filter(
(pl.col("7_day_ma") > pl.col("30_day_ma")) & # Current condition where 7-day MA is greater than 30-day MA
(pl.col("7_day_ma").shift(1) <= pl.col("30_day_ma").shift(1)) # Previous condition where 7-day MA was less than or equal to 30-day MA
# shift(1) looks at the previous row; this helps to detect the crossover point
)
# Print the rows where crossovers are detected
print("\nCrossovers:")
print(crossovers)
Conclusion
Polars is a powerful DataFrame library that offers significant performance advantages over traditional libraries like pandas. Its ability to handle large datasets efficiently and its emphasis on speed and memory usage make it an excellent choice for data-intensive applications. Whether you're dealing with financial data, time series, or large-scale data analytics, Polars can help you achieve faster and more efficient results.
By incorporating Polars into your data analysis workflows, you can take full advantage of modern hardware capabilities and achieve better performance, giving you more time to focus on deriving insights from your data rather than worrying about execution speed.
Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.
InfinitePy Newsletter - Your source for Python learning and inspiration.