Introduction to Python Polars: A High-Efficiency DataFrame Library Built to Scale
Eduardo Miranda
Entrepreneur, author, and teacher. Follow along for posts about technology, AI, and my learning journey.
Polars efficiently handles millions of rows, making Python code simpler and cleaner. In terms of speed, Polars is not just quick; it's incredibly fast.
To explore the full details and practical examples, we highly recommend reading the entire article here.
This article is meant for those who are already familiar with pandas and are curious about whether Polars could be a good addition to their workflow. If you are not yet familiar with pandas, we highly recommend starting with the article Working with Data in Python: From Basics to Advanced Techniques to gain a foundational understanding of pandas.
Today, there are plenty of Python libraries for working with data, and pandas is the most commonly used one.
Over the years, pandas has established itself as the go-to tool for data analysis in Python. The project, initiated by Wes McKinney in 2008, reached its major milestone with the 1.0 release in January 2020. Since then, it has remained a staple in the data analysis community and shows no signs of fading.
Despite its popularity, pandas is not without its flaws. Wes McKinney (pandas' creator) has highlighted several of these challenges, and most online critiques focus on two main issues: high memory consumption when working with large datasets, and limited support for multi-threaded execution.
In an effort to address these shortcomings, Richie Vink developed Polars. In a detailed 2021 blog post, Vink presented metrics that substantiate his claims regarding Polars' improved performance and its more efficient design.
In this article, we will talk about what Polars is, some of its functionalities, and a practical use case where Polars performs outstandingly.
Why Polars?
As datasets grow and speed becomes a deciding factor, new libraries like Polars emerge to improve on their predecessors' performance. Polars is an exceptionally fast DataFrame library designed for handling structured data. Its core is developed in Rust, with bindings for Python, R, and NodeJS.
Polars offers several benefits that make it an attractive choice for data manipulation and analysis, two of the most important being its Apache Arrow memory model and its use of SIMD instructions.
Apache Arrow establishes a columnar memory format that is platform-agnostic, catering to both flat and hierarchical data structures. This format is optimized for efficient analytical processing on contemporary hardware, including both CPUs and GPUs. Additionally, the Arrow memory format allows for zero-copy reads, enabling extremely fast data access without the burden of serialization overhead.
Single Instruction Multiple Data (SIMD) is an advanced microarchitecture method used in processors. This technique allows one instruction to simultaneously perform an operation on multiple data points. For example, it can multiply several numbers in just one clock cycle of the processor.
Basic Usage
To begin using Polars, you'll need to install it. This can be done easily with pip:
# Running the following line will install the 'polars' library
!pip install polars
Once installed, you can start using Polars just like any other DataFrame library.
# Import the polars library as pl to handle data frames efficiently
import polars as pl
Here's a simple example to demonstrate Polars' basic functionality.
# Create a DataFrame using polars (similar to pandas, but optimized for performance)
# The DataFrame contains three columns: 'name', 'age', and 'salary'
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 70000]
})
# Print the initial DataFrame to the console for visualization
print("Initial DataFrame:")
print(df)
# Filter the DataFrame to include only rows where the 'age' column is greater than 28
# This is achieved using the filter method and the col function from polars to select the 'age' column
filtered_df = df.filter(pl.col("age") > 28)
# Print the filtered DataFrame to show only those entries where age > 28
print("\nFiltered DataFrame (age > 28):")
print(filtered_df)
# Group the DataFrame by the 'age' column and aggregate the 'salary' column
# Specifically, calculate the sum of the 'salary' for each unique age
# The agg method is used for aggregation, and alias is used to rename the resulting column to 'total_salary'
grouped_df = df.group_by("age").agg([pl.sum("salary").alias("total_salary")])
# Print the grouped DataFrame to display the total salary for each age group
print("\nGrouped DataFrame (total salary by age):")
print(grouped_df)
Deep Dive into Functions
Let’s explore some of Polars' advanced functionalities through examples:
1. Lazy Execution
Lazy execution allows you to declare a series of transformations and execute them all at once. This can significantly improve performance for complex workflows.
# Convert the DataFrame to a LazyFrame. LazyFrames allow you to build up
# a query (series of transformations) without executing them immediately.
# This can optimize performance by combining operations and reducing
# multiple scans through your data.
lf = df.lazy()
# Declare transformations on the LazyFrame.
# Transformation 1: Filter the rows where the 'age' column is greater than 28.
# Transformation 2: Group the filtered data by the 'age' column.
# Transformation 3: Aggregate the group by summing the 'salary' column and renaming the result to 'total_salary'.
lazy_result = lf.filter(pl.col("age") > 28).group_by("age").agg([pl.sum("salary").alias("total_salary")])
# Execute transformations.
# The collect() method triggers the execution of the query built so far in the LazyFrame.
# This reads the data, applies the filter, groupby, and aggregation, and returns a conventional DataFrame.
result = lazy_result.collect()
# Print the result.
print("Lazy Execution Result:")
print(result)
2. Parallel Execution
Polars can automatically parallelize operations to take full advantage of multicore processors.
# Create a large DataFrame ('df_large') with a single column named 'num'.
# The column 'num' is populated with integers ranging from 1 to 1,000,000.
# The 'list(range(1, 1000001))' generates a list starting from 1 up to and including 1,000,000.
df_large = pl.DataFrame({"num": list(range(1, 1000001))})
# Apply a transformation to the DataFrame using Polars' select method.
# The pl.col("num") references the 'num' column of the DataFrame.
# The '*' operator doubles each value in the 'num' column, effectively creating a new column with these transformed values.
# Polars executes columnar operations in its multi-threaded Rust core,
# which can significantly speed up work on large datasets.
parallel_result = df_large.select(pl.col("num") * 2)
# Print the transformed DataFrame 'parallel_result', which contains the doubled values of the 'num' column.
print("Parallel Execution:")
print(parallel_result)
Real-World Use Case: Financial Data Analysis
Let's simulate a real-world use case where we analyze a large dataset of stock prices to find trends and calculate moving averages.
# Step 1: Data Loading
# Load the stock price data from the given URL and read it into a DataFrame using Polars
df = pl.read_csv("https://infinitepy.s3.amazonaws.com/samples/stock_price.csv")
# Print the initial data to get a quick look at the first few rows
print("Initial Data:")
print(df.head()) # head() method shows the first 5 rows by default
# Step 2: Data Cleaning
# Remove rows with any null (missing) values and store the cleaned DataFrame
df_clean = df.drop_nulls()
# Print the cleaned data to inspect the first few rows after removing null values
print("\nCleaned Data:")
print(df_clean.head())
# Step 3: Calculate moving averages
# Calculate 7-day and 30-day moving averages for the 'Price' column
# The with_columns method is used to add new columns to the DataFrame
df_clean = df_clean.with_columns([
# Calculate 7-day moving average of 'Price' and name the resulting column '7_day_ma'
pl.col("Price").rolling_mean(window_size=7).alias("7_day_ma"),
# Calculate 30-day moving average of 'Price' and name the resulting column '30_day_ma'
pl.col("Price").rolling_mean(window_size=30).alias("30_day_ma")
])
# Print the first 40 rows of the data to see the moving averages
print("\nData with Moving Averages:")
print(df_clean.head(40)) # head(40) shows the first 40 rows
# Step 4: Find Crossovers
# Define the condition for crossovers: when the 7-day moving average crosses above the 30-day moving average
# Use the filter method to apply this condition
crossovers = df_clean.filter(
(pl.col("7_day_ma") > pl.col("30_day_ma")) & # Current condition where 7-day MA is greater than 30-day MA
(pl.col("7_day_ma").shift(1) <= pl.col("30_day_ma").shift(1)) # Previous condition where 7-day MA was less than or equal to 30-day MA
# shift(1) looks at the previous row; this helps to detect the crossover point
)
# Print the rows where crossovers are detected
print("\nCrossovers:")
print(crossovers)
Conclusion
Polars is a powerful DataFrame library that offers significant performance advantages over traditional libraries like pandas. Its ability to handle large datasets efficiently and its emphasis on speed and memory usage make it an excellent choice for data-intensive applications. Whether you're dealing with financial data, time series, or large-scale data analytics, Polars can help you achieve faster and more efficient results.
By incorporating Polars into your data analysis workflows, you can take full advantage of modern hardware capabilities and achieve better performance, giving you more time to focus on deriving insights from your data rather than worrying about execution speed.
Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.
InfinitePy Newsletter - Your source for Python learning and inspiration.