Manipulating Parquet Files with Python Scripts
Parquet has become a popular choice for data storage and processing in big data ecosystems due to its efficiency and compatibility with a wide range of tools and platforms. Working with Parquet files can feel a bit challenging at first, though. This article shares practical tools and tips to help you handle Parquet files, address common use cases, and boost your productivity. Specifically, it includes small Python scripts that demonstrate how to work with Parquet files, which is particularly useful for creating test files for both unit and integration tests.
View Parquet Files
Parquet stores data in a binary format instead of plain text, which enables better compression and faster read/write operations compared to text-based formats like CSV or JSON. However, this also means that viewing and editing Parquet files is more challenging than working with formats like CSV and JSON. Fortunately, there are specific tools available that can help you view and work with Parquet files.
In VSCode
If you use Microsoft's lightweight yet powerful code editor Visual Studio Code (VSCode), the good news is that there are a few extensions available for viewing Parquet files. Image 1 shows some of the available options; my personal favourite is parquet-viewer, which displays Parquet files as JSON. So far, it has handled quite complex schema structures without trouble.
In IntelliJ IDEA
There are some plugins that make it easy to inspect and analyse Parquet files without leaving IntelliJ IDEA, one of the most popular and favoured integrated development environments (IDEs) among developers, particularly for Java development.
You can install the “Avro and Parquet Viewer” plugin or the “Big Data File Viewer” plugin to view Parquet files in IntelliJ. “Big Data File Viewer” depends on the “Big Data Tools Core” plugin, which is installed automatically alongside it. Both plugins also support other formats, such as Avro.
In DuckDB
DuckDB is another favourite choice particularly by data analysts and data scientists due to its ease of use, efficiency in handling large datasets, and seamless integration with popular data processing libraries like Pandas in Python and dplyr in R. DuckDB is particularly useful for working with Parquet files due to its native support for this file format. It allows users to directly query Parquet files using SQL without needing to import data into a traditional database system first.
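As a rough illustration, here is a minimal sketch using the duckdb Python package (install it with pip install duckdb); the file name your_file.parquet is just a placeholder for whichever file you want to inspect:
import duckdb
# Query the Parquet file in place with SQL; no import or load step is required
relation = duckdb.sql("SELECT * FROM 'your_file.parquet' LIMIT 5")
relation.show()
# The same result can be pulled into a Pandas DataFrame if needed
df = relation.df()
print(df)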
Editing and Manipulating Parquet Files
We’re now about to explore the most exciting part of our topic — editing and manipulating Parquet files. In this section, we’ll explore a few basic scenarios to create and modify Parquet files. These examples should cover most common use cases, but feel free to adjust them to meet your specific needs.
To manipulate Parquet files, we’ll use Python and tap into the following powerful libraries designed specifically for data manipulation and analysis.
If you haven’t already installed these libraries, you can do so using Python’s package installer, pip, with the following commands:
pip install pandas
pip install pyarrow
Scenario 1: Transitioning from CSV to Parquet
As a developer familiar with CSV and other text-based formats, my goal now is to replicate the same data structure in a Parquet file. Essentially, I want to create a Parquet file that maintains the same columns and data as its CSV counterpart.
In this specific example, I’ll start with a very modest CSV file that includes 4 columns and just 2 rows. Notably, the final column, labeled ‘articles’, contains a list of strings.
Input CSV File (example.csv):
group_id,group_name,quality_score,articles
g1234,Cosmetics,0.99,"['1234-6756', '1234-4675', '3243-6473']"
g5678,Grocery,0.85,"['3333-6756', '3333-4675', '4343-6473']"
Below is the Python script designed to convert the CSV file into a Parquet file containing the same data. It is assumed that example.csv is located in the same directory as this script. The script will also save the generated Parquet file output.parquet in the same location.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Read the CSV file; the converter parses the quoted list string in 'articles' into a Python list
df = pd.read_csv("example.csv", converters={'articles': eval})
# Define schema for Parquet file to ensure data types are correct
schema = pa.schema([
    ('group_id', pa.string()),
    ('group_name', pa.string()),
    ('quality_score', pa.float64()),
    ('articles', pa.list_(pa.string()))
])
# Convert DataFrame to Table with schema
table = pa.Table.from_pandas(df, schema=schema)
# Write to a Parquet file
pq.write_table(table, 'output.parquet')
print("Parquet file created successfully.")
Here is a view of the Parquet file displayed as JSON (generated by the parquet-viewer extension in VSCode).
Output JSON equivalent of Parquet File (output.parquet):
{"group_id":"g1234","group_name":"Cosmetics","quality_score":0.99,"articles":["1234-6756","1234-4675","3243-6473"]}
{"group_id":"g5678","group_name":"Grocery","quality_score":0.85,"articles":["3333-6756","3333-4675","4343-6473"]}
Scenario 2: Managing Large Parquet Files for Testing
Parquet files in production environments are typically large because they’re designed to hold extensive datasets. This creates a challenge for developers who need to conduct unit tests that involve parsing and verifying specific data items within these files. Often, a unit test might only require a few records, necessitating the creation of smaller, more manageable Parquet files. However, crafting these files from scratch can be tricky — they might not fully replicate the schema or structure of the production Parquet files.
Script Overview
Here’s a Python script designed to handle this scenario. It reads a large Parquet file named large-parquet.parquet and splits it into two smaller files for more focused testing. The script extracts the first three records into one file and the following four records into another, facilitating targeted tests without the need for handling the full dataset.
This approach not only simplifies testing but also ensures that developers can work with data structures that closely mimic those in the production environment, albeit on a smaller scale.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Path to the large Parquet file
large_parquet_file_path = 'large-parquet.parquet'
# Read the large Parquet file
df = pd.read_parquet(large_parquet_file_path)
# Extract the first three rows for the first small file
df_first_three = df.head(3)
# Extract the next four rows (positions 3 to 6) for the second small file
df_next_four = df.iloc[3:7]
# Read the schema from the large Parquet file
original_table = pq.read_table(large_parquet_file_path)
schema = original_table.schema
# Convert DataFrame to PyArrow Table with the original schema
table_first_three = pa.Table.from_pandas(df_first_three, schema=schema)
table_next_four = pa.Table.from_pandas(df_next_four, schema=schema)
# Write the tables to new Parquet files
small_parquet_file_1_path = 'small-file-1.parquet'
small_parquet_file_2_path = 'small-file-2.parquet'
pq.write_table(table_first_three, small_parquet_file_1_path)
pq.write_table(table_next_four, small_parquet_file_2_path)
print(f"First small Parquet file created at {small_parquet_file_1_path}")
print(f"Second small Parquet file created at {small_parquet_file_2_path}")
Scenario 3: Modifying Data in Parquet Files for Compliance and Testing
After segmenting large production Parquet files into smaller chunks, we often encounter a critical challenge: the original data might contain sensitive or private information that shouldn’t be used directly in test environments due to privacy regulations like GDPR. Additionally, specific testing scenarios might require modifications to certain data fields to ensure the test’s effectiveness.
To address these issues, we may need to alter the data within the Parquet files to fit our specific needs. For instance, anonymising personal data or adjusting values to meet certain test conditions. The upcoming Python script demonstrates how to amend the data within a Parquet file. Specifically, it will modify the second entry in the ‘articles’ list of the first row from an ‘output.parquet’ file, showcasing how targeted data manipulations can be achieved for testing purposes.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Load the Parquet file
df = pd.read_parquet('output.parquet')
# Print the original DataFrame
print("Original DataFrame:")
print(df)
# Modify the second article in the first row's articles list
if len(df.at[0, 'articles']) >= 2:
    df.at[0, 'articles'][1] = '3577-9898'
# Print the modified DataFrame
print("Modified DataFrame:")
print(df)
# Define the schema again if you want to ensure it remains consistent
schema = pa.schema([
    ('group_id', pa.string()),
    ('group_name', pa.string()),
    ('quality_score', pa.float64()),
    ('articles', pa.list_(pa.string()))
])
# Convert DataFrame to Table with the schema
table = pa.Table.from_pandas(df, schema=schema)
# Write back to Parquet
pq.write_table(table, 'modified_output.parquet')
print("Modification complete and file saved.")
Conclusion
In this article, we’ve explored how to work with Parquet files using Python, highlighting practical tools and techniques that can make handling these files easier and more efficient. By integrating Python with powerful libraries like Pandas and PyArrow, we can not only view but also modify Parquet files effectively.
We covered a variety of scenarios: viewing Parquet files in popular IDEs like VSCode and IntelliJ, slicing large datasets for testing, and adjusting data to meet privacy regulations. These examples cover the basics and can inspire more advanced data management strategies.
Using Python for these tasks allows you to handle large amounts of data smoothly, making your projects more manageable. It also prepares you to tackle complex data challenges confidently, enhancing both the scalability and performance of your data solutions.
As you continue working with big data, keep exploring and using Python’s extensive toolkit. This approach will surely enrich your skills and lead to better, more efficient data handling strategies in your projects.
Naveed Chaudhry
You can also follow me on Medium: https://medium.com/@chaudhryn