Five Emerging Data Science Tools You Should Be Using with Python
Python’s extensive ecosystem of data science tools is highly appealing to users. However, the sheer breadth and depth of these tools can sometimes lead to the best ones being overlooked.
Here’s an overview of some of the best newer or lesser-known data science projects available for Python. Some, like Polars, are gaining more attention but still deserve wider recognition. Others, like ConnectorX, remain hidden gems.
ConnectorX
Most data resides in databases, while computations are typically performed outside of them. Transferring data between databases and computation environments can be a bottleneck. ConnectorX addresses this by loading data from databases into common Python data-wrangling tools quickly and efficiently, minimizing the amount of work that needs to be done along the way.
ConnectorX, like Polars (discussed later), leverages a Rust library at its core. This enables optimizations such as parallel data loading with partitioning. For example, data from PostgreSQL can be loaded efficiently by specifying a partition column.
In addition to PostgreSQL, ConnectorX supports reading from MySQL/MariaDB, SQLite, Amazon Redshift, Microsoft SQL Server and Azure SQL, and Oracle. The extracted data can be directed into a Pandas or PyArrow DataFrame, or into Modin, Dask, or Polars via PyArrow.
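Here is a minimal sketch of a partitioned read with ConnectorX. The connection string, table, and column names are placeholders; the partition column mentioned above is what lets ConnectorX split the read across parallel workers.

```python
import connectorx as cx

# Placeholder credentials and database; substitute your own connection string.
conn = "postgresql://user:password@localhost:5432/sales"

# Load a table into a Pandas DataFrame, splitting the read into four
# parallel partitions on the numeric column "order_id".
df = cx.read_sql(
    conn,
    "SELECT * FROM orders",
    partition_on="order_id",
    partition_num=4,
    return_type="pandas",  # other options include "polars", "arrow", "modin", "dask"
)
```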
DuckDB
DuckDB is a powerful, lightweight, and speedy relational database similar to SQLite but optimized for OLAP (Online Analytical Processing) workloads. It features a columnar datastore, ideal for long-running analytical queries, and supports ACID transactions. Like SQLite, it operates as an in-process library, making it easy to set up in a Python environment with a simple pip install command.
DuckDB can ingest data directly from CSV, JSON, or Parquet formats, and it allows for efficient partitioning of databases into multiple physical files based on keys (e.g., by year and month). It supports standard SQL queries and offers built-in features like random sampling and window functions.
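A short sketch of querying a Parquet file directly from Python with DuckDB; the file name and column names are hypothetical.

```python
import duckdb

con = duckdb.connect("analytics.duckdb")  # or duckdb.connect() for an in-memory database

# Query a Parquet file directly; no explicit import step is needed.
top_pages = con.execute("""
    SELECT page, count(*) AS hits
    FROM 'events.parquet'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""").fetchdf()  # fetch the result as a Pandas DataFrame

# Random sampling is built into the SQL dialect.
sample = con.execute("SELECT * FROM 'events.parquet' USING SAMPLE 10%").fetchdf()
```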
Additionally, DuckDB includes useful extensions such as full-text search, Excel import/export, direct connections to SQLite and PostgreSQL, Parquet file export, and support for various geospatial data formats.
Optimus
Cleaning and preparing data for DataFrame-centric projects can be tedious. Optimus simplifies this process with a comprehensive toolkit for loading, exploring, cleansing, and writing data to various sources.
Optimus supports multiple data engines, including Pandas, Dask, CUDF (and Dask + CUDF), Vaex, and Spark. It can load and save data in formats like Arrow, Parquet, Excel, various databases, CSV, and JSON.
The API is similar to Pandas but adds .rows and .cols accessors for easier DataFrame manipulation, such as sorting, filtering, and altering data based on criteria. It also includes processors for common data types like email addresses and URLs.
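A rough sketch of the Optimus workflow against the Pandas engine, based on the project's documentation; the file and column names are placeholders, and exact method names may vary between releases given the project's release cadence.

```python
from optimus import Optimus

op = Optimus("pandas")              # other engines: "dask", "cudf", "spark", ...
df = op.load.csv("customers.csv")   # hypothetical input file

df = df.cols.upper("name")          # uppercase the values in one column
print(df.cols.names())              # list the column names
```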
However, Optimus is still under active development, and its last official release was in 2020, which might mean it’s not as up-to-date as other tools in your stack.
Polars
If you frequently work with DataFrames and find Pandas' performance limiting, Polars is an excellent alternative. This DataFrame library for Python offers a familiar syntax similar to Pandas but is built on a Rust library that maximizes hardware performance out of the box. It automatically utilizes performance-enhancing features like parallel processing and SIMD without requiring special syntax, making even simple operations like reading CSV files faster.
Polars supports both eager and lazy execution modes, allowing queries to be executed immediately or deferred until necessary. It also includes a streaming API for incremental query processing, although streaming support for many functions is still in development. Rust developers can extend Polars using pyo3.
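A brief sketch contrasting the two execution modes; the CSV file and its columns are hypothetical.

```python
import polars as pl

# Eager mode: the file is read and each operation runs immediately.
df = pl.read_csv("sales.csv")
eager_result = df.group_by("region").agg(pl.col("revenue").sum())

# Lazy mode: build a query plan, let Polars optimize it, then collect the result.
# (Older Polars releases spell group_by as groupby.)
lazy_result = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("revenue") > 0)
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect()
)
```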
Snakemake
Setting up data science workflows can be challenging and maintaining consistency even more so. Snakemake automates the creation of data analysis workflows, ensuring consistent results. Many data science projects use Snakemake for this reason. The more complex your workflow, the more you’ll benefit from Snakemake’s automation.
Snakemake workflows are defined similarly to GNU make, with rules specifying inputs, outputs, and commands. Workflow rules can be multi-threaded, and configuration data can come from JSON or YAML files. Functions can be defined within workflows to transform data, and actions taken at each step can be logged.
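A minimal Snakefile sketch showing a single rule; the paths, the cleaning script, and the config file are placeholders.

```python
# Workflow configuration can be read from YAML (or JSON).
configfile: "config.yaml"

rule clean_data:
    input:
        "data/raw/{sample}.csv"
    output:
        "data/clean/{sample}.csv"
    threads: 2                        # rules can be multi-threaded
    log:
        "logs/clean_{sample}.log"     # each step's output can be logged
    shell:
        "python scripts/clean.py {input} {output} --threads {threads} > {log} 2>&1"
```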
Snakemake jobs are portable and can be deployed in Kubernetes-managed environments or specific cloud platforms like Google Cloud Life Sciences or Tibanna on AWS. Workflows can be "frozen" to use a specific set of packages, and unit tests for successfully executed workflows can be automatically generated and stored. For long-term archiving, workflows can be saved as tarballs.