Rust: Python’s New Best Friend – A Data Scientist’s Journey

As Python continues to dominate data science, a quiet revolution is happening beneath the surface. Increasingly, Rust is powering our most critical Python tools—bringing unprecedented performance while maintaining the Python interface we know and love. This hybrid approach transforms our work as data scientists, enabling rapid development and production-grade performance.

My journey with Rust began six years ago as a distant curiosity. I heard the name in conference talks and saw it climbing GitHub’s language popularity charts, but it remained just another programming language on my “maybe someday” list.

That changed when Hugging Face released their tokenizers package—a blazingly fast NLP preprocessing library written in Rust with Python bindings. The performance gains were impossible to ignore: what took seconds in pure Python implementations was now completed in milliseconds.
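To give a feel for what this looks like in practice, here is a minimal sketch of calling the Rust core through the Python API; the model name and the toy corpus are placeholders of mine for illustration, not a benchmark from any specific project.

```python
# Minimal sketch of the Rust-backed tokenizers API from Python.
# The model name and texts are illustrative placeholders.
from tokenizers import Tokenizer

# Downloads a prebuilt tokenizer definition from the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

texts = ["Rust does the heavy lifting.", "Python keeps the interface friendly."] * 10_000

# encode_batch hands the whole batch to the Rust core, which processes it in parallel;
# this is where the seconds-to-milliseconds difference tends to come from.
encodings = tokenizer.encode_batch(texts)
print(len(encodings), encodings[0].tokens[:8])
```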

The Two Modes of Data Science Work

Data scientists typically work in one of two distinct modes:

  1. Tool Users: This is where we conduct experiments, build models, analyze data, and generate insights. In this mode, we’re focused on delivering business value directly through our analysis and models.
  2. Tool Makers: This is where we create infrastructure, utilities, and frameworks that enable more efficient work in the first mode. Here, we’re building the foundations that make our primary work sustainable and scalable.

About 95% of our time is spent in the first mode, where we see software as a means to an end, not the end itself. Our primary goal is delivering insights and solutions, with code being the vehicle rather than the destination.

This distinction explains why the Python+Rust combination is so powerful. When we’re in “tool user” mode, Python’s expressiveness and ecosystem are hard to beat. But Rust becomes an exceptional partner when we shift to “tool maker” mode—building components that must be fast, reliable, and resource-efficient.

The tools I’ve been gravitating toward—tokenizers, Ruff, Polars, and uv—exemplify this second mode. They’re not replacing Python for data science work but rather enhancing the infrastructure that makes that work possible and productive.

The Toolmaker’s Path vs. The One-Language Approach

Although I’m not a hardcore Rustacean, these developments inevitably drew me toward Rust. I’ve started learning the language and examining the source code of several high-quality projects. The Rust+Python combination feels like what I’d call “the toolmaker’s way”—build your performance-critical infrastructure in Rust, then expose a friendly Python API for widespread adoption.

This approach contrasts with Julia’s strategy. Julia aims to solve the two-language problem with a single language—an elegant, theoretically more cohesive approach. It’s gaining momentum, particularly in academic and research settings. The syntax feels natural for mathematical expressions, and the ability to go from high-level abstractions to low-level optimizations within the same language is appealing.

Yet, for now, the Python+Rust combination offers something uniquely practical: it leverages Python’s vast ecosystem while strategically replacing performance bottlenecks with Rust components. This hybrid approach doesn’t require wholesale migration to a new language—you can adopt it incrementally, one tool at a time.

The Two-Language Problem

Python has long faced what’s known as the “two-language problem”: we love Python for its readability and extensive ecosystem, but when performance matters, we’ve traditionally had to drop down to C, C++, or Fortran. This creates a significant cognitive load—maintaining expertise in two languages and managing their boundaries.

For decades, this was just the cost of doing business in the Python world:

  • NumPy, pandas, and SciPy? C and Fortran under the hood.
  • spaCy for NLP? C++ doing the heavy lifting.
  • Want to speed up your code? Learn Cython or write C extensions.
  • Building ML frameworks? Better get comfortable with C++.
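A toy example makes the trade-off concrete: the vectorized NumPy call below runs inside compiled C code, while the pure-Python loop pays interpreter overhead on every element. The array size is arbitrary and the exact numbers will vary on your machine; it is only meant to illustrate why we keep reaching for compiled extensions.

```python
# Toy comparison: summing squares with a pure-Python loop vs. NumPy's compiled core.
import time

import numpy as np

values = np.random.rand(5_000_000)

start = time.perf_counter()
total_py = sum(v * v for v in values)     # interpreted loop, one Python object per element
py_seconds = time.perf_counter() - start

start = time.perf_counter()
total_np = float(np.dot(values, values))  # a single call into compiled C/BLAS code
np_seconds = time.perf_counter() - start

print(f"pure Python: {py_seconds:.2f}s  NumPy: {np_seconds:.4f}s")
```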

Another approach is PyPy, an alternative implementation of Python with a JIT compiler that can significantly speed up pure Python code. While PyPy offers impressive performance gains, it has compatibility challenges with certain C extensions.

These approaches worked, but they brought their own challenges: memory management headaches, segmentation faults, and build system complexity.

Enter Rust: The Game-Changer

Rust is addressing this problem in a uniquely compelling way. It offers C-like performance with memory safety guarantees and a modern developer experience. The transformation in my workflow over the past few years has been remarkable:

My Rust-Powered Python Toolkit

Hugging Face Tokenizers: My first encounter with Rust-powered Python. The performance difference was so dramatic it made me take notice of what Rust could offer.

Ruff: The Python linter that changed everything. Before Ruff, I was a huge fan of Black. What made me switch to Ruff wasn’t just its speed (though it is remarkably fast) but how it’s like Black on steroids. I love its extensive configuration options while still maintaining sensible defaults. Ruff’s comprehensive approach combines linting, formatting, and code quality checks in one tool, making it an essential part of my workflow.

Polars: A pandas-like DataFrame library that handles larger-than-memory datasets with ease. Operations that would bring pandas to its knees complete in seconds with Polars.
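As a sketch of the lazy, out-of-core style I mean (the file name and columns below are made up for illustration):

```python
# Hypothetical lazy query over a CSV that may not fit in memory
# ("events.csv" and the column names are placeholders).
import polars as pl

query = (
    pl.scan_csv("events.csv")                     # lazy scan: nothing is read yet
    .filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.col("latency_ms").mean().alias("avg_latency_ms"))
)

# streaming=True asks Polars to process the data in chunks
# instead of materializing the whole file at once.
result = query.collect(streaming=True)
print(result.head())
```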

uv: My recent switch from poetry+pyenv to uv has been transformative. What I appreciate most isn’t just the speed (though installations that took minutes now complete in seconds) but having one cohesive tool for dependency management. It’s not about those saved seconds—they’re nice but not critical. What matters is having a clean, reliable, and coherent tool for managing my Python environments. uv delivers this with sensible defaults and straightforward usage patterns, making dependency management feel like less of a chore.

Astral (creators of Ruff and uv) is now working on a new Rust-based static type checker for Python. This development is particularly interesting as type checking becomes increasingly important in the Python ecosystem. Meta’s Pyre (written in OCaml) offers excellent performance and precision, while Google’s pytype provides another robust option.

Astral’s approach focuses on minimizing false positives on untyped code, making it easier for projects to adopt typing gradually. This addresses key limitations of existing solutions, such as mypy’s performance issues and pyright’s dependency on Node.js.

Rediscovering the Joy of Coding with Zed

Perhaps most surprisingly, my editor itself is now Rust-powered. After years in PyCharm (which followed a long stint with Emacs), I’ve switched to Zed.

PyCharm served me well, but it became increasingly heavy and complex over time. While VS Code and PyCharm are excellent IDEs with comprehensive features, they can consume significant system resources. Despite having visual Git tools available, I often returned to the terminal for version control operations.

Zed brought back that “hacky feeling” I remembered from my Emacs days but with modern sensibilities and incredible performance. It’s a proper editor rather than a full IDE, which aligns perfectly with my workflow as someone who still appreciates command-line tools. The responsiveness creates an entirely different relationship with the code—there’s no waiting, just coding.

Why Good Developer Tools Matter for Data Scientists

The era when data scientists could get away with writing spaghetti-code POCs and MVPs is over—if it was ever truly acceptable. Structuring your code and maintaining a clean, reproducible development environment make your projects manageable and sustainable. We write code not just for computers but for our peers and our future selves!

Many of us in data science don’t come from computer science backgrounds. This lack of formal software engineering training is often used as an excuse: “OK, my code is messy, but it gets the job done.” What often distinguishes good data scientists from exceptional ones is their maturity in recognizing the need to follow software engineering best practices. The willingness to learn and apply these practices reflects a more profound understanding that sustainability matters as much as immediate results.

OOP principles, type hints, code formatters, and dependency management tools aren’t just software engineering niceties—they’re vital for sustainable data science work. This is why I’ve become increasingly invested in the quality of my development tools.

Looking Forward

We’re still in the early days of this transformation. Projects like Candle (Hugging Face’s Rust ML framework) suggest a future where more of our computational stack might benefit from Rust’s performance and safety guarantees.

For data scientists like myself, these tools provide immediate productivity boosts without requiring a complete reorientation of our workflow. Zed exemplifies this benefit—it’s not just a Rust showcase but a tool that makes me more productive while rekindling that “hacky feeling” I missed from my Emacs days.

If you haven’t explored these Rust-powered Python tools yet, I highly recommend giving them a try. Your development experience might never be the same again.
