Un-pickle-able
Grumpy cat (yep, AI-generated)


The Python support in Spark is impressive, yet sooner or later it becomes clear that Spark itself is not written in Python. Sometimes this reality comes to the foreground as unpicklable stuff that refuses to be distributed across the cluster.

What's the story?

More or less, it goes something like this:

from a_library_with_binary_bindings import SpecialClass 

customization = {
   "parameter": "value"
}

client = SpecialClass(**customization)

def my_function(row):
    client.do_amazing_things_on(row)

# ... more code ...

df.writeStream.outputMode("append").foreach(my_function).start()        

Boom!

TypeError: cannot pickle 'XXXXXX' object

Why does it happen?

The fact is that "my_function" must run on the workers, so Spark needs to distribute it and will attempt to serialize it for distribution. The function is written in Python, and it is still Python that has to do the job. However, due to the binary bindings, the XXXXXX object needs some libfoo.so.1.2.3 that the pickle library can't serialize.
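
You can reproduce the same failure outside Spark with nothing but the standard library. Here is a minimal sketch, where FakeClient is a hypothetical stand-in for SpecialClass: any object holding a low-level resource (here a threading lock) fails to pickle in the same way.

import pickle
import threading

# Hypothetical stand-in for SpecialClass: it holds a low-level
# resource (a lock) that the pickle library cannot serialize.
class FakeClient:
    def __init__(self):
        self._lock = threading.Lock()

client = FakeClient()

pickle.dumps(client)  # TypeError: cannot pickle '_thread.lock' object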


Fortunately, the a_library_with_binary_bindings library is supposed to be installed on the worker nodes too, so we can change the code a little and ship the Python code that builds a ready-to-use client instance instead of the client itself. Something like this:

from a_library_with_binary_bindings import SpecialClass 

customization = {
   "parameter": "value"
}

def my_function(row):
    client = SpecialClass(**customization)
    client.do_amazing_things_on(row)

df.writeStream.outputMode("append").foreach(my_function).start()        

This works because my_function only contains the Python instructions to start a client: at the time Spark pickles the function, no client instance exists yet.

Unfortunately, you will end up starting a new client for every my_function call, that is, once per row.
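
To see the cost concretely, here is a stand-alone sketch with a hypothetical CountingClient in place of SpecialClass:

class CountingClient:
    created = 0  # class-level counter of instantiations

    def __init__(self):
        CountingClient.created += 1

    def do_amazing_things_on(self, row):
        pass

def my_function(row):
    client = CountingClient()  # a brand-new client for every row
    client.do_amazing_things_on(row)

for row in range(1000):
    my_function(row)

print(CountingClient.created)  # 1000: one client per call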

Let's try this other solution:

from a_library_with_binary_bindings import SpecialClass 

customization = {
   "parameter": "value"
}

class MyClassContainer:

    def __init__(self, customization):
        self._customization = customization
        self._client = None  # no client yet, so pickling works

    def client(self):
        if self._client is None:
            self._client = SpecialClass(**self._customization)
        return self._client

    def __call__(self, row):
        self.client().do_amazing_things_on(row)

my_function = MyClassContainer(customization)

df.writeStream.outputMode("append").foreach(my_function).start()        

Two paramount facts to consider:

  1. The client does not start until it is required (lazy initialization), and once started it is recycled: it starts just once. By the time the client is required, the code is already on the workers.
  2. Although my_function is an instance of a class, you can use it as a function, because of the __call__ method (see the sketch below).
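
Here is a minimal stand-alone sketch of the same pattern, with a hypothetical Greeter class: the instance is callable like a function, and it pickles fine as long as it only carries plain attributes.

import pickle

class Greeter:
    def __init__(self, name):
        self.name = name  # plain data: safe to pickle

    def __call__(self, row):
        return f"hello {self.name}: {row}"

greet = Greeter("world")
print(greet("row-1"))                # the instance behaves like a function
print(len(pickle.dumps(greet)) > 0)  # and it serializes without complaint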

The unpicklable side of Python

For the sake of completeness, even pure-Python stuff can refuse to be pickled:

  1. Sockets: These are used for network communications and cannot be pickled because they represent a live connection to a network resource.
  2. File handles: These are used to read and write files and cannot be pickled because they represent an open file on the system.
  3. Database connections: These are used to interact with a database and cannot be pickled because they represent a live connection to a database.

The general rule of thumb is that “logical” objects can be pickled, but “resource” objects (sockets, files, locks, connections) can’t, because it makes no sense to persist or clone them.
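
A minimal sketch of the rule of thumb:

import pickle

pickle.dumps({"logical": [1, 2, 3]})   # fine: plain data round-trips

with open("example.txt", "w") as f:
    pickle.dumps(f)  # TypeError: cannot pickle '_io.TextIOWrapper' object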

Further reading:

On foreach in PySpark:

pyspark.sql.streaming.DataStreamWriter.foreach — PySpark 3.5.0 documentation (apache.org)

pyspark.sql.DataFrame.foreach — PySpark 3.5.0 documentation (apache.org)

The pickle library:

pickle — Python object serialization — Python 3.12.1 documentation

__call__()

Python's .__call__() Method: Creating Callable Instances – Real Python


The reader understands that this article may contain errors and may not be up to date for Spark releases newer than 3.0. The reader understands that they will use it at their own risk, and there are no express or implied warranties for data or hair losses. The reader should also understand that, to achieve better performance, they should consider Scala as an alternative to Python.




