Europython 2018 - Day 3
Today was the first conference day proper, after two days devoted to trainings. As there was some rush at the registration desk, I started by volunteering some time and helped hand out conference T-shirts. I did not attend the keynote; I am no big fan of such speeches anyway.
Python in scientific computing: what works and what doesn't
Michele Simionato is a software developer for the Global Earthquake Model foundation, where he supports the scientists doing, well, earthquake research. He works in a typical scientific environment, meaning distributed computing with complex processing under CPU, memory and data-transfer constraints.
In that context, even the most robust packages can run into sizing issues. For example, he has recently hit problems with numpy structured arrays. He uses h5py to access HDF5 files, and the migration from HDF5 1.8 to 1.10 was a "debacle".
He uses geospatial packages, which can sometimes exhibit different behaviors on different platforms.
Communication between machines is handled by celery/rabbitmq, in an atypical use case: long messages instead of many small ones. He has experienced some issues and has started experimenting with zmq (his "plan B"), which gives better results. He will not touch the parts of the existing codebase that do not experience any problem ("if it's not broken, why fix it?").
He has no dask experience yet, but is open to experimenting.
He does not use the following acceleration techniques:
- C extensions: because of the gcc dependency, and the lack of C skills in his organisation. Most of his use cases for C extensions are covered by a clever use of numpy.
- Cython: more usable for him than the C extensions, but the speedup he achieves is not impressive enough.
- numba: it delivers a speedup in a parallelised environment, but his code is already parallelised, so there is little benefit and a risk of oversubscription.
- Intel Python distribution: he fears a vendor lock-in, and his tests showed a 20% slowdown.
His conclusion is "Algorithmic solutions are better than technical solutions".
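The "clever use of numpy" replacing C extensions usually means rewriting Python-level loops as whole-array operations. A minimal sketch of the idea (the function names and the toy computation are mine, not from the talk):

```python
import numpy as np

def distances_loop(points, origin):
    # Python-level loop: one interpreter iteration per point, slow at scale.
    return [((p[0] - origin[0]) ** 2 + (p[1] - origin[1]) ** 2) ** 0.5
            for p in points]

def distances_vectorized(points, origin):
    # The same computation as whole-array operations: the loop runs inside
    # numpy's C code, with no extension module to build or maintain.
    diff = points - origin
    return np.sqrt((diff ** 2).sum(axis=1))
```

For large arrays the vectorized version is typically orders of magnitude faster, without introducing a compiler dependency.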
Reliability in distributed systems
Jiri Benes is a technical leader at kiwi.com. They use Python heavily in production, and he explained some techniques used to improve the resilience of their services:
- use of time-outs and circuit breakers in the code, by using specific function decorators
- "let it crash" philosophy: no over-engineering to prevent a crash, and no hiding of errors.
- development of logging, monitoring (including Application Performance Monitoring) and alerting systems
- definition of escalation procedures in case of alerts or incidents.
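A circuit breaker as a function decorator can be sketched in a few lines of pure Python (a simplified illustration of the pattern, not kiwi.com's actual implementation):

```python
import functools
import time

class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the wrapped function."""

def circuit_breaker(max_failures=3, reset_after=30.0):
    """After max_failures consecutive errors, fail fast for reset_after seconds."""
    def decorator(func):
        state = {"failures": 0, "opened_at": None}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if state["opened_at"] is not None:
                if time.monotonic() - state["opened_at"] < reset_after:
                    # Fail fast instead of hammering a service that is down.
                    raise CircuitOpen(f"{func.__name__} disabled after repeated failures")
                # Reset period elapsed: half-open, allow one trial call.
                state["opened_at"] = None
                state["failures"] = 0
            try:
                result = func(*args, **kwargs)
            except Exception:
                state["failures"] += 1
                if state["failures"] >= max_failures:
                    state["opened_at"] = time.monotonic()
                raise
            state["failures"] = 0  # any success closes the circuit again
            return result
        return wrapper
    return decorator
```

A service call would then simply be decorated with `@circuit_breaker(max_failures=2, reset_after=60)`, which fits the "let it crash" philosophy: errors still propagate, they are just not retried blindly.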
Using Pandas and Dask to work with large columnar datasets in Apache Parquet
Peter Hoffmann is a senior developer for Blue Yonder, a company providing data-science services (such as sales forecasting) to mass retailers.
Blue Yonder uses a typical Python ecosystem (Jupyter, Dask, Pandas) in a distributed computing environment (Apache Aurora, Apache Mesos). The exchange of data between different ecosystems is difficult, so they rely on Parquet files to exchange data with the "external" world, accessed through the pyarrow and fastparquet libraries. This approach has replaced native access to their databases and has increased the speed of data transfers.
Fuzzy Matching - Smart Way of Finding Similar Names Using Fuzzywuzzy
Cheuk Ting Ho is a data scientist working for the Hotelbeds group. She has to work with company names, and needs to deduplicate (frequently badly spelled) entries.
She uses the "fuzzywuzzy" library, which calculates the Levenshtein distance between words. The Levenshtein distance is "the number of deletions, insertions and substitutions" necessary to transform one word into another. The shorter the distance between two words, the greater the similarity. The library includes several variations, including the possibility to work on sets of several words through tokenization.
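The distance itself is a short dynamic-programming computation. A pure-Python sketch of it (fuzzywuzzy delegates this to an optimized implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Number of deletions, insertions and substitutions turning a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (the previous row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# The closer to 0, the more similar the names:
levenshtein("Hotelbeds", "Hotel Beds")  # small distance despite the spelling difference
```

fuzzywuzzy turns this raw distance into a 0-100 similarity score, which is easier to threshold for deduplication.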
I use this library for word comparison (Hotelbeds is not the only company to have issues with company names), and the presentation gave me some ideas for additional use cases.
What's New in Python 3.7
Stephane Wirtel is an independent Python developer, expert and contributor. Python 3.7 was released last month, and he explained the most important changes in the new version:
- breakpoints: an elegant new interface to the debuggers. Setting a breakpoint is now done with the built-in breakpoint() function, which is no longer debugger-specific.
- dataclasses: a class decorator that allows the definition of complex data structures in an elegant way, with excellent performance (for old blokes like me, it looks very much like a C struct).
- new time functions with nanosecond resolution, such as time.time_ns(). This had become necessary because of the speed increase of the systems.
Migrating a mission-critical service to asyncio
Hrafn Eiriksson works as a software developer for Smarkets, a fintech company.
Asyncio is a relatively new (since 3.4) part of Python that enables asynchronous programming. This kind of service was already provided by third-party libraries, but it is now integrated into the language.
Hrafn has migrated a mission-critical micro-service to asyncio. His experience is that it is important to trace all dependencies before starting the migration. Debugging can be difficult, especially when synchronous code is called within an asynchronous function. But the final result is worth the effort: the large performance improvement allowed them to remove 80% of the servers, with the associated reduction in cost!
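The pitfall he mentions - a blocking, synchronous call inside a coroutine stalls the whole event loop - is typically handled by pushing the blocking call onto a thread pool. A minimal sketch (the function names are made up, not from the talk):

```python
import asyncio
import time

def legacy_lookup(user_id: int) -> str:
    # Hypothetical blocking function from the pre-asyncio codebase.
    time.sleep(0.1)  # stands in for a blocking network or database call
    return f"user-{user_id}"

async def handle_request(user_id: int) -> str:
    loop = asyncio.get_running_loop()
    # Calling legacy_lookup() directly here would block every other coroutine;
    # run_in_executor moves it to a worker thread so the loop keeps running.
    return await loop.run_in_executor(None, legacy_lookup, user_id)

async def main():
    # Serve several requests concurrently despite the blocking dependency.
    return await asyncio.gather(*(handle_request(i) for i in range(3)))

print(asyncio.run(main()))
```

With the blocking calls isolated this way, a service can be migrated incrementally, one dependency at a time, instead of rewriting everything at once.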