Europython 2018 - Day 2


This was the second and last day of training. I managed to attend 2 sessions (no more schedule misunderstandings).

Technologies to master Parallelism in Python

This training was sponsored by Intel, which meant that some of the tools presented are developed by them. Shailen Sobhee introduced us to some techniques to execute Python code in parallel. We started with Multithreading, then Multiprocessing.

That's where my problems started: I have a (brand new) Windows laptop, which I have not yet fully configured. The issue is that multiprocessing starts its child processes with "fork" by default in the "U***X" world, while Windows uses "spawn". The behavior is different, which means that the exercises used during the course did not work on my laptop. I need at least to install a virtual Linux machine before my next conference (EuroScipy in Trento, end of August).
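To make the fork/spawn difference concrete, here is a minimal sketch (not one of the course exercises) of a script written so that it also runs under "spawn". The names square, run_pool and main are made up for illustration. The key points are that the work function lives at module level and that the pool is only started behind the `__main__` guard.

```python
import multiprocessing as mp


def square(x):
    """Work function, defined at module level so child processes can import it."""
    return x * x


def run_pool(start_method, values):
    """Map `square` over `values` in 2 worker processes using the given start method."""
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


def main():
    # "spawn" is available on every platform; "fork" only on POSIX systems.
    print("default start method:", mp.get_start_method())
    print("available start methods:", mp.get_all_start_methods())
    print(run_pool("spawn", range(5)))


# Under "spawn", each child re-imports this module. Without this guard,
# every child would try to start its own pool while being imported.
if __name__ == "__main__":
    main()
```

Save it as a file and run it as a script. Code that relies on "fork" semantics (for example, workers defined interactively or pools created at import time) is the kind that tends to break on Windows.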

After that, we went through joblib and dask. We also learned that popular Python libraries, like numpy, rely on C libraries which implement their own flavors of parallelism. Shailen also discussed the use of SIMD (Single Instruction Multiple Data) instructions, which differ between architectures, and even between generations of processors within the same family.

The end result is that you can have several layers of parallelism, visible or not, in a program. The effect is multiplicative: if your main program generates 4 processes, which themselves generate 4 threads each, you end up with 16 parallel executions. The more layers you add, the more havoc you can wreak on your poor little machine, and the worse the performance of the applications. Shailen mentioned 2 techniques which can manage this explosion of tasks: smp (Static Multi-Processing) and TBB (Threading Building Blocks).
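The arithmetic above can be sketched in a few lines. This is only an illustration of the multiplication and of a naive thread budget. It is not how smp or TBB work internally, and the function names are hypothetical.

```python
import os


def total_workers(*layers):
    """Multiply the worker counts of each nested layer of parallelism."""
    total = 1
    for count in layers:
        total *= count
    return total


def oversubscription(workers, cores):
    """Ratio of parallel workers to cores; above 1.0 they compete for CPU time."""
    return workers / cores


def thread_budget(cores, processes):
    """Naive budget: split the cores evenly between processes, at least 1 thread each."""
    return max(1, cores // processes)


processes, threads_per_process = 4, 4
workers = total_workers(processes, threads_per_process)
cores = os.cpu_count() or 1  # os.cpu_count() can return None
print(f"{workers} workers on {cores} cores, ratio {oversubscription(workers, cores):.1f}")
print(f"threads per process that would fit: {thread_budget(cores, processes)}")
```

With 4 processes of 4 threads on a 4-core laptop, the ratio is 4.0, which is the kind of overload described above.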

Takeaways:

  1. In the wonderful world of open software, all operating systems are equal, but some are more equal than others. Although Microsoft is taking care to support Python, you have many libraries that are natively developed on U***X systems, and supported on Windows as an afterthought.
  2. Parallelising code is a wonderful technique, which should not be abused. Think twice before parallelising: your code (including its supporting libraries) may already be parallel to some degree.

Fast native code with Cython

Cython is a compiler that generates C code from Python code. This lets you bypass the Python interpreter in order to boost performance. The training was led by Stefan Behnel, one of the core developers of Cython.

In order to compile the generated C code, you need a C compiler (dduuuhh). Cython requires gcc (native on U***X platforms) or, on Windows, the Visual C++ compiler. As my laptop is brand new, I have not yet installed this bloody compiler, so I could not do any of the exercises.

You can use Cython at different levels of complexity. You can start by calling functions of your favorite C library, and then move less and less data back and forth between the (distinct) Python and C environments. The fewer the movements, the better the performance, but the less Pythonic your code. Nevertheless, the performance gains can be tremendous. You can, of course, combine Cython with parallelising techniques (see the previous section...).

Takeaways:

  1. In the wonderful world of open software, all operating systems are equal, but some are more equal than others. Although Microsoft is taking care to support Python, you have many libraries that are natively developed on U***X systems, and supported on Windows as an afterthought. Is it my age? I fear I am repeating myself here.
  2. No free lunch. The more you "distort" the original Python code, the more performance you can get.

As you have read, my 2 sessions were about increasing Python performance. Python's learning curve is very smooth, thanks to, among other things, "dynamic typing", but the language is also very rich. This makes it very popular and successful. This also has a cost, because Python is inherently sequential (due to the infamous GIL). There are about 3 techniques to improve the performance of programs:

  1. Wrap existing highly optimised C or Fortran libraries in a well-thought-out Python wrapper. The best example is probably numpy. The libraries handle the heavy lifting, are not interpreted and can include some kind of parallelism.
  2. Go parallel: there are loads of libraries which enable some kind of parallelism. We have seen some of them this morning.
  3. Bypass the interpreter and the GIL. This is typically what Cython does.
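The first technique can be illustrated without installing anything. In this sketch, the built-in sum(), which is implemented in C in CPython, stands in for a compiled library such as numpy (an assumption made to keep the example dependency-free). The function names are made up for illustration.

```python
import timeit


def python_loop_sum(n):
    """Sum 0..n-1 with an explicit loop: every iteration runs in the interpreter."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Same result, but the loop runs inside the C implementation of sum()."""
    return sum(range(n))


n = 100_000
# Both approaches must agree before comparing their speed.
assert python_loop_sum(n) == builtin_sum(n)

loop_time = timeit.timeit(lambda: python_loop_sum(n), number=20)
builtin_time = timeit.timeit(lambda: builtin_sum(n), number=20)
print(f"explicit loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

On CPython the built-in version is usually noticeably faster, but the exact figures depend on the machine and the Python version, so measure before drawing conclusions.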

All those techniques can interact together, and, before writing code, you'd better think about the tools you are going to use, the level of optimisation you want, and also about maintainability and portability.

Remember, some environments are more equal than others, and there is no free lunch.

Arif Thayal

Lead Data Engineer | Solution Architect | AI & MLOps

6 years ago

For Python, this underlines the phrase "with great power comes great responsibility". I liked the power of parallelism and Cython, but once you start distorting Python, it's your responsibility to make it run. Unlike native Python, those wrappers/libraries are difficult for a layman stepping into the data science world. :)
