Profiling Python the easy way
Python is an interpreted language; Rust is a compiled language. You might've even heard that interpreted code always runs slower than compiled code.
So, what is the problem with slow Python code?
Slow Python code is a problem for every user of a Python script, every time that script is used to do some computation. The interpreter works through the source code every single time the script is called. Yikes!
I wrote the code in minutes! Now it only takes a day or so to run.
Python has arguably become the de facto standard in academia and many AI frameworks, despite its lack of performance: the language allows for both rapid prototyping (testing ideas) and more permanent solutions (data analysis pipelines).
There's nothing more permanent than a temporary solution.
When your computational problem is finally solved, it's time to take another look at the performance of your Python code. After all, your computer could already help you get the next thing done instead of drawing spinning wheels or erratic progress bars! But how do I know where to start?
A simple Python example
I've created a simple and quite arbitrary example script named example.py that solves a hypothetical problem:
This code generates a list of the squares of the numbers from 0 up to (but not including) 100,000,000. To provide a visual indication of progress, it prints the current number in the terminal.
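A minimal sketch of what example.py could look like, assuming a plain append-and-print loop (the actual listing may differ in details):

```python
# example.py - a minimal sketch, assuming a plain append-and-print loop
squares = []
for i in range(100_000_000):
    squares.append(i**2)  # grow the list one element at a time
    print(i)              # progress indicator: print the current number
```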
The visible result is a very long list of elements printed in the terminal:
0, 1, 2, ..., 99999999
But it takes about a minute to complete... not so great.
cProfile analyzes Python scripts
Let's have a look at where we spend most of the time when solving our problem: we can use cProfile, Python's built-in profiler, to analyze our script and give us a meaningful summary. The profiler can be run as a module (`-m cProfile`) and sort the summary by cumulative time (`-s "cumtime"`) after the script (`example.py`) has been interpreted.
python -m cProfile -s "cumtime" example.py
It's often easier to write the output to a text file instead of the terminal, so we change our command by redirecting the terminal output to a text file:
python -m cProfile -s "cumtime" example.py > terminal.txt 2>&1
Run the command and wait for the terminal prompt to return. Now let's look at the terminal output by opening the text file terminal.txt in an editor:
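Abridged, and with only the timing values quoted below taken from the run, the summary table looks roughly like this:

```
         200000004 function calls in 57.298 seconds

   Ordered by: cumulative time

   ncalls     tottime  percall  cumtime  percall  filename:lineno(function)
   ...
   100000000   14.884    0.000   14.884    0.000  {built-in method builtins.print}
   100000000    2.499    0.000    2.499    0.000  {method 'append' of 'list' objects}
   ...
```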
We see the progress of our calculation (the first line) plus the profiling summary. In total, we spend 57.298 seconds executing our script. If we look at the summary table, we can see that we spend 14.884 seconds (25%) in the {built-in method builtins.print} function and 2.499 seconds (4%) in the {method 'append' of 'list' objects}. It's easy to find the corresponding lines in the source code: line 4 and line 3, respectively.
Notice that the actual calculation of the squared numbers isn't listed separately, but it must account for roughly the difference between the total time and the cumulative times listed (39.955 seconds).
Let's first work with the profiling table one line at a time, from top to bottom.
Improvement 1: print less (-50%)
Keeping track of the progress is important (the script might be stuck in a loop and never end), but it seems to be computationally expensive (take a lot of time) to print to the terminal. We could still see how the script is progressing by printing only every 1,000,000 iterations. Let's update our code and profile it again!
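A minimal sketch of the updated loop, assuming a simple modulo check for the progress report:

```python
squares = []
for i in range(100_000_000):
    squares.append(i**2)
    if i % 1_000_000 == 0:  # report progress only once per million iterations
        print(i)
```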
We've called the print function only 100 times (`ncalls`) and basically spent no time at all providing updates to the terminal. Notice that the total number of function calls went from 200,000,004 to 100,000,103. Accordingly, the total execution time is roughly cut in half, to 30.574 seconds. Not bad!
Improvement 2: prepare list (-12%)
It's common practice to prepare lists (called arrays in compiled languages) ahead of time, if possible. We know how many elements our list is going to have, so we don't need to grow it element by element. A list comprehension is the Pythonic way to do this: instead of growing the list with append(), it builds the complete list in a single expression. Let's see if we gained performance!
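A sketch of the comprehension version (where the remaining ~100 progress prints went is an assumption; they're omitted here for brevity):

```python
# build the complete list in one pass - no append() calls needed
squares = [i**2 for i in range(100_000_000)]
```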
We reduced the number of function calls from 100,000,103 to 105 - and reduced the total execution time by about 4 seconds. If we look closely, we can see that example.py:3(<listcomp>) spends roughly the same amount of time that {method 'append' of 'list' objects} spent before. The net gain means that calling a lot of functions, even ones as trivial as append(), can be computationally expensive. Notice that building the list of squared values is still part of the total execution time.
If we ignore the Python for loop, the only piece of code that could be improved is the squaring of our loop variable: i**2. Let's look for an alternative to the ** operator in the Python standard library!
Improvement 3: try alternatives (-24%)
A few hours later...
It looks like the math module has an optimized function, math.pow(x, y), that does the same calculation as our squaring. The docs provide a bit more detail:
math.pow(x, y)
Return x raised to the power y. Exceptional cases follow the IEEE 754 standard as far as possible. In particular, pow(1.0, x) and pow(x, 0.0) always return 1.0, even when x is a zero or a NaN. If both x and y are finite, x is negative, and y is not an integer then pow(x, y) is undefined, and raises ValueError.
Unlike the built-in ** operator, math.pow() converts both its arguments to type float. Use ** or the built-in pow() function for computing exact integer powers.
Changed in version 3.11: The special cases pow(0.0, -inf) and pow(-0.0, -inf) were changed to return inf instead of raising ValueError, for consistency with IEEE 754.
The function takes two arguments x, y, which are our base (the loop variable i) and our exponent (2), respectively. Notice the comparison with the ** operator, which indicates that an additional conversion from int to float is necessary before the calculation is performed.
The proof of the pudding is in the eating!
Let's update the code and profile again!
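A sketch of the math.pow() variant:

```python
import math

# math.pow() converts its arguments to float, so the list now holds floats
squares = [math.pow(i, 2) for i in range(100_000_000)]
```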
Something very unexpected happened! The number of function calls increased dramatically from 105 to 100,000,326 - while the total execution time decreased from 27 seconds to 20 seconds. Why?
Execution time depends on many factors, some of which are more intuitive ("do more things, spend more time") than others. From our last profile, we can conclude that if you must do a lot of things and there's no way around it, you should at least be as efficient as possible to save time. It also means that the math.pow(x, y) function is more efficient here than the ** operator, even though it returns floats rather than exact integers.
Conclusion
If your lists become very large, it is worth considering NumPy arrays instead. All corresponding array functions are written in C and precompiled for optimal performance. Alternatively, Polars DataFrames provide DataFrame functions written in Rust and offer parallel execution for improved performance.
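A quick sketch of the NumPy variant, assuming NumPy is installed (`pip install numpy`):

```python
import numpy as np

i = np.arange(100_000_000, dtype=np.int64)  # all indices in one array
squares = i**2  # element-wise squaring runs in precompiled C code
```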
However, optimization often means finding a reasonable tradeoff between convenience (slow but maintainable) and performance (fast but complex). The sweet spot is where you feel comfortable sharing your code with other developers while at the same time being mindful of other people's most valuable resource:
Time.
Summary
Code optimization guided by profiling can reduce total execution time significantly. We found simple tweaks that cut the total execution time to a third of its original value while maintaining code readability.
And please don't forget:
Premature optimization is the root of all evil. - Donald Knuth (1974)