登录查看更多内容

Code Optimisation - What's all the fuss about?! - Part 1

Gaurav Sharma

Systems Engineering | Ex - Cisco

发布日期: 2020年2月23日

In my previous article, I had discussed some of the basic entities of Digital Signal Processing and FFT, that are integral to a DSP engineer’s day to day tasks.

You may have a quick look at the previous article by clicking on the link below:- https://www.dhirubhai.net/pulse/my-tryst-digital-signal-processing-gaurav-sharma/

This article is all about optimising your program as you go about it. For those of you who have already worked in program optimization would know about the efforts that it takes to make a program efficient.

But wait, let’s do some thinking here and ruminate - What program is considered optimized or efficient?

Honestly, I prefer not to be uptight about one definition of an optimized code because there are various factors involved. It totally depends upon a programmer’s motive, who may want to:-

1. Improve the speed of execution

2. Memory Consumption

3. Making the code secure by eliminating vulnerabilities

This article will put forward some of the good practices that should be adopted while writing and optimising a code. For a better understanding, I have taken the C/C++ programming language into consideration.

For those of you who are uninitiated, there are a few general points to keep in mind before going forward with program optimisation:-

An optimized code might generate a program with a bigger memory footprint i.e amount of memory that a program uses.
The optimisation goals might conflict with each other. For an instance - A code that is optimized for memory footprint may conflict with the goal of optimising it for a better performance.

So there should be a balance between the two in order to get the best of both worlds.

It is often difficult to determine the parts of the code that are taking up maximum resources just by going through speculations so it is advisable to use a profiler in order to identify the bottlenecks in the code.

Valgrind,Callgrind are some of the profilers that are available for C/C++.

The optimizations should be done on those parts of the program that are run the most, especially those methods which are called repeatedly by various inner loops that the program can have.

Let’s optimize!!

The most recent project that I worked on, required me to optimize the Audio and FFT routines running on an android application, having more than 12 threads running simultaneously.

The main focus of the optimisation was to reduce the CPU usage of android mobiles while running this application that continuously accumulates audio data and does FFT on it.

There were two clear and distinct approaches that could be taken w.r.t optimisation :-

Platform Agnostic Optimisation - programming language specific semantics,data structure and algorithms used in the code.
Platform Dependent Optimisation - Interacting with the underlying hardware to improve arithmetic and logical operations being performed on the data.

Let’s explore both the aforementioned approaches one-by-one:-

Platform Agnostic Optimisation

Platform Agnostic Optimisation refers to the optimisation techniques that are independent of the target machine.

Before starting to optimize a software, it is imperative that we understand the low-level and high-level operations that are performed on the data at play.

Note:- In a computer architecture, the cost of operation corresponds to the number of instructions executed to perform the task. A programmer may have a good look at the disassembly of a C/C++ code[giving an assembler level output] to count the number of instructions for a particular function or operation.

In a C/C++ software, among a lot of other things,there are

Loops
Arrays
Memory references

Loops

Loops have played a major role in our audio application where audio is sampled and FFT is computed continuously thereupon. In such an application where a loop is iterated significant number of times on an operation to produce a result, it is worthwhile to consider some of the loop optimisation techniques that helps in making the process efficient and quicker:-

Ensuring that the loop invariants are outside the loop.
Loop invariants are the conditions that hold for every iteration before and after the loop. For Example:-

int p = 19;

int q = 20;

int sum = 0;

for(int i=0; i<20; i++){

sum=sum+p+q+i; //p and q have the same values before and after every iteration

}

Loop unrolling - Many compilers handle this but if we know that our compiler is not unrolling the loops then we can do it in the program itself.

Old code -

for (i = 0; i < 20; i++)

{ do_something(i);

}

Loop Unrolled Code:-

for (i = 0; i < 20; i=i+3) {

do_something(i);

do_something(i); }

Avoid loop carried dependencies in C as the compiler cannot vectorise it — if the code contains a loop where the result of one iteration is affected by the result of a previous iteration.

Example of a loop carried dependency:-

for( i = 1; i <=100; i++) {

a[i+1] = a[i] + c[i];

b[i+1] = b[i] + a[i+1]; }

Avoid conditional statements inside loops,if possible.
When the data is stored in structures, it is good practice to write loops to use all the contents of the structure simultaneously. This exploits the cache better.
Loop jamming - Combining two or more adjacent loops that iterate over the same range.
Using pre-increment in the loop statement if possible, for the iterations instead of using post-increments as post-increments have to first load the value in the CPU register, then increment it and then again store it.

Arrays

Use Pointers to access the arrays in order to generate a faster code.

Have a look at the code snippet below:

for(int i=0; i<n; i++) {

nameArray[i] = value; }

The name of the array is a constant pointer that points to the first element of the array, so we can use pointer arithmetic. Instead of the above code snippet, the following is better:

for(int* arrPoint = nameArray; arrPoint<nameArray+n; arrPoint++){

*arrPoint = value;

}

In the above code snippet, we have taken a pointer to int data type that takes address from the name of the array. This pointer is moved towards the end of the array.

If a matrix is being used in a program and the developer has a choice to approach the elements row by row, then always access the elements row by row. The reason is that the arrays are stored in memory row after row so it is the natural way to approach an array.
Always avoid initializing large portions of memory with some element. In case it is unavoidable, then use memset.

Memory References

Before proceeding, let’s go through an important concept known as Locality of Reference.

Locality of reference - It is defined as the property of a program to use the addresses which are close to other recent references to the memory.

Locality of reference can be affected by splitting the data structures into “more frequently” and “less frequently used” portions and putting all the “more frequently” data packed together in the form of structures.

In operating systems, the principle of locality allows the system to use main memory as a cache of the most recently referenced chunk of virtual address space and also in case of recently used disk blocks in disk file systems.

Repeatedly references to variables are considered good for maintaining locality of reference.

Avoid copying large buffers like strings or structures, instead copy the pointer to them as copying and passing the whole buffer is a slow process.

Use “restrict” keyword with pointer declarations - It is a way of telling the compiler about an optimisation that the declared pointer “ptr” is the only way to access the object pointed by this and so the compiler can avoid the additional checks, limiting the effect of pointer aliasing.

For Example -

void operation(int* ptr1, int* ptr2, int* ptr3){

*ptr1 = *ptr1 + *ptr3;

*ptr2 = *ptr2 + *ptr3;

}

In the above code snippet, the pointers ptr1, ptr2 and ptr3 might refer to the same memory location so the compiler may not be able to generate an optimised code.However, if the restrict keyword is used in the pointer declarations :-

void operation(int* restrict ptr1, int* restrict ptr2, int* restrict ptr3);

Now the compiler assumes that the 3 pointers refer to different memory locations.Hence it will produce an optimised code. Try to have a look at the disassembly for both the cases and you will notice the difference.

Apart from the techniques discussed so far, there are a few other practices that one might adopt to make the code optimised even further. Some of them are as follows:-

Eliminating unnecessary function calls, conditional tests, and memory references.
Making the functions inline results in a faster code as it could greatly reduce function call stack overheads at the cost of code size.
Using ‘unsigned int’ instead of ‘int’ if the developer knows that the value will never be negative, as some processors handle unsigned integer arithmetic faster than signed int.
Using ‘switch’ cases instead of ‘if else’ - if a large number of decisions is to be made.Also test the most likely case first, if possible.
Use Lookup Tables wherever possible - This comes at the cost of memory utilisation, in which pre calculated data is stored in a table and is accessed in constant time. It is heavily used in the CRC calculation used in SMBUS protocol and processes that require sin/cos function values.
Use ‘register’ keyword with the variables which are heavily used - It hints the compiler to store the value in a CPU register which provides a much faster access.
Limiting the operations like opening ,reading,writing a file,copying significant amounts of data, searching, sorting on an entire array as they are some of the costliest operations.

The list is ever-growing and exhaustive, so I would like to limit it with the aforementioned points. Do let me know in the comments section if you wish to add something to the above list. In Part-2 of this article, we will go through the optimisation techniques that are dependent on the underlying hardware platform. Until then stay tuned!!

Nikhil Chawla

Software Engineer @ Microsoft | Azure

5 年

Nice one dude ??

himanshu sahdev

Systems Engineer at NXP Semiconductors

5 年

Great article gaurav! I would love to add a bit to the loops part above: We can also use frequency reduction, i.e decreasing the amount of the code in loop(move outside without affecting the semantics of the program) int p = 19; int q = 20; int sum = 0; int t = p + q; //can be more than just addition(a complex calculation) for(int i=0; i<20; i++){ ? ? sum=sum+t+i; } #keep_sharing?#waiting_for_the_next_part?#optimization

1 次回应

查看更多评论

要查看或添加评论，请登录

Gaurav Sharma的更多文章

Code Optimisation - What's all the fuss about?! - Part 2

2020年3月4日

Code Optimisation - What's all the fuss about?! - Part 2

Platform Dependent Optimisation In Part-1 of this article, we had discussed some of the platform independent…
My Tryst with Digital Signal Processing

2020年1月17日

My Tryst with Digital Signal Processing

Let’s do a very brief mind exercise before going forward,shall we?! Imagine the things that you had started working on…

5 条评论
Unveiling 'smartM alpha v1.0'

2017年3月16日

Unveiling 'smartM alpha v1.0'

Finally, the first prototype of our Smart Metering system is hot and ready..

2 条评论
Getting into the world of Smart Metering!!

2017年3月3日

Getting into the world of Smart Metering!!

Smart metering is generally referred to as a utility for monitoring and billing of the electrical energy consumption in…

3 条评论
Making 'Things' live with AWS IOT

2017年2月8日

Making 'Things' live with AWS IOT

Hey i am back with the follow-up of my previous post ARM Cortex M-3 based Temperature and Humidity Monitoring system as…
ARM Cortex M-3 based Temperature and Humidity Monitoring system.

2017年2月1日

ARM Cortex M-3 based Temperature and Humidity Monitoring system.

Hey i am back again !! It has been quite a while now, since i have published my work. I have been working with an ARM…

1 条评论
Web Socket'ing' Raspberry Pi and ESP8266 Wi-Fi module

2016年11月13日

Web Socket'ing' Raspberry Pi and ESP8266 Wi-Fi module

While refurbishing our IEEE-NIEC Makers' Space a few weeks back,I came up with a problem of working out an IOT…

4 条评论
Sensors talking to the Internet !!

2016年10月16日

Sensors talking to the Internet !!

Zigbee Gateway Have you ever witnessed Sensors talking to the Internet? Well here it is!! IEEE-NIEC present its…

1 条评论
TI's CC3200 saying Hello via Web Socket !!

2016年10月16日

TI's CC3200 saying Hello via Web Socket !!

Just worked up Texas Instrument's CC3200 Launchpad with MQTT over Web Socket. Wonderful board to prototype real-time…

2 条评论

See all articles

Code Optimisation - What's all the fuss about?! - Part 1

Gaurav Sharma

Systems Engineering | Ex - Cisco

Avoid copying large buffers like strings or structures, instead copy the pointer to them as copying and passing the whole buffer is a slow process.

Gaurav Sharma的更多文章

社区洞察

其他会员也浏览了

Writing Async Code in?Rust

Unique IDs in programming

Go Profiling 0 to 1: A Beginner’s Guide to Analysing Your Code’s Efficiency

Concurrent and Parallel Programming

Constant Folding in C++

C++17’s Useful Features for Embedded Systems

Rust vs. C++: Performance, Safety, and Trade-offs in System Programming

Datagram Routing Considered Harmful

The Null Pointer: Tony Hoare's Billion-Dollar Oopsie

Enhancing Code Quality with MISRA?-C Rules and GCC Options #14

Avoid copying large buffers like strings or structures, instead copy the pointer to them as copying and passing the whole buffer is a slow process.

Gaurav Sharma的更多文章

Code Optimisation - What's all the fuss about?! - Part 2

My Tryst with Digital Signal Processing

Unveiling 'smartM alpha v1.0'

Getting into the world of Smart Metering!!

Making 'Things' live with AWS IOT

ARM Cortex M-3 based Temperature and Humidity Monitoring system.

Web Socket'ing' Raspberry Pi and ESP8266 Wi-Fi module

Sensors talking to the Internet !!

TI's CC3200 saying Hello via Web Socket !!

社区洞察

其他会员也浏览了

Writing Async Code in?Rust

Unique IDs in programming

Go Profiling 0 to 1: A Beginner’s Guide to Analysing Your Code’s Efficiency

Concurrent and Parallel Programming

Constant Folding in C++

C++17’s Useful Features for Embedded Systems

Rust vs. C++: Performance, Safety, and Trade-offs in System Programming

Datagram Routing Considered Harmful

The Null Pointer: Tony Hoare's Billion-Dollar Oopsie

Enhancing Code Quality with MISRA?-C Rules and GCC Options #14