Realtime Audio Processing with Linux
Introduction
Do you drive a recent car? Chances are your infotainment system, i.e. the central console which hosts a wealth of applications and convenience functions (navigation, media playback, telephony, etc.), is based on Linux.
Linux is now the "de facto" standard for infotainment. Although it is still not ready (and probably never will be) to handle safety-relevant components within a car, it is already mature and efficient enough to manage time-critical applications like the complex real-time audio and control flows which are the backbone of your car's infotainment.
In this series of articles we are going to show how real-time audio processing can be implemented in Linux, starting from the basic concepts of audio processing, and going through some examples of increasing complexity.
Basic concepts of sound processing
What is sound?
Sound is a variation of air pressure, which is typically produced by the vibration of objects (strings of a violin, vocal cords, etc) and transmitted through the air. Sound is picked up by the eardrum in our ears where it gets converted into electrical impulses which then get processed by our brains (ok... sorry about this very inaccurate description, it's just to give some context to what follows).
The higher the frequency at which objects vibrate (and therefore the frequency of the variations in air pressure), the higher the pitch we perceive will be. For instance, a central A on a piano keyboard has a frequency of 440 Hz, whereas the A of the next octave up will be at 880 Hz.
How do we capture and produce sound?
There are two electrical components which constitute the endpoints of audio processing systems:
The microphone
The microphone picks up air pressure variations and converts them into voltage. If we were to measure the voltage produced by the microphone while playing a pure central A note, we would see something like this:
As a central A is a sound at 440 Hz, the time distance between two peaks (a.k.a. the period) will be 1/440 ≈ 0.0023 seconds.
The speaker
Speakers convert voltage to air pressure. If we were to drive a speaker with the signal produced by the microphone it would produce a central A note.
Well, not really
Ok, this is a very simplified view of reality, because (among other reasons):
How do we take sound in and out of the digital domain?
The signal produced by the microphone is an analog signal, i.e. a continuously changing voltage level which mimics the air pressure changes. Not something a CPU can directly digest.
This signal must be converted into something that a CPU can process. Therefore the next step in the audio processing chain is sampling: a process by which a continuous analog signal is measured at a sequence of equidistant moments in time.
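To make the idea concrete, here is a minimal C sketch (not tied to any real driver) that mimics what a sampler does: it evaluates the continuous signal sin(2πft) only at the discrete instants t = n / rate, for a 440 Hz tone sampled at 44.1 kHz.

```c
#include <math.h>
#include <stdio.h>

#define PI          3.14159265358979323846
#define SAMPLE_RATE 44100.0   /* samples per second */
#define TONE_HZ     440.0     /* a central A */

int main(void)
{
    /* A sampler only looks at the continuous signal at the discrete
     * instants t = n / SAMPLE_RATE; everything in between is lost. */
    for (int n = 0; n < 32; n++) {
        double t = n / SAMPLE_RATE;                 /* time of the n-th sample */
        double v = sin(2.0 * PI * TONE_HZ * t);     /* "analog" value at that instant */
        printf("sample %2d  t = %.6f s  value = %+.4f\n", n, t, v);
    }
    return 0;
}
```

(Compile with -lm; the output is simply the list of sampled values, which is exactly the stream of numbers the rest of the chain will work with.)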
The interval between two samples defines the so-called sampling frequency (the frequency being the inverse of that interval). The higher the frequency (hence, the shorter the distance in time between samples), the more accurate the representation of the signal will be. However, this will also require a more powerful CPU to process all the samples in realtime.
Commonly used sampling frequencies are 8 kHz (telephony), 16 kHz (wideband voice), 44.1 kHz (CD audio) and 48 kHz (professional audio and video).
Higher frequencies (e.g. 96 kHz or 192 kHz) are also occasionally used, but only in high-end audio systems, to allow headroom for complex signal processing operations. The audible range of the human ear has an upper limit of about 20 kHz, and by the Nyquist-Shannon sampling theorem a signal only needs to be sampled at twice its highest frequency; therefore in normal applications it makes no sense to sample a signal at a rate much higher than double that number.
After sampling we have a signal which is continuous in value but discrete in time. The next step is to convert those electrical values into numbers, and this is carried out by a component called an analog-to-digital converter, whose purpose is to measure electrical voltages and create their numerical representation. Analog-to-digital converters (ADC or A/D for short) output an integer number which can be represented with various bit widths, the most common being 8, 10, 12, 16, 20, 24 and 32. Using a larger number of bits to represent a sample creates a more accurate approximation of the signal (giving better audio quality), but it also implies the need for a more powerful CPU to process all the data.
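As a rough illustration (a deliberately simplified model of an ADC, ignoring noise, non-linearity and so on), this is what quantizing an analog level to a given bit width amounts to:

```c
#include <stdint.h>
#include <stdio.h>

/* Quantize a normalized analog value (-1.0 .. +1.0) to a signed integer
 * of the given bit width, roughly as an ADC would. */
int32_t quantize(double x, int bits)
{
    int32_t max = (1 << (bits - 1)) - 1;   /* e.g. 32767 for 16 bits */

    if (x > 1.0)  x = 1.0;                 /* clip out-of-range inputs */
    if (x < -1.0) x = -1.0;
    return (int32_t)(x * max);
}

int main(void)
{
    double v = 0.7071;  /* an arbitrary analog level */

    printf("16-bit: %d\n", quantize(v, 16));  /* coarser grid */
    printf("24-bit: %d\n", quantize(v, 24));  /* finer grid, better approximation */
    return 0;
}
```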
In practice, sampling and analog-to-digital conversion are physically performed by the same chip.
After the signal has been converted to a digital format it can be brought into the CPU, where it will be processed. The CPU can then spit out a sequence of numbers which will be fed into a component called a digital-to-analog converter (DAC or D/A for short), whose job is the opposite of the ADC's: converting numbers back into analog signals. The output of the DAC is then amplified and fed to the speaker.
Therefore, the complete processing chain for a simple mic-CPU-speaker audio processor will be as follows (this is obviously a very simplified view):
Also, there can be cases where more channels are involved in audio processing: for a stereo signal we will have two microphones and two speakers, each with its own set of amplifiers, ADCs and DACs. More than two channels are obviously possible: for example, a normal car will have two front speakers and two rear speakers, whereas in more sophisticated infotainment systems the number of channels can be much larger.
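When more channels are present, their samples are commonly stored interleaved, one frame per sampling instant (L0 R0 L1 R1 ...). A small sketch of that layout, with a hypothetical per-channel gain function:

```c
#include <stdint.h>

/* One stereo frame: the samples of all channels taken at the same instant.
 * Interleaved storage means frames simply follow each other in memory:
 * L0 R0 L1 R1 L2 R2 ... */
typedef struct {
    int16_t left;
    int16_t right;
} stereo_frame_t;

/* Hypothetical example: apply an independent gain to each channel of a
 * block of interleaved frames (no clipping protection, for brevity). */
void apply_gain(stereo_frame_t *frames, int n_frames, float gain_l, float gain_r)
{
    for (int i = 0; i < n_frames; i++) {
        frames[i].left  = (int16_t)(frames[i].left  * gain_l);
        frames[i].right = (int16_t)(frames[i].right * gain_r);
    }
}
```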
Processing
Now that we have ways to get numbers representing sound in and out of our CPU, we can think about audio processing.
To put it simply, the most basic form of audio processing consists of the following steps (a code sketch follows the list):

1. Wait for the next input sample to arrive from the ADC.
2. Process it (filtering, effects, mixing, etc.).
3. Send the resulting sample to the DAC.
4. Go back to step 1.
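In code, the naive approach boils down to something like this (read_adc(), write_dac() and process() are hypothetical placeholders for whatever the driver and the algorithm provide):

```c
#include <stdint.h>

/* Hypothetical placeholders: in a real system these would be provided by
 * the driver (read_adc/write_dac) and by our algorithm (process). */
extern int16_t read_adc(void);        /* blocks until the next input sample arrives */
extern void    write_dac(int16_t s);  /* queues one sample for output */
extern int16_t process(int16_t s);    /* e.g. a digital filter */

void audio_loop(void)
{
    for (;;) {
        int16_t in  = read_adc();     /* at 44.1 kHz this happens every ~22.7 us */
        int16_t out = process(in);    /* one sample in, one sample out */
        write_dac(out);
    }
}
```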
Although this solution looks simple and efficient (we process incoming data as soon as we can), it's hardly feasible.
Considering a typical CD-quality sampling frequency of 44.1 kHz, we would receive an input sample every 1/44100 ≈ 0.0000227 seconds (about 22.7 microseconds). Every time a sample is received, an interrupt is triggered to the CPU. So, unless we execute the entire audio processing algorithm in the interrupt handler (doable, but certainly neither easy nor recommended), we will need to execute the following chain:
The problem here is that steps (5) and (8) require a context switch, which under Linux on an average system may take a few microseconds. So with this solution we would waste a lot of CPU resources continuously switching back and forth between processes. Also, context switching is highly unreliable in terms of execution latency (unless some sort of realtime extension to the kernel is used), potentially leading to the loss of input samples (a.k.a. overrun: an input value gets overwritten before we have time to process it) or output samples (a.k.a. underrun: we can't provide a required value in time).
The solution is relatively simple: buffer the data and process it in chunks instead of one sample at a time.
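The same loop, reworked to handle one chunk at a time, could look roughly like this (read_block() and write_block() are hypothetical helpers standing in for whatever the real driver API provides):

```c
#include <stddef.h>
#include <stdint.h>

#define PERIOD_SIZE 441   /* 10 ms worth of samples at 44.1 kHz */

/* Hypothetical helpers standing in for the real driver API. */
extern void    read_block(int16_t *buf, size_t n);        /* blocks until n input samples are ready */
extern void    write_block(const int16_t *buf, size_t n); /* queues n samples for output */
extern int16_t process(int16_t s);

void audio_loop(void)
{
    int16_t in[PERIOD_SIZE], out[PERIOD_SIZE];

    for (;;) {
        read_block(in, PERIOD_SIZE);              /* one wake-up per chunk, not per sample */
        for (size_t i = 0; i < PERIOD_SIZE; i++)
            out[i] = process(in[i]);
        write_block(out, PERIOD_SIZE);
    }
}
```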
Buffering
To overcome the problems described above, we introduce two buffers: one for input samples and one for output samples:
The processing sequence then becomes:
The number of samples in the input buffer which will trigger an input event is typically configurable by software (and likewise for the output buffer). A typical value could be the number of samples corresponding to 10 ms: at 44.1 kHz, this means 441 samples will be present in each buffer.
Dual / multiple buffers
In the sequence above, consider what happens when the last sample is inserted into the input buffer. Samples (at 44.1 kHz) get inserted by the input DMA every 22 microseconds.
When the last sample is inserted into the buffer, the DMA will trigger an interrupt to the CPU. From that moment, the CPU has only about 22 microseconds before the DMA writes the next sample into the buffer. If this time constraint is not met (and it is very likely that it won't be), the buffer will already contain a sample which belongs to the next period, causing audio artefacts. A similar situation occurs on the output buffer.
For this reason the solution which is normally adopted is to have multiple input and output buffers to give the CPU some maneuvering time:
In this solution, the input DMA begins filling input buffer 1. Once that buffer is full, it triggers an interrupt to the CPU and begins filling buffer 2 while the CPU processes buffer 1. When buffer 2 is also full, it triggers an interrupt to the CPU and goes back to filling buffer 1 while the CPU processes buffer 2, and so on.
In this way the CPU has the whole buffer time (which can be in the range of 10 ms) to process a buffer.
Likewise, on the output stage, while the CPU fills buffer 1 the output DMA streams buffer 2 to the speaker. Once a buffer has been completely drained, an interrupt is triggered and the two buffers are swapped.
This solution, with two buffers, is also called a "ping pong buffer".
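A minimal sketch of the idea on the input side (the callback and buffer names are hypothetical; a real driver hides all of this behind its API):

```c
#include <stdint.h>

#define PERIOD_SIZE 441   /* 10 ms at 44.1 kHz */

static int16_t in_buf[2][PERIOD_SIZE];   /* the two "ping pong" input buffers */
static volatile int ready_idx = -1;      /* index of the buffer the CPU may consume */

/* Hypothetical: called by the driver when the DMA has finished filling one
 * buffer and has already switched to filling the other one. */
void dma_input_complete_isr(int completed_idx)
{
    ready_idx = completed_idx;           /* hand the full buffer over to the CPU */
}

/* CPU side: wait until a full buffer is available, then claim it.  A real
 * implementation would block inside the driver instead of busy-waiting. */
const int16_t *wait_for_input_period(void)
{
    int idx;

    while ((idx = ready_idx) < 0)
        ;                                /* simplified: real code would sleep here */
    ready_idx = -1;                      /* mark the buffer as consumed */
    return in_buf[idx];
}
```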
If the system is heavily loaded and the CPU might occasionally fail to process a buffer before the next one gets filled (or drained, in the case of output buffers), it is possible to use more than two buffers. The idea is that the CPU is fast enough to process all the data within a buffer time, but it might have sporadic moments in which it can't respond to interrupts in time. Additional buffers help compensate for that.
Drawbacks of multiple buffers
Let's consider a simple case: two input buffers and two output buffers (the ping-pong scheme described above), each holding 10 ms of audio (441 samples at 44.1 kHz).
From the moment the DMA inserts the first audio sample into an input buffer to the moment the CPU is able to process it we have a latency of 10 ms. From the moment the CPU writes an output sample to an output buffer to the moment the output DMA streams it out we have another 10 ms.
Therefore, in this configuration we have an overall latency of 20 ms from input signal to output signal.
When input and output buffers increase in number the situation gets even worse, as the total latency becomes:
Latency = (PERIOD_TIME) * (N_INPUT_BUFFERS - 1) + (PERIOD_TIME) * (N_OUTPUT_BUFFERS - 1)
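As a quick sanity check, plugging our example numbers into the formula above:

```c
#include <stdio.h>

int main(void)
{
    const double period_time_ms   = 10.0;  /* 441 frames at 44.1 kHz */
    const int    n_input_buffers  = 2;     /* ping-pong on the capture side */
    const int    n_output_buffers = 2;     /* ping-pong on the playback side */

    double latency_ms = period_time_ms * (n_input_buffers - 1)
                      + period_time_ms * (n_output_buffers - 1);

    printf("round-trip latency: %.1f ms\n", latency_ms);  /* prints 20.0 */
    return 0;
}
```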
20 ms may not sound like much, but if we are implementing a realtime audio application these delays can become very noticeable. Imagine, for instance, an application which processes a singer's voice in realtime and applies some digital filters: the output signal will arrive with a small but perfectly audible delay.
The only solutions are to keep the number of buffers to a minimum (fewer than two just won't work) and to keep the buffer time as short as possible.
This requires:

- a CPU fast enough to process a whole buffer well within one buffer time;
- an operating system that can service the audio interrupts with low and predictable latency (for example a kernel with realtime extensions);
- an application that gets scheduled promptly as soon as audio data is available (a minimal sketch of this last point follows).
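On Linux, that last point typically means running the audio thread under a real-time scheduling policy. A minimal sketch, with error handling reduced to the bare minimum (note that this requires appropriate privileges, e.g. CAP_SYS_NICE or an rtprio limit configured for the user):

```c
#include <sched.h>
#include <stdio.h>

/* Ask the kernel to schedule the calling process with the SCHED_FIFO
 * real-time policy, so that it preempts normal tasks as soon as audio
 * data becomes available. */
int make_me_realtime(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}
```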
Audio flow in Linux terms
We have seen many simple concepts which define how audio data are captured, organised, and streamed out. Let's see how these concepts are represented in Linux.
Audio processing in Linux is managed by a framework called ALSA (Advanced Linux Sound Architecture). So more specifically we are going to look at those concepts in ALSA terms.
To map the general concepts to ALSA terms, we can sketch this mapping:
Please notice the discrepancy in terminology between the more "physical" view we have seen before and the ALSA terms.
What we called the "dual buffer" or "ping pong buffer" solution should be called, in ALSA terms, a "dual period" setup, as in ALSA the term "buffer" refers to the container which holds all the periods.
In the picture above we can see a graphical representation of our example.
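As a small preview of the next article, this is roughly how an application negotiates the period size and the number of periods with ALSA (a bare sketch with all error checking omitted; the real calls and their quirks are the subject of part two):

```c
#include <alsa/asoundlib.h>

/* Ask ALSA for 2 periods of 441 frames each on the default capture device.
 * Every call below can fail and should be checked in real code, and the
 * device may round the requested values to what the hardware supports. */
int open_capture(snd_pcm_t **pcm)
{
    snd_pcm_hw_params_t *hw;
    unsigned int rate = 44100, periods = 2;
    snd_pcm_uframes_t period_size = 441;     /* 10 ms at 44.1 kHz */

    snd_pcm_open(pcm, "default", SND_PCM_STREAM_CAPTURE, 0);

    snd_pcm_hw_params_alloca(&hw);
    snd_pcm_hw_params_any(*pcm, hw);
    snd_pcm_hw_params_set_access(*pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
    snd_pcm_hw_params_set_format(*pcm, hw, SND_PCM_FORMAT_S16_LE);
    snd_pcm_hw_params_set_channels(*pcm, hw, 2);
    snd_pcm_hw_params_set_rate_near(*pcm, hw, &rate, NULL);
    snd_pcm_hw_params_set_period_size_near(*pcm, hw, &period_size, NULL);
    snd_pcm_hw_params_set_periods_near(*pcm, hw, &periods, NULL);

    return snd_pcm_hw_params(*pcm, hw);      /* commit the configuration */
}
```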
What's next?
In this article we have explored the basic principles of audio processing. These concepts apply to any audio processing platform, although the description has become more Linux-oriented towards the end.
In the next article we will dig deeper into how these general concepts map to the ALSA world.
You can find the second part here: