Realtime Audio processing with Linux

Introduction

Do you drive a recent car? Chances are your infotainment system, i.e. the central console which hosts a wealth of applications and convenience functions (navigation, media playback, telephony, etc.), is based on Linux.

Linux is now the "de facto" standard for infotainment. Although it's still not ready (and probably never will be) to handle safety-relevant components within a car, it is already mature and efficient enough to manage time-critical applications like the complex real-time audio and control flows which are the backbone of your car's infotainment.

In this series of articles we are going to show how real-time audio processing can be implemented in Linux, starting from the basic concepts of audio processing, and going through some examples of increasing complexity.

Basic concepts of sound processing

What is sound?

Sound is a variation of air pressure, which is typically produced by the vibration of objects (strings of a violin, vocal cords, etc) and transmitted through the air. Sound is picked up by the eardrum in our ears where it gets converted into electrical impulses which then get processed by our brains (ok... sorry about this very inaccurate description, it's just to give some context to what follows).

The higher the frequency at which objects vibrate (and therefore the faster the variations in sound pressure), the higher the pitch we perceive will be. For instance, a central A on a piano keyboard has a frequency of 440 Hz, whereas the A of the next octave up will be at 880 Hz.

How do we capture and produce sound?

There are two electrical components which constitute the endpoints of audio processing systems:

The microphone

The microphone picks up air pressure variations and converts them to voltage. If we were to measure the voltage produced by the microphone while playing a pure central A note, we would see something like this:

[Figure: the microphone output for a pure central A, a 440 Hz sine wave]

As a central A is a sound at 440 Hz, the time between two peaks (a.k.a. the period) will be 1/440 ≈ 0.0023 seconds.

The speaker

Speakers convert voltage to air pressure. If we were to drive a speaker with the signal produced by the microphone it would produce a central A note.

[Figure: the microphone signal driving a speaker]

Well, not really

Ok, this is a very simplified view of reality, because (among other reasons):

  • A real sound source will never produce a pure sine wave
  • Voltage produced by a microphone will typically be very small, in the range of a few millivolts, therefore requiring amplification and conditioning before being used for any purpose.
  • Driving a speaker requires further amplification

How do we take sound in and out of the digital domain?

The signal produced by the microphone is an analog signal, i.e. a continuously changing voltage level which mimics the air pressure changes. Not something a CPU can directly digest.

This signal must be converted into something that a CPU can process. Therefore the next step in the audio processing chain is sampling: a process by which a continuous analog signal is captured in a sequence of equidistant moments in time.

[Figure: the analog signal sampled at equidistant points in time]

The interval between two samples defines the sampling period; its inverse is the so-called sampling frequency. The higher the frequency (hence, the shorter the interval between samples), the more accurate the representation of the signal will be. However, this will also require a more powerful CPU to process all the samples in realtime.

Commonly used sampling frequencies are:

  • 8 kHz: low quality telephone applications
  • 16 kHz: high quality telephone applications
  • 44.1 kHz: CD quality audio

Higher frequencies are also occasionally used, but only in high-end audio systems, to allow for complex signal processing operations. The audible range for the human ear has an upper limit of about 20 kHz, therefore (by the Nyquist criterion) in normal applications it makes no sense to sample a signal at a frequency much higher than double that number.

After sampling we have a signal which is continuous in value, but discrete in time. The next step is to convert those electrical values into numbers, and this is carried out by a component called an analog-to-digital converter. Its purpose is to measure electrical voltages and create their numerical representation. Analog-to-digital converters (ADC or A/D for short) output an integer number which can be represented with various bit widths, the most common being 8, 10, 12, 16, 20, 24 and 32. Using a larger number of bits to represent a sample will create a more accurate approximation of the signal (giving better audio quality), but it will also require a more powerful CPU to process all the data.

In practice, sampling and analog-to-digital conversion are physically performed by the same chip.
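To make these two steps concrete, here is a minimal sketch in C (my own illustration, not tied to any real converter) of what the sampling stage and a 16-bit ADC do together: it evaluates a pure 440 Hz tone at 44.1 kHz and quantizes each value to a signed 16-bit integer.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SAMPLE_RATE 44100          /* samples per second              */
    #define TONE_HZ     440.0          /* the central A from the example  */
    #define N_SAMPLES   64             /* just a short burst, for brevity */

    int main(void)
    {
        const double two_pi = 2.0 * acos(-1.0);
        int16_t samples[N_SAMPLES];

        for (int n = 0; n < N_SAMPLES; n++) {
            /* sampling: the n-th sample is taken at time n / SAMPLE_RATE */
            double t = (double)n / SAMPLE_RATE;
            /* the "analog" signal: a pure sine wave at 440 Hz, amplitude 1.0 */
            double x = sin(two_pi * TONE_HZ * t);
            /* quantization: map [-1.0, 1.0] onto the 16-bit integer range */
            samples[n] = (int16_t)lrint(x * 32767.0);
        }

        for (int n = 0; n < 8; n++)
            printf("sample[%d] = %d\n", n, samples[n]);
        return 0;
    }

The resulting array of 16-bit integers is exactly the kind of data the CPU will receive from the audio hardware.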

After the signal has been converted to a digital format it can be brought into the CPU, where it will be processed. The CPU can then spit out a sequence of numbers which will be fed into a component called a digital-to-analog converter (DAC or D/A for short), whose job is the opposite of the ADC's: converting numbers back into analog signals. The output of the DAC is then amplified and fed to the speaker.

Therefore, the complete processing chain for a simple mic-CPU-speaker audio processor will be as follows (this is obviously a very simplified view):

[Figure: microphone, amplifier, ADC, CPU, DAC, amplifier, speaker]

Also, there can be cases where more channels are involved in audio processing: for a stereo signal we will have two microphones and two speakers, each of which will have its own set of amplifiers, ADCs and DACs. More than two channels are obviously possible: for example, a normal car will have two front speakers and two rear speakers, whereas in more sophisticated infotainment systems the number of channels can be much larger.

Processing

Now that we have ways to get numbers representing sound in and out of our CPU, we can think about audio processing.

To put it simply, audio processing consists of the following steps (sketched in code right after the list):

  1. Get one sample (or two in case of stereo, or more in case of multichannel) from the ADC into the CPU
  2. Process the input sample(s) and compute output sample(s)
  3. Spit out the computed output sample(s) to the DAC
  4. Repeat
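Here is a minimal sketch of this loop in C. The functions adc_read_sample() and dac_write_sample() are hypothetical stand-ins for whatever driver interface the platform provides (stubbed out here so the sketch compiles), and the "processing" is just a volume reduction.

    #include <stdint.h>

    int16_t adc_read_sample(void)       { return 0; }   /* stand-in for the ADC driver */
    void    dac_write_sample(int16_t s) { (void)s;   }  /* stand-in for the DAC driver */

    static int16_t process_sample(int16_t in)
    {
        return (int16_t)(in / 2);       /* example processing: ~6 dB attenuation */
    }

    void audio_loop(void)
    {
        for (;;) {
            int16_t in  = adc_read_sample();   /* 1. get one sample            */
            int16_t out = process_sample(in);  /* 2. compute the output sample */
            dac_write_sample(out);             /* 3. send it to the DAC        */
        }                                      /* 4. repeat                    */
    }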

Although this solution looks simple and efficient (we process incoming data as soon as we can) it's hardly feasible.

Considering a typical CD-quality sampling frequency of 44.1 kHz, we would receive an input sample every 1/44100 ≈ 0.0000227 seconds (about 22.7 microseconds). Every time a sample is received, an interrupt is triggered on the CPU. So, unless we execute the entire audio processing algorithm in the interrupt handler (doable but certainly neither easy nor recommended), we will need to execute the following chain:

  1. A sample is available
  2. An interrupt is generated
  3. An interrupt handler gets executed in kernel mode
  4. The interrupt handler signals a user-space application which performs audio processing
  5. The user space application wakes up
  6. The user space application processes the sample and computes the output value
  7. The user space application writes the output value to the DAC (probably going through the kernel, unless it has direct I/O access from userspace, which is neither a very elegant nor a safe solution)
  8. The user space application goes back to sleep

The problem here is that steps (5) and (8) require a context switch, which under Linux on an average system may take a few microseconds. So with this solution we would be wasting a lot of CPU resources continuously switching back and forth between processes. Also, context switching is highly unreliable in terms of execution latency (unless some sort of realtime extension to the kernel is used), potentially leading to loss of input samples (a.k.a. overrun: an input value gets overwritten before we have time to process it) or output samples (a.k.a. underrun: we can't provide a required value in time).
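To put rough numbers on it: at 44.1 kHz the budget is about 22.7 microseconds per sample; if we assume, purely for illustration, around 3 microseconds per context switch, the two switches alone consume roughly 6 microseconds, about a quarter of the budget, before a single instruction of actual audio processing has run.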

The solution is relatively simple: buffer the data and process it in chunks instead of sample by sample.

Buffering

To overcome the problems seen just above, we introduce two buffers, one for input samples and one for output samples:

[Figure: input buffer between the ADC and the CPU, output buffer between the CPU and the DAC]

The processing sequence then becomes:

  1. A sample is available
  2. The sample gets stored in the input buffer by a dedicated hardware component (called DMA)
  3. When the input buffer contains a "sufficient" number of samples, an interrupt is generated
  4. An interrupt handler gets executed in kernel mode
  5. The interrupt handler signals a user-space application which performs audio processing
  6. The user space application wakes up
  7. The user space application processes the input buffer and computes output values
  8. The user space application writes the output buffer
  9. The user space application goes back to sleep
  10. A dedicated hardware component (also a DMA) sends the output buffer, one sample at a time, to the DAC

The number of samples in the input buffer which will trigger an input event is typically configurable by software (and likewise the number of samples in the output buffer). A typical value could be the number of samples corresponding to 10 ms. At 44.1 kHz, this means 441 samples will be present in the buffers.
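The per-sample loop sketched earlier can then be reworked to operate on whole buffers. As before, wait_for_input_buffer() and submit_output_buffer() are hypothetical stand-ins for the driver interface (in Linux this role is played by ALSA, introduced below), stubbed out so the sketch compiles.

    #include <stddef.h>
    #include <stdint.h>

    #define PERIOD_FRAMES 441                        /* 10 ms at 44.1 kHz, mono */

    static int16_t input_stub[PERIOD_FRAMES];

    /* Hypothetical: blocks until the input DMA has filled a whole buffer,
     * then returns a pointer to it. */
    const int16_t *wait_for_input_buffer(void) { return input_stub; }

    /* Hypothetical: hands a filled buffer over to the output DMA. */
    void submit_output_buffer(const int16_t *buf) { (void)buf; }

    void audio_loop(void)
    {
        static int16_t out[PERIOD_FRAMES];

        for (;;) {
            const int16_t *in = wait_for_input_buffer(); /* one interrupt per buffer */
            for (size_t i = 0; i < PERIOD_FRAMES; i++)   /* process the whole chunk  */
                out[i] = (int16_t)(in[i] / 2);           /* same ~6 dB attenuation   */
            submit_output_buffer(out);                   /* picked up by output DMA  */
        }
    }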

Dual / multiple buffers

In the sequence above, consider what happens when the last sample is inserted into the input buffer. Samples (at 44.1 kHz) get inserted by the input DMA about every 22.7 microseconds.

When the last sample is inserted in the buffer, the DMA will trigger an interrupt to the CPU. From that moment, the CPU has only about 22.7 microseconds before the DMA writes the next sample into the buffer. If this time constraint is not met (and it's very likely that it won't be), the buffer will already contain a sample which belongs to the next period, causing audio artefacts. A similar situation occurs on the output buffer.

For this reason the solution which is normally adopted is to have multiple input and output buffers to give the CPU some maneuvering time:

[Figure: ping-pong input and output buffers between the DMAs and the CPU]

In this solution, the input DMA will begin filling input buffer 1. Once that buffer is full, it will trigger an interrupt to the CPU and begin filling buffer 2 while the CPU processes buffer 1. When buffer 2 is also full, it will trigger another interrupt and go back to filling buffer 1 while the CPU processes buffer 2, and so on.

In this way the CPU has the whole buffer time (which can be in the range of 10 ms) to process a buffer.

Likewise, on the output stage while the CPU fills buffer 1 the output DMA will stream buffer 2 to the speaker. Once a buffer has been completely depleted, an interrupt is triggered and the two buffers are swapped.

This solution, with two buffers, is also called a "ping pong buffer".

In case we have an extremely overloaded system, in which the CPU might occasionally not make it in time to process a buffer before the next one gets filled (or depleted, in case of output buffers) it's possible to use more than two buffers. The idea is that the CPU is fast enough to process all data within a buffer in time, but it might have some sporadic moments in which it can't respond in time to interrupts. Additional buffers help compensate for that.
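One way to picture the general N-buffer case is as a ring: the DMA fills one period after another, wrapping around, while the CPU consumes them in order, and the count of filled-but-unprocessed periods is the slack that absorbs a late wake-up. A minimal sketch of the index arithmetic, with made-up names and with synchronization between interrupt and task omitted for brevity:

    #define N_PERIODS     4                   /* more than 2 gives extra slack      */
    #define PERIOD_FRAMES 441                 /* 10 ms at 44.1 kHz, mono            */

    static short periods[N_PERIODS][PERIOD_FRAMES];
    static unsigned write_idx, read_idx, pending;

    /* Called from the DMA-complete interrupt: the DMA has just filled
     * periods[write_idx] and moves on to the next slot in the ring. */
    void on_period_filled(void)
    {
        write_idx = (write_idx + 1) % N_PERIODS;
        pending++;                            /* if this reaches N_PERIODS: overrun */
    }

    /* Called from the processing task: returns the oldest filled period,
     * or NULL if no full period is available yet. */
    short *next_period_to_process(void)
    {
        if (pending == 0)
            return 0;
        short *p = periods[read_idx];
        read_idx = (read_idx + 1) % N_PERIODS;
        pending--;
        return p;
    }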

Drawbacks of multiple buffers

Let's consider a simple case:

  • sampling frequency: 44.1 kHz
  • stereo signal
  • 2 input buffers (a.k.a. ping pong)
  • 2 output buffers
  • buffer time = 10 ms

From the moment the DMA inserts the first audio sample into an input buffer to the moment the CPU is able to process it, we have a latency of 10 ms. From the moment the CPU writes the output sample to an output buffer to the moment the output DMA streams it out, we have another 10 ms.

Therefore, in this configuration we have an overall latency of 20 ms from input signal to output signal.

When input and output buffers increase in number the situation gets even worse, as the total latency becomes:

Latency = (PERIOD_TIME) * (N_INPUT_BUFFERS - 1) + (PERIOD_TIME) * (N_OUTPUT_BUFFERS - 1)
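For example, keeping the 10 ms period but moving to 3 input and 3 output buffers gives 10 × (3 - 1) + 10 × (3 - 1) = 40 ms.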

20 ms may not sound like much, but if we are implementing a realtime audio application these delays can become very noticeable. Imagine, for instance, an application which processes a singer's voice in realtime and applies some digital filters: the output signal will be delayed by a small but clearly audible amount.

The only solutions are to keep the number of buffers to a minimum (fewer than 2 simply won't work) and to keep the buffer time as short as possible.

This requires:

  • a powerful CPU, and
  • a system which never gets too busy doing something else, or
  • a system with realtime extensions which guarantee that a high-priority task like audio processing never gets delayed by anything else. This may be difficult to fully ensure anyway, because other components in the platform can interfere and cause resource starvation (imagine a parallel process causing a lot of DMA I/O, like accessing an SSD drive, or a lot of network traffic on a high-speed interface: this can choke the memory bus, causing delays which can't be controlled by software).

Audio flow in Linux terms

We have seen many simple concepts which define how audio data are captured, organised, and streamed out. Let's see how these concepts are represented in Linux.

Audio processing in Linux is managed by a framework called ALSA (Advanced Linux Sound Architecture). So more specifically we are going to look at those concepts in ALSA terms.

The general concepts map to ALSA terms as follows:

  • An individual captured value from a single channel (or an output value to a single channel) is called a sample.
  • Samples are grouped into frames. A frame consists of 1 sample in case of mono audio, 2 samples in case of stereo audio, and so on.
  • Frames are grouped into periods. A period is the entity which, when filled, will trigger an interrupt. In our previous example, a period contains 441 frames (44100 Hz * 10 ms)
  • Periods are grouped into a buffer. In our previous ping-pong example, the buffer contains two periods.

Please notice the discrepancy in terms between the more "physical" view we have seen before and the ALSA terms.

What we called the "dual buffer" or "ping pong buffer" solution should be called, in ALSA terms, a "dual period" setup, as the term "buffer" refers to the container which holds both periods.
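As a small preview of the next article, here is a sketch of how these quantities are requested from alsa-lib through its hw_params API. The calls are the standard ones, but the device name "default", the fixed S16_LE format and the complete lack of error checking are simplifications of mine; a real program must test every return value.

    #include <alsa/asoundlib.h>

    int configure_capture(snd_pcm_t **pcm)
    {
        snd_pcm_hw_params_t *hw;
        unsigned int rate = 44100;
        snd_pcm_uframes_t period = 441;      /* 10 ms at 44.1 kHz */
        snd_pcm_uframes_t buffer = 882;      /* 2 periods         */

        snd_pcm_open(pcm, "default", SND_PCM_STREAM_CAPTURE, 0);

        snd_pcm_hw_params_malloc(&hw);
        snd_pcm_hw_params_any(*pcm, hw);
        snd_pcm_hw_params_set_access(*pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
        snd_pcm_hw_params_set_format(*pcm, hw, SND_PCM_FORMAT_S16_LE);
        snd_pcm_hw_params_set_channels(*pcm, hw, 2);
        snd_pcm_hw_params_set_rate_near(*pcm, hw, &rate, 0);
        snd_pcm_hw_params_set_period_size_near(*pcm, hw, &period, 0);
        snd_pcm_hw_params_set_buffer_size_near(*pcm, hw, &buffer);
        snd_pcm_hw_params(*pcm, hw);         /* commit the configuration */
        snd_pcm_hw_params_free(hw);
        return 0;
    }

With this configuration the driver raises an interrupt every 441 frames, which is exactly the "sufficient number of samples" event described earlier.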

[Figure: samples grouped into frames, frames into periods, and periods into the buffer]

In the picture above we can see a graphical representation of our example:

  • A sample contains an individual capture from one channel (left or right)
  • A frame contains a capture from all channels (left + right)
  • A period groups 441 frames (10 ms @ 44100 Hz = 441)
  • The buffer contains two periods

What's next?

In this article we have explored the basic principles of audio processing. These concepts apply to any audio processing platform, although the description became more Linux-specific toward the end.

In the next article we will investigate deeper into how these general concepts map to the ALSA world.

You can find the second part here:
