Perceptual audio encoding
Feb 1, 2002 12:00 PM, By Doug Irwin
Devices that use some type of perceptual audio encoder have become ubiquitous in radio and TV broadcasting, as well as in recording and production studios. For years, many of us have constructed graphs that depict the measured characteristics of a broadcast system (e.g., frequency response, distortion versus frequency, and so forth). Most of these graphs plot level on a logarithmic (dB) scale against frequency on a linear or logarithmic scale. While they generally look nice and are easy to produce, they don't present the data in a way that reflects how human hearing actually works.
In fact, when our auditory systems analyze sound, our brains do not treat the audio spectrum as a continuum at all. Our brains perceive the sounds through 25 distinct critical bands, each of which has a different bandwidth. At 100Hz, the bandwidth is about 160Hz; at 10kHz it is about 2.5kHz in width.
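One widely cited approximation for critical bandwidth as a function of center frequency is Zwicker and Terhardt's formula from the psychoacoustics literature. A minimal Python sketch (the formula and its constants come from that literature, not from any particular codec, and it gives only approximate figures):

```python
def critical_bandwidth_hz(f_hz: float) -> float:
    """Approximate critical bandwidth (Hz) at center frequency f_hz,
    using Zwicker and Terhardt's published approximation."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

# Bandwidth is roughly constant at low frequencies and widens with frequency:
for f in (100, 1000, 10000):
    print(f"{f:>6} Hz -> band about {critical_bandwidth_hz(f):5.0f} Hz wide")
```

Running this shows bands on the order of 100Hz at the low end growing to well over 2kHz at 10kHz, consistent with the widening described above.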
Figure 1. The masking effects in the frequency domain. A masker inhibits perception of coexisting signals below the masking threshold.
Although our hearing is very sensitive when detecting single sounds in isolation, it is not so sensitive when trying to perceive particular sounds in the presence of many others. One of the most important characteristics of human hearing is the masking effect. A loud tone in a particular critical band will mask (i.e., make inaudible) other softer tones in the same critical band, as shown in Figure 1. The ear simply will not perceive tones that are below the masking level; this masking level can be calculated from the frequency and level of a given tone.
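A toy calculation of that masking level can make the idea concrete. The sketch below converts frequency to the Bark scale (a standard psychoacoustic mapping in which each critical band spans roughly one Bark) and then assumes a masking threshold that falls off linearly on either side of the masker. The slopes and offset used here are rough textbook illustrations, not the carefully tuned tables of any real encoder:

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Standard Hz-to-Bark conversion (Zwicker/Traunmueller form)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def masked_threshold_db(masker_hz, masker_db, probe_hz,
                        down_slope=27.0, up_slope=10.0, offset=10.0):
    """Toy simultaneous-masking threshold (dB) at probe_hz caused by a
    tonal masker. Illustrative slopes: masking falls off faster toward
    lower frequencies than toward higher ones."""
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = up_slope if dz >= 0 else down_slope
    return masker_db - offset - slope * abs(dz)

def is_audible(probe_db, masker_hz, masker_db, probe_hz):
    """True if the probe tone rises above the masking threshold."""
    return probe_db > masked_threshold_db(masker_hz, masker_db, probe_hz)
```

With these illustrative numbers, a 60dB tone at 1.1kHz sitting next to a 90dB masker at 1kHz falls below the threshold and is inaudible, while the same 60dB tone moved to 5kHz, many critical bands away, is easily heard.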
In addition to masking in the frequency domain, the auditory system also has a masking effect in the time domain. A loud sound actually affects the perception of quieter sounds that occur not only after it but before it as well. A softer sound that occurs 15 milliseconds (for example) prior to a loud one will be masked by the louder sound. This effect is called backward masking. Not surprisingly, softer sounds that occur as much as 200 milliseconds after the loud sound will also be masked. This effect is known as forward masking. Figure 2 shows this effect.
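The temporal windows described above can be sketched as a simple test. The roughly 15ms backward and 200ms forward windows come from the text; the fixed 20dB margin is an illustrative simplification, since in reality a masker's influence decays gradually across the window rather than switching off:

```python
def temporally_masked(event_db, masker_db, delta_ms,
                      pre_ms=15.0, post_ms=200.0, margin_db=20.0):
    """Toy temporal-masking test. delta_ms is the quiet event's time
    relative to the loud masker (negative = before the masker).
    The event is masked if it falls inside the backward/forward
    window and is sufficiently quieter than the masker."""
    if -pre_ms <= delta_ms <= post_ms:
        return event_db < masker_db - margin_db
    return False
```

For example, a 50dB sound 10ms before a 90dB burst is masked (backward masking), while the same sound 300ms afterward lies outside the forward-masking window and is heard.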
Knowing about the characteristics of human hearing, and armed with ever more powerful DSP chips, design engineers have developed technology to greatly reduce the overall bit rate needed to represent audio as a digital data stream. Now let's take a look at data reduction in the historical context.
If you were to sample two channels of audio at a rate of 44.1kHz, with 16 bits of resolution, you would generate 1,411 kilobits of data every second. This amount of data far exceeds the transmission capability of the existing VHF FM band. Furthermore, you would generate more than 10MB of data in one minute, so disk storage quickly becomes an issue (even now).
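The arithmetic behind those figures is straightforward:

```python
def pcm_bits_per_second(sample_rate_hz: int, bits: int, channels: int) -> int:
    """Raw (uncompressed) PCM data rate in bits per second."""
    return sample_rate_hz * bits * channels

cd_rate = pcm_bits_per_second(44_100, 16, 2)
print(cd_rate)                 # 1,411,200 bits/s, i.e. about 1,411kb/s
print(cd_rate * 60 / 8 / 1e6)  # bytes per minute in MB: about 10.6MB
```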
Figure 2. The masking effects in the time domain. Masking occurs both before and after the masking signal.
The reality is that various data reduction methods were in common use long before the proliferation of perceptual audio coders. Bell Telephone's answer to data reduction was 8kHz sampling with eight bits of resolution. The most common method in broadcasting is the use of a 32kHz sampling rate instead of 44.1kHz. (Obviously this fits the analog model for FM stereo, which is itself data reduced in that the audio bandwidth is limited to 15kHz.) The second most common was a reduction in resolution (also the work of the telephone company) down to 12 or even 11 bits. With 32kHz sampling and 12 bits of resolution, the data rate drops to roughly half that of the original example.
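Checking that roughly 2:1 figure:

```python
cd_rate = 44_100 * 16 * 2       # 1,411,200 bits/s (CD-style PCM, stereo)
reduced = 32_000 * 12 * 2       # 768,000 bits/s (32kHz, 12-bit, stereo)
print(cd_rate / reduced)        # about 1.84, i.e. roughly a 2:1 reduction
```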
Aside from the heavy-handed methods just discussed, there are two other ways to reduce the data rate that are based on data redundancy and predictive coding. From a historical standpoint, these methods came next.
The final step in reducing the amount of data needed to represent an audio signal digitally is the removal of data that correspond to portions of the sampled audio that are deemed irrelevant. In a nutshell, this is the method used by perceptual audio coders.
Perceptual audio coders take advantage of the way human hearing works. Using DSP, the audio is divided into 32 bands. A perceptual model built into the algorithm then analyzes the contents of each of the bands to determine which of the sounds in a particular band are likely to be masked by the strongest sounds in the same band. The sounds that will be masked are then discarded. Additionally, the remaining sound in each of the bands is requantized so that the quantizing noise is just below the masking threshold.
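A highly simplified sketch of that per-band decision follows. Here the list of tone levels in one subband stands in for the filter bank's output, and a flat 12dB margin below the strongest tone stands in for the psychoacoustic model's computed masking threshold (a real encoder derives the threshold from tuned spreading functions, not a fixed margin). The bit count relies on the rule of thumb that each quantizer bit buys about 6dB of signal-to-noise ratio:

```python
import math

def band_bit_allocation(components_db, margin_db=12.0):
    """Toy per-subband analysis. components_db: tone levels (dB) found in
    one band. The strongest tone sets a masking threshold margin_db below
    itself; tones under the threshold are discarded, and the survivors are
    requantized with just enough bits (about 6dB of SNR per bit) that the
    quantizing noise sits at or below the masking threshold."""
    peak = max(components_db)
    threshold = peak - margin_db
    kept = [c for c in components_db if c >= threshold]
    bits = math.ceil((peak - threshold) / 6.02)
    return kept, bits

kept, bits = band_bit_allocation([80, 75, 60, 50])
print(kept, bits)  # the 60dB and 50dB tones are dropped; 2 bits suffice
```

The payoff is visible even in this toy: half the components in the band are discarded outright, and the rest need only a couple of bits instead of 16.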
Further data reduction is accomplished by temporal masking. The sampled audio is broken up into blocks, typically 10 milliseconds in length, and the blocks are then analyzed for temporal maskers. Taking advantage of both forward and backward masking, some of the audio data can be discarded because of the presence of temporal maskers.
MPEG Layers 1, 2 and 3 all use perceptual audio encoding, as do Dolby's AC-3 and Sony's ATRAC (used in MiniDisc). Another application of a perceptual audio coder is Ibiquity's PAC (Perceptual Audio Coder, developed by Lucent) used in its IBOC DAB scheme. PAC is also used by XM and Sirius for their satellite-delivered audio services.
While the phrase CD quality may now be as abused by marketers as the term digital itself, the effectiveness of perceptual audio encoding (along with 32kHz sampling and elimination of redundant bits) is remarkable when you consider the data rate reduction. Stereo digital audio at a 128kb/s data rate, while not CD quality to my ears, is nevertheless generally pleasing. Broadcast engineers need to maintain their golden ears and high standards, but they must also realize that 99 percent of radio listeners are more easily pleased than the golden-eared few.
Figures courtesy of Telos Systems.
Irwin is director of engineering for Clear Channel San Francisco.