Devices that use some type of perceptual audio encoder have
become ubiquitous in radio and TV broadcasting, as well as in recording
and production studios. For years, many of us have constructed graphs
that depict the measured characteristics of a broadcast system (e.g.,
frequency response, distortion versus frequency, and so forth). Most of
these graphs are drawn with a logarithmic scale on the range and a
linear, piece-wise smooth scale along the domain. While they generally
look nice and are easy to do, they don't express the data in a manner
that is completely relevant to the way human hearing works.
In fact, when our auditory system analyzes sound, the brain does
not treat the audio spectrum as a continuum at all. It perceives
sound through 25 distinct critical bands, each of
which has a different bandwidth. At 100Hz, the bandwidth is about
160Hz; at 10kHz it is about 2.5kHz in width.
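The growth of critical bandwidth with frequency can be sketched with the Zwicker & Terhardt (1980) analytic approximation. Note this is one published approximation, not a formula taken from this article; at low frequencies it yields a somewhat narrower bandwidth (about 100Hz at 100Hz) than the figure quoted above, and published estimates do vary.

```python
def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth (Hz) at centre frequency f_hz,
    using the Zwicker & Terhardt (1980) analytic approximation.
    Constants come from that published formula, not from this article."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Bandwidth widens with frequency: roughly 100 Hz wide at low
# frequencies, but well over 2 kHz wide near 10 kHz.
for f in (100, 1000, 10000):
    print(f, round(critical_bandwidth_hz(f)))
```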
Figure 1. The masking effects in the frequency
domain. A masker inhibits perception of coexisting signals below the
masking threshold.
Although our hearing is very sensitive when detecting single
sounds in isolation, it is not so sensitive when trying to perceive
particular sounds in the presence of many others. One of the most
important characteristics of human hearing is the masking
effect. A loud tone in a particular critical band will mask (i.e., make
inaudible) other softer tones in the same critical band, as shown in
Figure 1. The ear simply will not perceive tones that are below the
masking level; this masking level can be calculated based on frequency
and level of a given tone.
In addition to masking in the frequency domain, the auditory system
also has a masking effect in the time domain. A loud sound actually
affects the perception of quieter sounds not only after it but before
it as well. A softer sound that occurs 15 milliseconds (for example)
prior to a loud one will be masked by the louder sound. This effect is
called backward masking. Not surprisingly, softer sounds that
occur as much as 200 milliseconds after the loud sound will also be
masked. This effect is known as forward masking. Figure 2 shows
Knowing about the characteristics of human hearing, and armed with
ever more powerful DSP chips, design engineers have developed
technology to greatly reduce the overall bit rate needed to represent
audio as a digital data stream. Now let's take a look at data reduction
in the historical context.
If you were to sample two channels of audio at a rate of 44.1kHz,
with 16 bits of resolution, you would generate 1,411kb of data in just
one second. This amount of data far exceeds the transmission capability
of the existing VHF FM band. Furthermore, you would generate more than
10MB of data in one minute, so disk storage quickly becomes an issue.
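The arithmetic behind those figures is straightforward:

```python
# Uncompressed stereo PCM at CD parameters: 44.1 kHz sample rate,
# 16 bits per sample, 2 channels.
sample_rate = 44_100
bits_per_sample = 16
channels = 2

bits_per_second = sample_rate * bits_per_sample * channels
print(bits_per_second)             # 1,411,200 bits/s, i.e. about 1,411 kb/s

bytes_per_minute = bits_per_second * 60 // 8
print(bytes_per_minute)            # 10,584,000 bytes, well over 10 MB/minute
```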
Figure 2. The masking effects in the time domain.
Masking occurs both before and after the masking signal.
The reality is that various data reduction methods were in
common use long before the proliferation of perceptual audio coders.
Bell Telephone's answer to data reduction was 8kHz sampling with eight
bits of resolution. The most common one in broadcasting is the use of a
32kHz sampling rate as opposed to 44.1kHz. (Obviously this fits the analog
model for FM stereo, which itself is data reduced in that the audio
bandwidth is limited to 15kHz.) The second most common one was the
reduction in resolution (also the work of the telephone company) down
to 12 or even 11 bits. With 32kHz sampling and 12 bits of resolution,
about a 2:1 reduction in data over the original example is achieved.
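That ratio can be checked directly:

```python
# Per-channel bit rates: CD-style sampling versus the reduced
# broadcast-style parameters described above.
original = 44_100 * 16      # bits/s at 44.1 kHz, 16-bit
reduced  = 32_000 * 12      # bits/s at 32 kHz, 12-bit

ratio = original / reduced
print(round(ratio, 2))      # 1.84, i.e. roughly a 2:1 reduction
```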
Aside from the heavy-handed methods just discussed, there are two
other ways to reduce the data rate that are based on data
redundancy and predictive coding. From a historical
standpoint, these methods came next.
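As a minimal sketch of the predictive-coding idea (a simple DPCM round trip, invented here for illustration and not any broadcast standard): successive audio samples are highly correlated, so storing the difference from the previous sample usually needs fewer bits than storing the sample itself.

```python
def dpcm_encode(samples):
    """Encode samples as differences (residuals) from the previous
    sample. Correlated audio yields small residuals, which can be
    represented with fewer bits than the raw sample values."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def dpcm_decode(residuals):
    """Rebuild the original samples by accumulating the residuals."""
    prev = 0
    samples = []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

print(dpcm_encode([0, 3, 2, 5]))   # [0, 3, -1, 3]
```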
The final step in reducing the amount of data needed to represent an
audio signal digitally is the removal of data that correspond to
portions of the sampled audio that are deemed irrelevant. In a
nutshell, this is the method used by perceptual audio coders.
Perceptual audio coders take advantage of the way human hearing
works. Using DSP, the audio is divided into 32 bands. A perceptual
model built into the algorithm then analyzes the contents of each of
the bands to determine which of the sounds in a particular band are
likely to be masked by the strongest sounds in the same band. The
sounds that will be masked are then discarded. Additionally, the
remaining sound in each of the bands is requantized so that the
quantizing noise is just below the masking threshold.
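The frequency-domain step can be caricatured as follows. This is a toy sketch, not a real codec's psychoacoustic model: the 24dB margin and the sample levels are invented for illustration, and a real coder derives the masking threshold from the masker's frequency and level rather than a fixed offset.

```python
def apply_masking(components_db, margin_db=24.0):
    """Toy frequency-domain masking within ONE critical band.

    components_db: levels (dB) of tones sharing a critical band.
    The loudest tone acts as the masker; any tone more than
    margin_db below it is treated as inaudible and discarded.
    The fixed 24 dB margin is an illustrative assumption only."""
    masker = max(components_db)
    threshold = masker - margin_db
    kept = [lv for lv in components_db if lv >= threshold]
    dropped = [lv for lv in components_db if lv < threshold]
    return kept, dropped

kept, dropped = apply_masking([80.0, 70.0, 52.0, 40.0])
# The 80 dB masker sets a 56 dB threshold: the 70 dB tone survives,
# while the 52 and 40 dB tones are discarded as inaudible.
```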
Further data reduction is accomplished by temporal masking.
The sampled audio is broken up into blocks, typically 10 milliseconds
in length, and the blocks are then analyzed for temporal maskers.
Taking advantage of both forward and backward masking, some of the
audio data can be discarded because of the presence of temporal maskers.
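The temporal step can likewise be sketched on 10ms block levels. Again a toy under stated assumptions: the 20dB margin is invented, and the windows (one block back, 20 blocks forward) roughly correspond to the ~15ms backward and ~200ms forward masking spans quoted above.

```python
def temporally_masked(block_levels_db, margin_db=20.0,
                      back_blocks=1, fwd_blocks=20):
    """Toy temporal masking over 10 ms blocks.

    A block is flagged as masked if a much louder block occurs up to
    back_blocks AFTER it (backward masking, ~15 ms) or up to
    fwd_blocks BEFORE it (forward masking, ~200 ms). The 20 dB
    margin and window sizes are illustrative assumptions."""
    n = len(block_levels_db)
    masked = [False] * n
    for i, level in enumerate(block_levels_db):
        lo = max(0, i - fwd_blocks)        # earlier loud block -> forward masking
        hi = min(n, i + back_blocks + 1)   # later loud block -> backward masking
        for j in range(lo, hi):
            if j != i and block_levels_db[j] - level >= margin_db:
                masked[i] = True
                break
    return masked

# Quiet blocks on either side of the 90 dB block are masked;
# the 75 dB block is within 20 dB of the masker and survives.
print(temporally_masked([30, 90, 35, 75]))
```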
MPEG Layers 1, 2 and 3 all use perceptual audio encoding, as do
Dolby's AC-3 and Sony's ATRAC (used on MiniDisc). Another application of
a perceptual audio coder is Ibiquity's PAC (Perceptual Audio Coder,
developed by Lucent) used in its IBOC DAB scheme. PAC is also used by
XM and Sirius for their satellite-delivered audio services.
While the phrase "CD quality" may now be as abused by marketers
as the term "digital" itself, the effectiveness of perceptual
audio encoding (along with 32kHz sampling and elimination of redundant
bits) is remarkable when you consider the data rate reduction. Stereo,
digital audio at a 128kb/s data rate, while not CD quality to my ears,
is nevertheless generally pleasing to the ear. Broadcast engineers need
to maintain their golden ears and high standards, but it must also be
realized that 99 percent of radio listeners are more easily pleased
than the golden-eared listeners.
Figures courtesy of Telos Systems.
Irwin is director of engineering for Clear Channel San