Tuesday, January 12, 2010

The AudioFile: Understanding MP3 compression

From anonymity to ubiquity



Since its standardization in 1991, MP3 has gone from being a little-known portion of a video file format to the kind of ubiquity that most brands can only dream of having. It's both widespread, with small players flying off the shelves, and controversial, dropping from the lips of politicians and advocates for all sides of the intellectual property debate.
But what is MP3? The usual explanations take one of two forms. The long version, available in technical papers, is written in jargon and filled with math. The short version, often used by newspapers and nontechnical periodicals, simply states that the process eliminates parts of sound not normally heard by the human ear. But this one-sentence description raises more questions than it answers for any reasonably tech-savvy reader: how does it find those unheard sounds, and how does it get rid of them? What's the difference between the different bit rates and quality levels? If you're anything like me, you've often wanted to know the mechanics of MP3, but not to the point of writing your own encoder.
This guide attempts to explain the process of MP3 compression in simple terms, without oversimplifying it. Although some parts have been omitted, like the details of stereo encoding schemes and in-depth file composition, it covers the basic theory of turning uncompressed sound files into compressed MP3. In order to tour the MP3 codec without getting overwhelmed by the technical minutiae, we'll take a look at some of the background principles and legacy of MP3, then break the process down into analysis and compression before finally considering the impact that this humble format has had on digital audio.

Hear, hear

Depending on the number of concerts you've attended, your ears may be more or less healthy for your age. But even if they're in perfect shape, human hearing is constrained by a number of limitations. At best, tests have usually shown that we can hear frequencies in a range from 20 to 20,000Hz. Our ears are also most sensitive between 2KHz and 5KHz, and they can detect changes between frequencies in increments of 2Hz, which is the effective "resolution" of hearing. As the average person gets older or the delicate cells of the ear are damaged by loud noise, high-frequency perception is reduced. In fact, most adults (myself included) have trouble hearing above 16KHz.
And these are just the physical limitations of the human ear. Our brains also play a role in filtering and analyzing the signals sent by the auditory nerve. The science of how we perceive sound is called psychoacoustics, and it has discovered a number of useful auditory effects. For example, one of my favorites is the Haas effect, which states that two identical sounds arriving within 30-40ms of each other from different directions will be perceived as a single sound coming from the direction of the first. It's often used in public address systems to reinforce the sound "from the stage," even if the loudspeakers are located farther to the side. MP3, like many other lossy audio compression schemes, relies heavily on these kinds of psychoacoustic effects to work its magic. In particular, it exploits the phenomenon of frequency masking.
Imagine two sounds with similar frequency profiles—say at 100Hz and 110Hz—but with different volume levels. If played by itself, the weaker sound is perfectly audible, but only the stronger will be heard if both are played simultaneously. The process of covering one frequency with another close (but not identical) frequency is called "masking." The degree to which frequencies can mask each other varies across the range of human hearing—our ears are less precise at the top and bottom of the audible spectrum. Loud transient signals (ones with very short duration) can also mask weaker signals for a short time, similar to the Haas effect. This type of masking is known as "temporal" masking and is also used in MP3 compression.

Leftovers


Something else to keep in mind while looking at the techniques of MP3 is that it continues a compression legacy that has influenced its design. MP3 actually stands for "MPEG-1 Audio Layer 3." MPEG, in turn, stands for the Moving Picture Experts Group, which created the standard. MPEG video (and its successors, MPEG-2 and MPEG-4) is used all around us: DVDs use a modified version of MPEG-2, as does your digital TV signal.

As the name suggests, MP3 was preceded by two earlier audio layers in the MPEG-1 specification, neither of which caught on in the consumer market (few of us listen to MP2s at home). There are several features of MP3 which may seem to be pointlessly complicated or are implemented in more steps than would seem strictly necessary, and these are often holdovers from the old design. This legacy means that MP3 is not actually terribly elegant or streamlined.
Which is a great excuse for me as an author, honestly. So if you have trouble following the process laid out in this article, don't blame me for a poor explanation. Blame Layer 2 instead.

Analyze this!

So let's start with a regular sound file stored in uncompressed PCM audio. We'll assume for the sake of argument that both it and the MP3 are going to operate using the same sample rate—MP3 can support several rates but typically uses CD-standard 44.1KHz.
The first step is to group these samples into "frames," each of which contains 1152 samples. Why 1152? It's another accommodation to backwards compatibility with Layer 2. Technically, Layer 3 frames are split into two "granules" of 576 samples. This is kind of a kludge and one that was simplified later: when the MPEG-2 video standard was created, its audio layer used only one of these granules per frame. For the purposes of encoding, MP3 really only acts on a single granule at a time, although it may use parts of the previous and following granules in order to get a wider view of change over time.
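To make the frame and granule bookkeeping concrete, here is a minimal sketch in Python. The 1152 and 576 figures come from the text above; the function name and the zero-padding of the final frame are my own assumptions for illustration, not something a particular encoder is known to do.

```python
SAMPLES_PER_FRAME = 1152     # one Layer 3 frame, per the text
SAMPLES_PER_GRANULE = 576    # half a frame, per the text
SAMPLE_RATE = 44100          # CD-standard rate assumed in the text


def split_into_granules(samples):
    """Group PCM samples into frames, each a pair of 576-sample granules.

    The last frame is padded with zeros (an assumption made for this sketch).
    """
    frames = []
    for start in range(0, len(samples), SAMPLES_PER_FRAME):
        frame = list(samples[start:start + SAMPLES_PER_FRAME])
        frame += [0] * (SAMPLES_PER_FRAME - len(frame))
        frames.append((frame[:SAMPLES_PER_GRANULE], frame[SAMPLES_PER_GRANULE:]))
    return frames


one_second = [0] * SAMPLE_RATE            # one second of silence
frames = split_into_granules(one_second)
frame_ms = 1000 * SAMPLES_PER_FRAME / SAMPLE_RATE
print(len(frames), "frames, each about", round(frame_ms, 1), "ms long")
```

At 44.1KHz, a frame works out to roughly 26 milliseconds of audio, so one second of sound needs a little over 38 frames.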
Now those samples are run through a filterbank that divides the sound into a set of 32 frequency ranges (in other audio applications, we call each of these a "bandpass filter," since it lets only a specific band of frequencies through). This is another concession to Layer 2, which actually used those 32 values for its encoding. One of the advantages of Layer 3 is that it subsequently subdivides each of those 32 frequency bands into 18, creating 576 (32 x 18) smaller, adaptive bands. Each of these bands, therefore, contains 1/576th of the frequency range from the original samples.
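The band arithmetic is easy to check for yourself. This small sketch assumes the bands are all equally wide and that the usable range runs from 0Hz up to half the sample rate; the names are mine, chosen for illustration.

```python
NYQUIST_HZ = 44100 / 2       # highest frequency a 44.1KHz sample rate can carry
SUBBANDS = 32                # the Layer 2 style filterbank
LINES_PER_SUBBAND = 18       # Layer 3's further subdivision

total_bands = SUBBANDS * LINES_PER_SUBBAND    # 576
subband_width_hz = NYQUIST_HZ / SUBBANDS      # width of each of the 32 ranges
band_width_hz = NYQUIST_HZ / total_bands      # width of each of the 576 bands


def band_edges(index):
    """Lower and upper frequency (in Hz) of one of the 576 bands."""
    return index * band_width_hz, (index + 1) * band_width_hz


print(total_bands, "bands of about", round(band_width_hz, 1), "Hz each")
print("band 100 covers", band_edges(100))
```

Under those assumptions each of the 32 ranges is about 689Hz wide, and each of the 576 bands only about 38Hz.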
At this stage, a set of two parallel processes takes place: the Modified Discrete Cosine Transform (MDCT) and Fast Fourier Transforms (FFT). The math for these is complicated, but their functions can be explained without having to show our work.
The FFTs are used as analysis functions, turning each frequency band into information that can be fed into the encoder's psychoacoustic model—a kind of virtual human ear. The encoder uses that model to answer a few questions: are there sounds in each band below the masking threshold (they will be hidden by louder sounds at close frequencies)? Is the audio fairly constant, or does it change? Are there any sharp transient sounds that need to be preserved and which might mask other transients just before or after? This information will be used during the compression to figure out which information can be safely discounted since (according to the masking behavior of the psychoacoustic model) our ears would ignore it anyway.
Before going into the MDCT on the other side of the parallel process, the samples are sorted into different "window" patterns based on whether they contain steady or transient sound. MP3 allows frequency bands to be described using either one long window or three short windows. Constant noise without much change over time can be expressed using the long window. Transient noises, like drum hits or vocal consonants, are described across three short windows (each containing 192 samples, or about 4 milliseconds).
The MDCT turns each windowed band into a set of spectral values. Unlike the initial audio, which represents sound as the position of a waveform over regularly collected samples, spectral analysis looks at sound as energy across the range of frequencies.
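For the curious, the MDCT itself fits in a few lines if you don't mind it being slow. This is a textbook-style sketch, not the code from any real encoder: it skips the windowing step described above and uses the direct formula rather than a fast algorithm. The 36-samples-in, 18-values-out block size reflects my understanding of Layer 3's long blocks and should be read as an assumption.

```python
import math


def mdct(block):
    """Direct (slow) MDCT: 2N time samples in, N spectral values out."""
    n_out = len(block) // 2
    return [
        sum(
            x * math.cos(math.pi / n_out * (n + 0.5 + n_out / 2) * (k + 0.5))
            for n, x in enumerate(block)
        )
        for k in range(n_out)
    ]


N = 18
# A test tone shaped exactly like the transform's own basis function number 3.
tone = [math.cos(math.pi / N * (n + 0.5 + N / 2) * (3 + 0.5)) for n in range(2 * N)]
spectrum = mdct(tone)
peak_bin = max(range(N), key=lambda k: abs(spectrum[k]))
print("strongest spectral line:", peak_bin)
```

The waveform goes in as 36 positions over time and comes out as 18 energy values, with nearly all of the energy landing in the one line that matches the tone.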




In this spectral view of a sound file, frequencies with more energy are shown as brighter patches. The lowest frequencies are at the bottom, and the highest at the top. Time moves from left to right.

Because spectral information bears more of a resemblance to the way our hearing interprets audio, many compressed audio encoders use it to remove psychoacoustically irrelevant information instead of operating on the sampled waveform. Once the MDCT finishes with its math, the MP3 process has 576 "frequency bins" to work with, each containing the spectral intensity for 1/576th of the total frequency range.
Now that the encoder has both the spectral information and the psychoacoustic analysis of the granule, it starts the actual compression process.

Cold compress

MP3 relies on two layers of compression, only one of which utilizes the psychoacoustic analysis. Psychoacoustic compression is best at reducing complicated sounds with lots of mixed components, because they provide plenty of masking opportunities. Simpler sounds do not benefit much from the psychoacoustic effects—but they can be easily compressed using more traditional data techniques. Combining both approaches requires a two-step process of quantization and Huffman coding, both of which feed back on each other to provide MP3 with its impressive bandwidth flexibility.
The word "quantization" sounds complicated, but what it boils down to is the process of assigning a numerical value to something—giving it a quantity, in other words. "But we already have numbers," you might protest. "In fact, that's all we've been using all along!" This is true. But MP3 wouldn't be much of a compression method if it just converted some numbers wholesale from one form to another.
Instead, the 576 post-MDCT frequency bins are sorted into 22 scalefactor bands. By dividing the values within the band by a given number (the quantizer) and rounding, a smaller approximation of the original is reached, but some information is lost during the rounding step. The quantization is the same across the entire frequency spectrum, but individual bands can be scaled up or down (hence the scalefactor) for more or less precision. Bigger numbers lose less information from the quantization process, while smaller numbers are more likely to suffer from rounding error. During the decoding, this scaling will be reversed so that the signal isn't any louder or softer.
The job of all those Fast Fourier Transforms (FFTs) in the psychoacoustic model is to be able to specify how much precision is needed in a given scalefactor band. If there are weak signals that will be masked by stronger sounds in the band, the signal can be scaled down so that those signals are effectively truncated off. Unfortunately, the rounding errors also have audible effects (noise, basically), and the more a signal is reduced, the greater that noise will be when it is scaled back to its normal amplitude. So the second task of the psychoacoustic model is to check whether the signal-to-noise ratio has become perceptually unfavorable. If so, the encoder goes back to the quantization step and increases the scalefactor, increasing precision while reducing noise.
This can be confusing, so let's use another hypothetical example. Let's say that our uncompressed scalefactor band information is represented by the number 12,592. We might quantize it by dividing by the number 100. So with a scalefactor of 1.0 (no change), we store this number as 126, and when we restore it during uncompression (multiplying it by the same quantization factor), we'll end up with 12,600. We sacrificed some precision and added a bit of noise—our number differs from the original by 8—but it's pretty close.
However, if we were willing to put up with a little less precision and a little more noise, we could scale our original input by .1 to 1,259. Now it is quantized down to 13. In restoring the number, we undo both the quantization and the scale factor, ending up with a value of 13,000 (13 * 100 / .1 = 13,000). Now we're off by a bit more (408 instead of 8), but maybe not enough to notice, depending on the context. And those shorter storage values take up less room in a file, particularly when combined with the next stage in the process.
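The two hypothetical cases above can be replayed in a few lines of Python. The function names are mine, and a real encoder's quantizer is more involved than a plain divide-and-round, so treat this only as a sketch of the arithmetic in the example.

```python
def quantize(value, quantizer, scale=1.0):
    """Scale the value, divide by the quantizer, and round to a whole number."""
    return round(value * scale / quantizer)


def dequantize(stored, quantizer, scale=1.0):
    """Undo the quantizer and the scale factor (rounded back to a whole number)."""
    return round(stored * quantizer / scale)


original = 12592
for scale in (1.0, 0.1):
    stored = quantize(original, 100, scale)
    restored = dequantize(stored, 100, scale)
    print("scale", scale, "stores", stored, "restores", restored,
          "error", abs(restored - original))
```

With a scale of 1.0 the stored value is 126 and the error is 8; with a scale of .1 the stored value shrinks to 13 and the error grows to 408.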
At the same time that it is performing quantization and scaling, an MP3 encoder is also using Huffman coding to turn the information in the scalefactor band into shorter binary strings. Huffman coding (named for the MIT Ph.D. candidate who developed it in 1952) is kind of like playing "twenty questions" in binary. Starting at the top of a tree containing all the possible answers, the computer moves down through the branches, each of which is identified by either a 1 or a 0 (binary numbers). Answers are located at the end of a branch, so once the computer reaches a valid result, it can stop moving through the tree.




The most common answers are placed close to the top (as in the illustration, which codes the seven most common letters in English). The advantage of Huffman coding is that because the algorithm stops moving down the tree as soon as it finds an answer, common answers can be turned into very short numbers (in the hypothetical example above, if the code starts with a 1 instead of a zero, we write 'E' and immediately move on to the next sequence).
These codes can be shown visually as a tree. But to a computer program, they're simply a table, and once the computer finds a match, it moves on.





Letter   Code
E        1
T        00
A        0111
O        0110
N        0101
R        01001
I        01000
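Using the letter codes listed above, the "table lookup" view of Huffman coding can be sketched like this. The codes are the ones from the illustration; the function names and the sample word are my own.

```python
CODES = {
    "E": "1", "T": "00", "A": "0111", "O": "0110",
    "N": "0101", "R": "01001", "I": "01000",
}
DECODE = {bits: letter for letter, bits in CODES.items()}


def encode(text):
    """Replace each letter with its code and run the codes together."""
    return "".join(CODES[ch] for ch in text)


def decode(bits):
    """Read bits one at a time; emit a letter as soon as the bits match a code."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


message = "ENTENTE"
bits = encode(message)
print(bits, "is", len(bits), "bits; fixed 3-bit codes would need", 3 * len(message))
print(decode(bits))
```

No separators are needed between letters because no code is the beginning of another code. The savings depend on the input, though: a word full of rare letters like R and I would come out longer than a fixed-width encoding.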


MP3 uses a number of Huffman tables to encode quantized values, and can choose different tables to represent different scalefactor bands. Smaller numbers (those which are less precise, in other words) are located at the top of the tables. Using these Huffman tables, the encoder adjusts its quality to match the chosen bit rate. If, at the end of the quantization step, it finds that the block of coded bits is longer than the allotted bit rate for that granule, it once again goes back and adjusts the scalefactor bands for lower precision (thus yielding smaller numbers that translate to shorter Huffman codes). On the other hand, if the encoder comes in under the allowed bit rate, it can go back and amplify selected frequencies for higher precision during quantization, hopefully filling out the granule.
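The "go back and try again" half of that loop can be sketched as follows. Everything here is a simplification of my own: the bit-cost function is a made-up stand-in for the real Huffman tables, a single quantizer step is used for all values, and the psychoacoustic noise check is left out entirely.

```python
def estimated_bits(quantized):
    """Hypothetical stand-in for a Huffman table lookup: small values cost fewer bits."""
    return sum(1 if q == 0 else 2 * abs(q).bit_length() + 1 for q in quantized)


def fit_to_budget(values, bit_budget, step=1.0, growth=1.25, max_iterations=200):
    """Coarsen the quantizer step until the coded values fit the bit budget."""
    for _ in range(max_iterations):
        quantized = [round(v / step) for v in values]
        bits = estimated_bits(quantized)
        if bits <= bit_budget:
            return step, quantized, bits
        step *= growth          # less precision, smaller numbers, shorter codes
    raise RuntimeError("could not meet the bit budget")


values = [12592, -830, 415, 97, -12, 3, 0, 0]   # made-up spectral values
step, quantized, bits = fit_to_budget(values, bit_budget=40)
print("step", round(step, 2), "gives", quantized, "in", bits, "bits")
```

A lower bit budget forces a bigger step, which is the trade between bit rate and precision described above.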
That's basically all it takes to encode a granule. Each granule contains two parts required for reconstructing the audio: the scalefactors for each band and the long chunk of Huffman bits. But as we noted earlier, MP3 is based on frames, each of which contains two granules. After completing two granules, the encoder combines them into a single frame for transmission. MP3 files are very simply constructed, with many frames of audio data following the ID3 metadata (containing all the useful information like artist name, song title, and genre) at the front of the file.
As it turns out, the frame also contains a lot of other helpful information in its header segment. For example, there's a synchronization word that starts each frame, which is why decoders can still play a broken or partially damaged file. After the sync word, the frame also includes a bit rate code, enabling some frames to use higher bit rates if they contain more information (these are the Variable Bit Rate, or VBR, files that you may have seen). Finally, there's information in the frame header about the file format (is it stereo or mono?), its sample rate, and whether or not the file contains copyrighted material.
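Here is a rough sketch of pulling those fields out of a four-byte frame header. The bit positions and lookup tables follow the commonly documented MPEG-1 Layer 3 header layout as I understand it; the article doesn't spell them out, so treat them as assumptions and check them against the specification before relying on them. The sample header bytes are just an example value.

```python
# Lookup tables for MPEG-1 Layer 3 (assumed from the commonly documented layout).
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_header(four_bytes):
    """Decode a few fields from the first four bytes of a frame."""
    word = int.from_bytes(four_bytes, "big")
    if word >> 21 != 0x7FF:                      # sync word: eleven 1 bits
        raise ValueError("no sync word")
    if (word >> 19) & 0b11 != 0b11 or (word >> 17) & 0b11 != 0b01:
        raise ValueError("not an MPEG-1 Layer 3 frame")
    return {
        "bitrate_kbps": BITRATES_KBPS[(word >> 12) & 0xF],
        "sample_rate_hz": SAMPLE_RATES_HZ[(word >> 10) & 0b11],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
        "copyright": bool((word >> 3) & 1),
    }


header = parse_header(bytes([0xFF, 0xFB, 0x90, 0x64]))
print(header)
```

Because every frame carries its own header, a player that loses its place can scan forward for the next run of sync bits and pick up again from there.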
Now that we've covered the process of turning digital sound into MP3, you may be wondering how it gets back out. Although the structure of a decoder is not exactly the reverse of its encoding counterpart, it is close. For decoding, the Huffman bits are turned back into quantized information, scaled back to the original level, and then recombined with one another into PCM samples that can be played by the average sound card.



Sea change


The move to MP3 has had an almost unparalleled impact on digital audio. Not only has it become the force behind those ubiquitous earbuds seen on many a sidewalk, but MP3 was also a prime factor in the debate over intellectual property and piracy. Sure, geeks may once have traded music in large WAV or AIFF files through newsgroups, but it was Napster's easy MP3 distribution system that became one of the trademark skirmishes between the music industry and consumers. Simultaneously, MP3 is an enabling factor behind the new online distribution systems that have democratized (to some extent) music and radio through MySpace and the universe of podcasts.
Is it possible that these events would have taken place without MP3? Almost certainly. Eventually, someone would have popularized another method of making audio files smaller, or increasing bandwidth might have negated the problem. Peer-to-peer networks might have been delayed in their development, but they would have still appeared to take advantage of digital reproduction. That said, it's likely that the format that took the place of MP3 would have used many of the same basic principles, including psychoacoustic modeling and data compression.
Indeed, it's striking how many of the crucial elements of MP3 date back to theoretical work more than half a century old. Huffman encoding dates back to 1952. Claude Shannon's groundbreaking work on lossy compression and signal-to-noise ratios was written during the Second World War. Research into the limitations of human hearing and perception, such as the Haas effect and equal-loudness contours, is even older. Harry Nyquist's sampling law was originally devised in 1927 and was meant to describe telegraph transmissions! In many ways, MP3 is a very old technology being applied in a modern form.
And yet MP3 has managed to maintain its foothold in consumer audio, even as other (technically superior) formats have been introduced. Although newer kinds of files, both proprietary and open, have become more common (particularly with the advent of digital rights management), MP3 remains the standard for audio on the Internet, and that shows no real signs of changing. Although it may be disdained by lossless and analog aficionados for what it discards, it has changed the way that we think about music and audio in a very fundamental way.
