During a recent conference call discussing audio sampling rates, the question came up: Why do CDs use a sampling rate of 44.1 kHz?

First, a little background: When you sample an audio waveform, you have a choice as to how many samples you take per second. Over the years, a number of standards have developed; in digital media used for entertainment purposes, the two most common sampling frequencies are 44.1 kHz and 48 kHz.

As a video guy, I think of 48 kHz as the “natural” choice for audio sampling. It is the frequency used in most digital television applications, including DVD and HDTV. It is also exactly six times the 8 kHz sampling rate used in telephony, so conversions between the two are relatively straightforward.

But most music is sampled at 44.1 kHz, because this is the standard used for CD audio. The question we were asking in my conference call was: Why were CDs standardized around this sampling frequency?

Although we think of 44.1 kHz as an audio standard, the great Internet Oracle says that this magic number was actually derived from the early use of video recorders to record audio. Evidently creating a recorder capable of recording at around 1.4 Mbps—the data rate of uncompressed digital audio—was a difficult feat back in the day, so engineers of that time repurposed analog video recorders in order to record digital audio. If you modulate a digital audio stream in such a manner that you encode three samples of audio on every visible line of video, then you can record audio in real time on a VCR if you sample at exactly 44.1 kHz—or so the story goes.

The math from the FAQ linked above works, with some caveats. Take this excerpt from Digital Interface Handbook (Francis Rumsey and John Watkinson, Third Edition, p. 53):

“In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame, or 245 lines per field for samples. If three samples are stored per line, the sampling rate becomes 60 × 245 × 3 = 44.1 kHz.…

“The sampling rate of 44.1 kHz came to be that of the Compact Disc. Even though CD has no video circuitry, the equipment used to make CD masters was originally video based and determined the sampling rate.”

This sounds good, except that NTSC video actually runs at 29.97 frames per second, which makes the field rate come out to 59.94 Hz instead of 60. And 59.94 × 245 × 3 = 44,055.9 Hz, so it’s off by just a little. (The math works exactly for PAL video.) But I’m willing to assume that there was a way for engineers to jury-rig the VCR to run at exactly 30 fps, and not 29.97, and then the math would come out correctly. (If you can shed more light on this discrepancy, please leave a comment on this post. I’d love to get to the bottom of it.)
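The arithmetic here is easy to check in a few lines of Python. This is only a sketch: the function name `pcm_adaptor_rate` is made up for illustration, the 60 Hz line count comes from the excerpt above, and the PAL figures (50 fields per second, 294 usable lines per field) are an assumption about the 625-line case rather than something quoted in this post.

```python
from fractions import Fraction

def pcm_adaptor_rate(field_rate_hz, lines_per_field, samples_per_line=3):
    """Audio sampling rate (Hz) if each usable video line carries a fixed number of samples."""
    return field_rate_hz * lines_per_field * samples_per_line

# Nominal 60 Hz timing, using the excerpt's 245 usable lines per field.
ideal_60hz = pcm_adaptor_rate(60, 245)

# NTSC color timing: the field rate is exactly 60000/1001 Hz (about 59.94).
ntsc_color = pcm_adaptor_rate(Fraction(60000, 1001), 245)

# PAL timing (assumed here, not quoted in the post): 50 fields/s, 294 usable lines per field.
pal = pcm_adaptor_rate(50, 294)

print(ideal_60hz)                      # 44100
print(float(ntsc_color))               # roughly 44055.94
print(pal)                             # 44100
print(float(ideal_60hz - ntsc_color))  # shortfall in Hz, roughly 44.06
```

Using `Fraction` for the NTSC field rate keeps the result exact, which makes it clear that the shortfall of about 44 Hz is real and not a rounding artifact.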

Incidentally, this math implies that each line of video carried three sample periods of audio, presumably stored as black-and-white pixels. We’re talking about stereo audio here, and presumably even in the 1970s people were sampling at 16 bits per sample, so each line would have carried 3 samples × 2 channels × 16 bits = 96 bits, or 12 bytes.
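As a sanity check on those numbers, and on the roughly 1.4 Mbps data rate mentioned earlier, here is a minimal sketch. It assumes 16-bit stereo PCM, as above, and ignores any error-correction or sync overhead a real recorder would add; the constant names are just for illustration.

```python
SAMPLE_RATE_HZ = 44_100
CHANNELS = 2             # stereo
BITS_PER_SAMPLE = 16     # assumed, as in the text
SAMPLES_PER_LINE = 3     # sample periods carried on each video line

# Payload carried by one line of video.
bits_per_line = SAMPLES_PER_LINE * CHANNELS * BITS_PER_SAMPLE
bytes_per_line = bits_per_line // 8

# Raw data rate of the audio stream, with no overhead included.
bits_per_second = SAMPLE_RATE_HZ * CHANNELS * BITS_PER_SAMPLE

print(bits_per_line, bytes_per_line)   # 96 12
print(bits_per_second / 1e6)           # 1.4112 (Mbps)
```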

The whole story reminds me of that email—possibly apocryphal—about how the width of the space shuttle rocket booster is related to Roman war chariots.

But maybe the interesting question isn’t why CDs use a 44.1 kHz sampling rate, but rather why digital video uses 48 kHz. The reason this seems like an interesting question is that there’s less data to compress at lower sampling frequencies. Specifically, 44.1 kHz sampling leads to about 8 percent fewer bytes before compression than 48 kHz does. So you’d expect 44.1 kHz audio to be more widely used in digital video, because it should be able to deliver the “CD experience” at a lower overall data rate.
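The 8 percent figure follows directly from the ratio of the two rates. A quick sketch, assuming 16-bit stereo for the bytes-per-second numbers (the constant names are illustrative only):

```python
from fractions import Fraction

RATE_CD_HZ = 44_100
RATE_VIDEO_HZ = 48_000
BYTES_PER_SAMPLE_FRAME = 4   # assumed: 2 channels x 16 bits

# Uncompressed size scales linearly with sampling rate for a fixed sample format.
cd_bytes_per_second = RATE_CD_HZ * BYTES_PER_SAMPLE_FRAME
video_bytes_per_second = RATE_VIDEO_HZ * BYTES_PER_SAMPLE_FRAME

savings = Fraction(video_bytes_per_second - cd_bytes_per_second, video_bytes_per_second)

print(cd_bytes_per_second, video_bytes_per_second)  # 176400 192000
print(savings, float(savings))                      # 13/160 0.08125
```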

Because of the Nyquist theorem, we know that the maximum frequency that can be represented at any given sampling rate is half the sampling rate; thus a 44.1 kHz CD can capture tones up to 22.05 kHz, while a 48 kHz DVD can capture tones up to 24 kHz. The limit of human hearing is roughly 20 kHz, so in a theoretical, spherical-cow world, it seems like both capture standards would meet the requirement of fully capturing the entire audible spectrum.

In the real world, of course, cows aren’t spherical. A real anti-aliasing filter cannot cut off instantly at half the sampling rate, so in practice there are aliasing artifacts near the filter’s cutoff, and computationally simpler filters alias worse. So the point of the 48 kHz sampling rate used in digital video is to buy enough headroom above 20 kHz (4 kHz of it, rather than the 2.05 kHz a CD gets) for simple filters to operate without introducing audible artifacts.
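To put numbers on that headroom, here is a small sketch that computes the gap between the top of the audible band and the Nyquist frequency for each rate. The 20 kHz hearing limit is the rough figure used above, and `guard_band_hz` is a made-up helper name, not a standard function.

```python
AUDIBLE_LIMIT_HZ = 20_000  # rough upper limit of human hearing, per the text

def guard_band_hz(sample_rate_hz, passband_edge_hz=AUDIBLE_LIMIT_HZ):
    """Room between the top of the audible band and the Nyquist frequency."""
    nyquist_hz = sample_rate_hz / 2
    return nyquist_hz - passband_edge_hz

cd = guard_band_hz(44_100)
video = guard_band_hz(48_000)

print(cd, video)              # 2050.0 4000.0
print(round(video / cd, 2))   # 1.95
```

The filter has to go from passing everything to blocking everything within that gap, so a gap nearly twice as wide is what lets a simpler filter get away with it.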

Still, these standards were written a relatively long time ago, and we’ve had several more turns of Moore’s law since then. So a capacity-constrained network operator might want to consider jumping to 44.1 kHz audio sampling, at the cost of a little more filtering logic in the decoder.

Howdy Pierce, managing partner and Cardinal Peak co-founder, is a “video guy” whose technical background is in multimedia systems, software engineering and operating systems.