Archive for the ‘Howdy’ Category
Did the Manhattan Transfer use Auto-Tune?
I recently came across an allegation on Amazon.com that got me thinking. The review in question is by Andrew Grobengieser, and it is critical of the Manhattan Transfer’s latest album, The Chick Corea Songbook. Grobengieser alleges:
As a lifetime fan, I was unbelievably excited to hear of the release of a Chick Corea songbook. And then I listened. It only took me a moment before a sinking feeling set in, as I realized that ManTran, one of the best-blending and most in-tune vocal ensembles in recorded-music history, has succumbed to the scourge of modern recording known as “Auto-Tune”. Yes, Manhattan Transfer fans, welcome to the world GLEE and Cher. It’s all over the place on group harmonies, and even rears its ugly head on a few of the solo vocals.
I mean, really. Why ON EARTH would this production choice be made? It takes what are otherwise very hip and adventuresome arrangements, and makes them roboticized, metallic, cold, and inhuman.
It seems to me that it’s one thing to allege that a weekly TV musical is using Auto-Tune, but quite another to level the accusation at four vocal jazz icons.
I am by no means anything approaching an expert on this topic—just an interested fan. But the engineer in me was curious: Is it actually possible to detect the use of Auto-Tune?
First, I did a little background research. Auto-Tune is a tool that can be used to correct the pitch of recorded singing. Evidently it can be used in a subtle or blatant manner; Andy Hildebrand, the inventor of Auto-Tune, says:
At one extreme, Auto-Tune can be used very gently to nudge a note more accurately into tune. In these applications, it is impossible for skilled producers, musicians, or algorithms to determine that Auto-Tune has been used. On the other hand, when used as an effect, such as in hip-hop, Auto-Tune usage is obvious to all. Everything in between is subject to an individual’s unique listening skills.
This raises the question: Assuming that the Manhattan Transfer is attempting to use Auto-Tune in a subtle manner, how can Grobengieser detect its use? (In a follow-up comment to his review, he claims he is “a trained musician with years of experience dealing with vocal group intonation.”) Frankly, I didn’t believe he could detect it, so I decided to try to learn more.
According to one site:
The most important parameter is the retune speed – the time it takes Auto-Tune to glide the note to its perfect pitch. For maximum realism, the retune speed must be set to a value close to the retune speed of the singer’s natural voice. . . . But Auto-Tune’s retune speed can be set to any value right down to zero, which means that notes instantly jump to the exact pitch. This effect is decidedly un-natural. If the singer glides smoothly from one note to another, Auto-Tune will suddenly jump from one note to the next when the mid-point between them is reached.
I believe you can hear the unnatural Auto-Tune effect with a zero retune speed in this Cher song, which, according to various web sources, also seems to be the first use of Auto-Tune as a sound effect (in 1998).
But let’s assume that the Manhattan Transfer is trying to hide the use of Auto-Tune, in which case their recording engineer would presumably use a retune speed that approximates a “natural” value.
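To make the retune-speed idea concrete, here is a minimal Python sketch. It is not Antares's algorithm: it assumes 12-tone equal temperament with A4 at 440 Hz, it works on a list of already-detected pitches rather than on audio, and `retune_frames` is a hypothetical stand-in for the retune speed parameter.

```python
import math

A4_HZ = 440.0  # reference pitch; assumes 12-tone equal temperament


def to_semitones(freq_hz):
    """Distance from A4 in (fractional) semitones."""
    return 12 * math.log2(freq_hz / A4_HZ)


def nearest_semitone_hz(freq_hz):
    """Equal-tempered pitch closest to freq_hz."""
    return A4_HZ * 2 ** (round(to_semitones(freq_hz)) / 12)


def retune(pitch_track_hz, retune_frames=0):
    """Pull each detected pitch toward its nearest semitone.

    retune_frames is a hypothetical stand-in for "retune speed":
    0 snaps instantly; larger values apply the correction gradually.
    """
    alpha = 1.0 / (retune_frames + 1)
    correction = 0.0  # semitones currently being added to the input
    corrected = []
    for freq_hz in pitch_track_hz:
        position = to_semitones(freq_hz)
        wanted = round(position) - position  # correction that would land on the note
        correction += alpha * (wanted - correction)
        corrected.append(A4_HZ * 2 ** ((position + correction) / 12))
    return corrected


# A smooth glide from A3 (220 Hz) up one semitone, in ten steps.
glide = [220.0 * 2 ** (i / 9 / 12) for i in range(10)]
snapped = retune(glide, retune_frames=0)   # instant correction
gentle = retune(glide, retune_frames=20)   # slow correction
```

With `retune_frames=0` the ten-step glide collapses to just two pitches and jumps at the midpoint, which is the behavior the quoted description calls un-natural. With a slower setting the output stays close to the singer's original glide.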
Hildebrand’s original patent for Auto-Tune, also from 1998, has a relatively clear explanation of his invention and how it works. (In my experience, the technical clarity is unusual for a patent!) If you’re interested, I recommend the discussion from the middle of column 3 to the middle of column 6.
I wondered whether we could detect Auto-Tune because the notes would be too perfect. The song “500 Miles High” begins with an a cappella intro in which it is easy to isolate the first note sung by Janis Siegel. I brought this song into Audacity, zoomed in to the first second of the left channel, and then selected Analyze > Plot Spectrum.
This is a reasonably crude method, but if you can use it at a point in the music where you can isolate a single voice, it can show some interesting information. Above, if I’m remembering my music theory class correctly, you can see that Siegel is singing an “A”. You can see the fundamental in the first peak, which is highlighted with the thin vertical line in the screenshot above. To the right are all the harmonics.
As you can see, the plot shows that Siegel didn’t hit a perfect “A”—that would have been at 220 Hz. Instead, she’s at 216 Hz, which would be noticeably flat. I am definitely no expert, but I’m thinking that if you’re going to use Auto-Tune, why not get the note correct?
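Musicians usually express a gap like this in cents, where 100 cents make one equal-tempered semitone. A small Python sketch of the conversion, using the two frequencies above (the function name is just for illustration):

```python
import math


def cents_between(measured_hz, reference_hz):
    """Signed pitch offset in cents; negative means the measured note is flat."""
    return 1200 * math.log2(measured_hz / reference_hz)


offset = cents_between(216.0, 220.0)
print(f"{offset:.1f} cents")
```

The result is a little under 32 cents flat, or roughly a third of a semitone.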
There is a similar intro to the Manhattan Transfer song “Gentleman With a Family” from 1991’s The Offbeat of Avenues. I picked this song because it starts out similarly to “500 Miles,” and also because 1991 puts it well before Auto-Tune would have been in use. In this case, the intro isn’t a cappella, so there is some instrumentation playing and thus it’s a little harder to isolate just the singer’s voice. However, selecting the left channel from 20.5 to 21.5 seconds in this song yields the following frequency analysis:
I am pretty certain that the highlight is again on the fundamental of Siegel’s voice—she is hitting a C at 262 Hz. (I believe the peaks to the left are lower tones from the instruments.) Here, before the days of Auto-Tune, she’s dead-on. Of course, she was also 19 years younger!
There are many more sophisticated methods of analysis that suggest themselves. It would be interesting to plot the frequencies over time—perhaps a voice held on a long note without any variation would be a likely indicator of the use of Auto-Tune. If we could isolate each singer onto a separate voice track, it would even be possible to run the pitch detection portion of the Auto-Tune algorithm; if this indicated that tuning was necessary, it would probably be a good clue that Auto-Tune wasn’t used in the studio. My colleague Kevin Gross suggested looking at the vibrato and timbre, because vibrato is removed altogether by Auto-Tune (and then artificial vibrato is usually added back in, according to the patent), and timbre would be changed when samples are added or dropped as part of the tuning process.
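As a rough illustration of the "plot the frequencies over time" idea, the sketch below estimates one pitch per short frame and then measures how much the pitch wanders. It is a crude spectral peak search, not the pitch detector described in the patent, and the two test tones are synthetic stand-ins for an isolated vocal track. The sample rate, frame length, and search range are all assumptions chosen to keep pure Python fast.

```python
import cmath
import math


def peak_frequency(frame, rate_hz, lo_hz, hi_hz, step_hz=0.5):
    """Return the candidate frequency with the largest DFT magnitude."""
    n = len(frame)
    # Hann window: tapers the frame edges to reduce spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    best_hz, best_mag = lo_hz, -1.0
    for k in range(int(round((hi_hz - lo_hz) / step_hz)) + 1):
        f = lo_hz + k * step_hz
        w = -2j * math.pi * f / rate_hz
        mag = abs(sum(x * cmath.exp(w * i) for i, x in enumerate(windowed)))
        if mag > best_mag:
            best_hz, best_mag = f, mag
    return best_hz


def pitch_track(samples, rate_hz, frame_len, lo_hz, hi_hz):
    """Estimate one pitch per non-overlapping frame."""
    return [peak_frequency(samples[s:s + frame_len], rate_hz, lo_hz, hi_hz)
            for s in range(0, len(samples) - frame_len + 1, frame_len)]


def pitch_spread_hz(track):
    """Difference between the highest and lowest frame estimates."""
    return max(track) - min(track)


RATE = 4000   # samples per second (assumed; low to keep the demo quick)
FRAME = 400   # 0.1 s per frame

# One second of a perfectly steady 220 Hz tone.
steady = [math.sin(2 * math.pi * 220 * i / RATE) for i in range(RATE)]
# 220 Hz with +/-4 Hz vibrato at 5 Hz:
# instantaneous frequency is 220 + 4*sin(2*pi*5*t).
vibrato = [math.sin(2 * math.pi * 220 * i / RATE
                    - (4 / 5) * math.cos(2 * math.pi * 5 * i / RATE))
           for i in range(RATE)]

steady_track = pitch_track(steady, RATE, FRAME, 205, 235)
vibrato_track = pitch_track(vibrato, RATE, FRAME, 205, 235)
```

Comparing `pitch_spread_hz(steady_track)` with `pitch_spread_hz(vibrato_track)` shows the kind of difference one would look for: the steady tone barely moves from frame to frame, while the vibrato tone swings by several hertz. A real recording would be far messier, so this is only a starting point.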
Obviously, I can’t really conclude anything from what I’ve done so far. In his Amazon review, Grobengieser doesn’t specify where he thinks he hears Auto-Tune on The Chick Corea Songbook; possibly he’s not talking about the intro to “500 Miles”. Or possibly my analysis tools are not sophisticated enough to detect the use of Auto-Tune. Or possibly if you are an audio engineer trying to sneak a little Auto-Tune into a jazz recording, you are smart enough not to correct to the exact pitch. I have no idea. To my ears the Transfer occasionally sounds just a little off-key on this album, which I ascribe to their age (but it also argues against the use of Auto-Tune). Again, though, I’m no expert.
I’d welcome your thoughts in the comments!
Howdy Pierce is a managing partner of Cardinal Peak with a technical background in multimedia systems, software engineering and operating systems.
Sniffing iPad Traffic
For a project I’m working on, I was wondering how a particular video-related feature on Apple’s new iPad works. In order to figure that out, I thought it would be interesting to connect a network sniffer in-line with my shiny new iPad, so I could capture and analyze all the network traffic flowing to and from the device.
Although I did this with the iPad, the technique below is not specific to it; you could use the approach below to capture network traffic to any Wi-Fi-enabled mobile device, like an iPod Touch or a Palm Pre.
An easy way to do this is to configure a computer to serve as a bridge between an Ethernet network and an ad-hoc Wi-Fi network. Then, by running Wireshark or another network sniffer on the computer, you can capture the packets as they flow through to the mobile device on Wi-Fi.
My computer is a MacBook Pro running OS X 10.6 “Snow Leopard”, but the same concept should work on Windows or on earlier OS X versions, although the dialogs might look a little different. There are three steps:
- Configure the computer to act as a Wi-Fi Bridge
- Connect the iPad to the computer’s ad-hoc Wi-Fi network
- Capture the packets
Step 1: Configure OS X as a Wi-Fi Bridge
First, we need to configure OS X as a Wi-Fi bridge. To do this, select “Create Network…” from the AirPort drop-down menu. This dialog appears:
Type a network name, and, if you like, assign a password. I assigned a password just so I could ensure that only one device was connecting to my bridged Mac. We are nerds here at Cardinal Peak, so we tend to have a lot of devices floating around our office!
At this point, the iPad would be able to connect to the computer, but the computer is not yet configured to bridge the packets from the 802.11 network onto the Ethernet network. To configure bridging on OS X, you need to turn on what Apple calls “Internet Sharing”. Go to System Preferences and select the “Sharing” option. Turn on Internet Sharing, and set it up to “Share your connection from” “Ethernet”, “To computers using” “AirPort”:
Step 2: Connect the iPad to the ad-hoc Wi-Fi network
Next, you’ll need to configure the iPad to connect to the ad-hoc Wi-Fi network you just created. This is pretty easy: Go to Settings, and then Wi-Fi. You should see your new ad-hoc network in the list—in my case, I’m looking for “HowdysNetwork”:
Just tap on the ad-hoc network. If you elected to use a password, you’ll be prompted for it.
You can confirm your iPad’s network configuration by tapping the right arrow next to the network name:
Good—we have an IP address, but more importantly we have reasonable entries for Router and DNS server, as well.
Next, you should test out your bridged network connection by bringing up Safari on the iPad and proving you can visit a web site.
Step 3: Capture the Packets
The final step is to start up Wireshark on your computer and attach to the Wi-Fi interface. You normally need to start Wireshark as the super-user in order to have enough rights to capture traffic. There’s probably a cool way to do this graphically, but being an old-school Unix guy, I always bring up a Terminal window and type sudo wireshark &.
We want to capture packets on the Wi-Fi interface, which on my Mac is device en1. Click the leftmost button on the Wireshark toolbar, and then click “Start” next to device en1:
Now you should be all set—do something on your iPad to cause network traffic, and confirm that you see it showing up in the Wireshark window!
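If you would rather script the capture than click through the Wireshark GUI, one option is to record packets to a file with tcpdump and open that file in Wireshark afterwards. The Python sketch below only illustrates that alternative; it is not what the steps above do. It assumes tcpdump is installed, that the Wi-Fi interface is `en1` as on my Mac, and the output filename is a made-up example.

```python
import shlex
import subprocess


def build_capture_command(interface="en1", out_file="ipad.pcap", packet_count=None):
    """Build a tcpdump command line that writes raw packets to a capture file."""
    cmd = ["sudo", "tcpdump", "-i", interface, "-w", out_file]
    if packet_count is not None:
        # -c makes tcpdump exit after this many packets.
        cmd += ["-c", str(packet_count)]
    return cmd


def run_capture(**kwargs):
    """Run the capture; sudo may prompt for a password."""
    cmd = build_capture_command(**kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


# Example (not executed here): run_capture(interface="en1", packet_count=500)
```

The resulting file can be opened in Wireshark for the same kind of analysis as a live capture.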
Encoders Aren’t Commodities
My partner Ben Mesander had a really cool post the other day: An h.264 encoder written in 30 lines of C code.
Ben’s encoder outputs completely valid h.264, but it doesn’t actually compress anything. (What do you expect from 30 lines!) In fact, because of the necessary h.264 headers, the output of Ben’s encoder is larger than the input.
This is a dramatic example of something that I find interesting about the codec marketplace: Decoders are commodities, but encoders are highly differentiated. People often misunderstand this dynamic, however.
A video decoder, if it works, has to follow the relevant specification. There are hundreds of “tricks” that a baseline profile h.264 encoder could use, and so a baseline-profile decoder must be able to handle all of them. So there’s really not room for a lot of differentiation between h.264 decoders. Sure, one decoder might use less CPU than another. But mostly, if you’re looking to buy a decoder, you should shop based on price.
Another way to say the same thing is that a codec specification details how to write a decoder. The spec lays out what a compliant bitstream looks like, and specifies how you turn that bitstream into video or audio.
Encoders, as Ben showed, are completely different beasts. An encoder author can pick which of the tools provided by the standard he or she will use. In the extreme case, as Ben did, he can choose to use almost none of the tools. Therefore, there can be a huge difference in compression efficiency—and thus video quality—between two encoders.
You might think this is obvious, but if so you should walk around the security industry’s ISC West trade show this week. You will find all sorts of vendors claiming that their h.264 DVR is the same as their competitor’s DVR, or claiming that their h.264 IP camera is better than an MPEG-4 IP camera. Maybe so, and maybe not: Just because h.264 is a more modern and complex codec than MPEG-4 part 2, it doesn’t automatically follow that a particular h.264 encoder is better than a particular MPEG-4 encoder.
Ultimately, the only way to compare two encoders is a head-to-head bakeoff, where each encoder is set to the same data rate and fed the same content, and you view decoded video from the two at the same time.
More on Patents
I had intended to give the indemnification issue a rest. But then the following caught my attention this morning:
One big difference between patents and other kinds of intellectual property, like copyrights and trademarks, is that patent-holders who want to sue someone for infringement don’t have to show that their patents or their products were actually copied by the defendant. In fact, the issue of copying is legally irrelevant when determining whether or not someone infringed a patent. (It is relevant to willfulness—more on that below.) The flip side of that rule is that a defendant company can have a really nice story about they did their own research, invention, and development—but it doesn’t matter one bit, legally speaking. Such “independent invention” stories are no defense.
“No one seems to know whether patent infringement defendants are in fact unscrupulous copyists or independent developers,” writes Lemley. So he and his partner went on a hunt looking for copycats in patent disputes. How much copying did they find? Not much at all.
(Joe Mullin’s whole post is excellent; thanks to Brad Feld for calling attention to it.)
Which underscores my earlier point: Patent lawsuits don’t usually arise because of unethical behavior on the part of the engineering team. And therefore offering indemnity protection against these kinds of cases is not a financial risk that we can or should bear.
I’m not primarily out to agitate for reform of the patent system, but I agree with calls for adding an independent innovation defense. Such a reform would help swing the effect of the patent system back toward its original intention, which was to encourage innovation.
Providing Indemnification for Patent Infringement
Cardinal Peak recently had an unfortunate “first”: We chose to walk away from a promising engineering engagement because we couldn’t reach agreement with our customer about an indemnification clause.
Let me give a little background before diving into the issue. “Indemnification” technically is the legal obligation to compensate a business partner for losses that the partner might suffer during the performance of a contract.
To give an example, it’s common for professional services providers such as Cardinal Peak to indemnify our customer in the event of a lawsuit alleging that we copied a third party’s source code into the product we engineered for the customer. In this case, if the third party were to sue the customer for copyright infringement, then Cardinal Peak will be responsible for both the costs of the legal defense, plus the costs of any settlement that would arise. You can read a little more background (albeit with a testing-oriented slant) here.
Not to suggest that we take any financial risk lightly, but we are generally okay with providing this type of indemnity protection to our customers, because we control our own actions and we’re confident we won’t behave unprofessionally.
In this case, however, our customer insisted we provide indemnification in the case where we were to unwittingly infringe a third party’s patent. In my view, this is a very risky form of indemnification to offer.
I am aware of too many examples where existing patents cover relatively obvious methods of solving some particular problem. (A portion of our services business involves expert services work in support of patent litigation—so we’re all too familiar with the ugly details, although we can’t speak publicly about the cases we’ve worked on.) It seems all too possible that, without any foreknowledge of an existing patent, a competent engineer might independently hit on an already-patented method for solving a problem. And if this were to be the case, the amount of money required to mount even a trivial defense would put a small firm like Cardinal Peak out of business.
There’s also a quirk of the US patent system that is worth pointing out: To my understanding, it is not even theoretically possible to guarantee a priori that you don’t infringe a patent. Even with a complete patent search and the best legal advice, under our system, the only way you can determine infringement is through a jury trial.
So really there’s no way for a firm like Cardinal Peak to be absolutely, 100% certain we don’t infringe a patent.
One way to look at this is that there is inherent financial risk in releasing any high-tech product. The form of risk that first comes to mind is when a high-tech company invests a lot of money in developing a product, just to see it flop when it comes to market.
But market failure isn’t the only form of risk in product development: Liability for unwitting patent infringement is another. (Unfortunately, it seems to be a growing risk!) Just like market risk, patent infringement liability is a risk that should be borne by the company that also stands to reap the rewards from a successful product—and that’s usually not the provider of your engineering services. After all, rarely will the service provider own the intellectual property that results from their work for hire.
I’m not suggesting that Cardinal Peak would never offer indemnification for unwittingly infringing someone’s patent. But there would have to be some upside. It is a risky move for us, and we would want to see some extra return that justifies taking that risk, whether that would be in the form of royalty participation in the product, or IP ownership of whatever is developed.
What do you think? This is an issue that my partners and I continue to discuss, so we welcome your comments.
The Cost of an Engineer-Hour
As all good project managers know, there are three dimensions to any engineering effort:
- The features of the product: What does the product do and how does it look? (For sake of simplicity, let’s include “quality” as a product feature.)
- The schedule on which the product is produced: How fast does it get to market?
- The cost of producing the product: How much money does your company have to invest in bringing the product to market?
Before we started Cardinal Peak, my partners and I had all spent many years as engineering managers inside of various companies. To me, it always seemed that I was most visibly measured on the first two dimensions: features and schedule.
To the extent that I was measured on the cost axis, the companies I worked for would normally use “number of heads” as a loose proxy for “dollars spent to bring this product to market”. But I had a lot less control over the number of heads assigned to my team than I did over the other two dimensions, and as a result I believe that perceptions of my success were influenced almost entirely by how my team met our goals in the first two dimensions.
So I guess it’s not surprising that back then I only had a vague idea about how to assign a dollar value to any engineering team’s most precious resource: the engineer-hour.
The somewhat embarrassing truth is that previously I didn’t think about dollar costs too much. I had heard from other managers that an engineer cost my company about $220,000 per year, once you accounted for all the indirect costs like benefits, IT, rent, HR and administrative support. (This specific example comes from Silicon Valley in about 1998, in a relatively high-benefit, high-support environment.)
But I couldn’t really derive the annual engineer cost from any hard data, because I wasn’t directly responsible for any of those expenditures. (I did have a degree of influence over a new hire’s starting base salary. After that, however, everything was relatively programmed: Raises were given within parameters set by the HR department and company management, as was the bonus program.)
I don’t want to make the mistake of projecting my own personal naïveté onto others! But this conversation comes up occasionally, usually from engineering managers for whom it is self-evident that the cost of internal hiring is so much lower than the cost of hiring a firm like Cardinal Peak.
So on an airplane flight recently, I pulled together this spreadsheet, which you can use to compute your own internal cost for an engineer-hour.
I have pre-filled certain cells with reasonable values for a relatively low overhead, mid-benefit engineering team located in Colorado. I’ve tried to model the compensation package given to a competent senior engineer who has 8-12 years of experience. This isn’t your team lead, but neither is it the young gal you just hired with a Master’s degree. Instead, I’m trying to model the engineer who will form the backbone of your team—the workhorse who will show up on time, focus on work, and reliably crank out features.
Your situation may be different, so feel free to play around with the cell values to see what your own costs are.
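For readers who prefer code to spreadsheets, here is a minimal Python sketch of the same kind of calculation. It is not a copy of the spreadsheet: the field names, the cost categories, and every number in the example are hypothetical placeholders that you should replace with your own.

```python
from dataclasses import dataclass


@dataclass
class EngineerCostModel:
    """Hypothetical inputs for a fully loaded engineer-hour estimate."""
    base_salary: float                     # annual base pay, in dollars
    benefits_rate: float                   # benefits and payroll taxes, as a fraction of salary
    overhead_per_year: float               # IT, rent, HR, admin support per engineer
    paid_hours_per_year: float = 52 * 40   # 2,080 paid hours
    time_off_hours: float = 0.0            # vacation, holidays, sick time

    def annual_cost(self):
        """Total yearly cost of employing the engineer."""
        return self.base_salary * (1 + self.benefits_rate) + self.overhead_per_year

    def productive_hours(self):
        """Hours per year the engineer is actually at work."""
        return self.paid_hours_per_year - self.time_off_hours

    def cost_per_hour(self):
        """Fully loaded cost of one engineer-hour."""
        return self.annual_cost() / self.productive_hours()


# Made-up example values, not recommendations.
example = EngineerCostModel(
    base_salary=100_000,
    benefits_rate=0.25,
    overhead_per_year=30_000,
    time_off_hours=200,
)
hourly = example.cost_per_hour()
```

With these placeholder inputs the annual cost is $155,000 spread over 1,880 working hours, or about $82 per engineer-hour. Spreading the $220,000 figure mentioned earlier over 2,080 paid hours works out to roughly $106 per hour.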
On the Importance of Encrypting Video
This morning brought a front-page Wall St. Journal article that’s a bit of a jaw-dropper:
Militants in Iraq have used $26 off-the-shelf software to intercept live video feeds from U.S. Predator drones, potentially providing them with information they need to evade or monitor U.S. military operations.
…
The potential drone vulnerability lies in an unencrypted downlink between the unmanned craft and ground control. The U.S. government has known about the flaw since the U.S. campaign in Bosnia in the 1990s, current and former officials said. But the Pentagon assumed local adversaries wouldn’t know how to exploit it, the officials said.
After the Journal article, the Pentagon quickly let it be known that the problem has been fixed. But I’m stunned that it could have happened in the first place.
At one point, I had heard that video from Predator drones was transmitted as unencrypted analog NTSC video, with geo-spatial metadata encoded into the closed-captioning portion of the data stream following this specification—basically an industrial form of those annoyingly-advertised X10 wireless cameras. But I had assumed that the Pentagon would have long since upgraded the system to digital video with some reasonable form of encryption. I guess someone needs to do a little reading on the pros and cons of security through obscurity.
Mobile Lounges and Development Methodology
I recently flew into Washington Dulles airport, and I was struck anew with the sheer impracticality of the Dulles “mobile lounge.” It’s like something out of Star Wars. I can only imagine the glee of the Chrysler salesman who, in 1962, sold the Dulles airport authority a fleet of these whimsical things.
You may think that modernist-era people movers have very little to do with embedded engineering. You’re basically right—but stick with me, because there’s a worthwhile point here.
Mobile lounges are 54 feet long by 16 feet wide, travel at a top speed of 25 mph, and can move just over 100 passengers. Today, 19 mobile lounges are used to connect the Dulles main terminal with some of the other terminals. When a mobile lounge nears its dock, the entire passenger compartment scissors up, which is so much easier than building a ramp, I guess.

I didn’t have the presence of mind to snap my own picture of a mobile lounge, but Flickr user Afagen has graciously shared this photo under a Creative Commons license.
The whole thing isn’t sensible, on so many levels. Hard cost data is difficult to come by, but just think about it:
- The upfront engineering effort must have been nontrivial; mobile lounges, with their scissor lifting mechanism and odd size, are not particularly similar to anything else Chrysler was likely to have had lying around in the late 1950s.
- The per-lounge cost, especially in low volume, must be pretty high.
- Maintenance on these beasts would be a specialized skill, and spare parts must be hard to come by, especially after so many years.
Why mobile lounges?
Evidently, Eero Saarinen—the great modernist architect who designed Dulles airport, JFK airport and the St. Louis arch—is to blame for the mobile lounge. Saarinen wanted to do away with the long distances that people have to move through large airports, and he didn’t like the cost of the finger-like “jet bridges” found in typical airports, either. He conceived of the mobile lounge as a way to move people from a central point directly to their aircraft while affording airport management several operational benefits.
Of course, a system of ordinary buses could have met the same goals. According to the same book quoted above, in the early 1960s, some European airports used buses to take passengers to their airplanes, but Saarinen didn’t like the bus approach because it required passengers to climb stairs, and it exposed them to the weather and the loud sound of airplane engines.
An alternate approach
I’d guess that in the early 1960s, it would have been possible to design a covered, portable escalator that could whisk passengers from the door of a bus up to an airplane door—far less expensively than designing the brand-new mobile lounge.
Think of it as a choice between an evolutionary and revolutionary approach. The evolutionary approach would have been to take a well-understood, de facto standard—the city bus—and extend it with a new accessory, like a portable escalator. But Saarinen chose the revolutionary approach, and he ended up reinventing the wheel, badly.
Which brings us around to the embedded engineering space. Different engineering teams have different cultures. Joel Spolsky recently stirred up a hornets’ nest of trouble when he suggested that the best programmers were those who ignore grandiose theory in favor of a make-it-happen approach:
Jamie Zawinski is what I would call a duct-tape programmer. And I say that with a great deal of respect. He is the kind of programmer who is hard at work building the future, and making useful things so that people can do stuff. He is the guy you want on your team building go-carts, because he has two favorite tools: duct tape and WD-40. And he will wield them elegantly even as your go-cart is careening down the hill at a mile a minute. This will happen while other programmers are still at the starting line arguing over whether to use titanium or some kind of space-age composite material that Boeing is using in the 787 Dreamliner.
When you are done, you might have a messy go-cart, but it’ll sure as hell fly.
Now, I would never describe the end result of our engineering process as “a messy go-cart,” but despite that one quibble, I mostly agree with Spolsky. There are appropriate places for innovation and design. But most of the time, business goals are best served by a city bus instead of a mobile lounge.
The trick is often to find the appropriate existing core—be it an open-source project, a standard component like DirectShow, some licensable intellectual property, or a city bus—and then extend it. Sticking with proven components based on open standards ultimately means a quicker time to market, lower ongoing maintenance costs, and a more elegant solution.
Howdy Pierce, managing partner and Cardinal Peak co-founder, is a “video guy” whose technical background is in multimedia systems, software engineering and operating systems.
Why do CDs use a sampling rate of 44.1 kHz?
During a recent conference call discussing audio sampling rates, the question came up: Why do CDs use a sampling rate of 44.1 kHz?
First, a little background: When you sample an audio waveform, you have a choice as to how many samples you take per second. Over the years, a number of standards have developed; in digital media used for entertainment purposes, the two most common sampling frequencies are 44.1 kHz and 48 kHz.
As a video guy, I think of 48 kHz as the “natural” choice for audio sampling. It is the frequency used in most digital television applications, including DVD and HDTV. It’s an even multiple of the sampling rate used in telephony—8 kHz—so conversions are relatively straightforward.
But most music is sampled at 44.1 kHz, because this is the standard used for CD audio. The question we were asking in my conference call was: Why were CDs standardized around this sampling frequency?
Although we think of 44.1 kHz as an audio standard, the great Internet Oracle says that this magic number was actually derived from the early use of video recorders to record audio. Evidently creating a recorder capable of recording at around 1.4 Mbps—the data rate of uncompressed digital audio—was a difficult feat back in the day, so engineers of that time repurposed analog video recorders in order to record digital audio. If you modulate a digital audio stream in such a manner that you encode three samples of audio on every visible line of video, then you can record audio in real time on a VCR if you sample at exactly 44.1 kHz—or so the story goes.
The math from the FAQ linked above works, with some caveats. Take this excerpt from Digital Interface Handbook (Francis Rumsey and John Watkinson, Third Edition, p. 53):
In 60 Hz video, there are 35 blanked lines, leaving 490 lines per frame, or 245 lines per field for samples. If three samples are stored per line, the sampling rate becomes 60 × 245 × 3 = 44.1 kHz.…
The sampling rate of 44.1 kHz came to be that of the Compact Disc. Even though CD has no video circuitry, the equipment used to make CD masters was originally video based and determined the sampling rate.
This sounds good, except that NTSC video actually runs at 29.97 frames per second, which makes the field rate come out to 59.94 instead of 60. And the sampling rate of 59.94 × 245 × 3 = 44,055.9, so it’s off by just a little. (The math works exactly for PAL video.) But I’m willing to assume that there was a way for engineers to jury-rig the VCR to run at exactly 30 fps, and not 29.97, and then the math would come out correctly. (If you can shed more light on this discrepancy, please leave a comment on this post. I’d love to get to the bottom of it.)
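The arithmetic is easy to check in a few lines of Python. The 60 Hz and 59.94 Hz cases use the numbers quoted above. The PAL line count (294 usable lines per field at 50 fields per second) is the commonly cited figure and is my addition, not something stated in the excerpt.

```python
def samples_per_second(field_rate_hz, lines_per_field, samples_per_line=3):
    """Audio samples stored per second when each video line carries a fixed number of samples."""
    return field_rate_hz * lines_per_field * samples_per_line


video_60hz = samples_per_second(60, 245)      # the handbook's 60 Hz case
ntsc_color = samples_per_second(59.94, 245)   # actual NTSC color field rate
pal = samples_per_second(50, 294)             # PAL line count is an assumption (see above)

for name, rate in [("60 Hz", video_60hz), ("59.94 Hz", ntsc_color), ("PAL", pal)]:
    print(f"{name}: {rate:,.1f} samples/s")
```

Only the 59.94 Hz case misses 44,100, and it falls short by about 44 samples per second.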
Incidentally, this math suggests that three samples were stored as black-and-white pixels. We’re talking about stereo audio here, and presumably even in the 1970s people were sampling at 16 bits per sample, which implies that 12 bytes or 96 bits would have been encoded per line of video.
The whole story reminds me of that email—possibly apocryphal—about how the width of the space shuttle rocket booster is related to Roman war chariots.
But maybe the interesting question isn’t why CDs use a 44.1 kHz sampling rate, but rather why digital video uses 48 kHz. The reason this seems like an interesting question is that there’s less data to compress at lower sampling frequencies. Specifically, 44.1 kHz sampling leads to about 8 percent fewer bytes before compression than 48 kHz does. So you’d expect 44.1 kHz audio to be more widely used in digital video, because it should be able to deliver the “CD experience” at a lower overall data rate.
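Both the 1.4 Mbps figure mentioned earlier and the "about 8 percent" claim can be reproduced directly. This sketch assumes uncompressed 16-bit stereo PCM at both sampling rates, purely for the sake of comparison.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample=16, channels=2):
    """Bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


cd_rate = pcm_bit_rate(44_100)      # 1,411,200 bits per second
video_rate = pcm_bit_rate(48_000)   # 1,536,000 bits per second
savings = 1 - cd_rate / video_rate  # fraction of data saved at 44.1 kHz

print(f"44.1 kHz: {cd_rate:,} bps")
print(f"48 kHz:   {video_rate:,} bps")
print(f"savings:  {savings:.1%}")
```

The savings depend only on the ratio of the two sampling rates, so the result is the same for any bit depth or channel count.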
Because of the Nyquist theorem, we know that the maximum frequency that can be represented at any given sampling rate is half the sampling rate; thus a 44.1 kHz CD can capture tones up to 22.05 kHz, while a 48 kHz DVD can capture tones up to 24 kHz. The limit of human hearing is roughly 20 kHz, so in a theoretical, spherical-cow world, it seems like both capture standards would meet the requirement of fully capturing the entire audible spectrum.
In the real world, of course, cows aren’t spherical. In practice there are aliasing artifacts near the cutoff of the anti-aliasing filter, with less computationally complex filters having worse aliasing. So the point of the 48 kHz sampling rate used in digital video is to buy enough headroom above 20 kHz for simple filters to operate without introducing audible artifacts.
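To put numbers on the last few paragraphs, here is a rough Python sketch. The helper names are made up for illustration, and it assumes uncompressed 16-bit stereo PCM and a 20 kHz limit of hearing, as above.

```python
HEARING_LIMIT_HZ = 20_000  # rough limit of human hearing, per the text


def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2


def pcm_bytes_per_second(sample_rate_hz, channels=2, bits_per_sample=16):
    """Uncompressed PCM data rate (assumes stereo, 16-bit samples)."""
    return sample_rate_hz * channels * bits_per_sample // 8


cd_rate, video_rate = 44_100, 48_000

# Fraction of bytes saved by sampling at 44.1 kHz instead of 48 kHz.
savings = 1 - pcm_bytes_per_second(cd_rate) / pcm_bytes_per_second(video_rate)

# Room between the edge of hearing and the Nyquist limit, where the
# filter has to roll off.
cd_headroom_hz = nyquist_limit_hz(cd_rate) - HEARING_LIMIT_HZ
video_headroom_hz = nyquist_limit_hz(video_rate) - HEARING_LIMIT_HZ

print(f"Savings at 44.1 kHz: {savings:.1%}")
print(f"Filter headroom: {cd_headroom_hz:.0f} Hz vs {video_headroom_hz:.0f} Hz")
```

So 48 kHz gives the filter roughly twice as much room to roll off (4,000 Hz versus 2,050 Hz) in exchange for about 8 percent more data.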
Still, these standards were written a relatively long time ago. Today we’ve had several more turns of Moore’s law. So maybe a capacity-constrained network operator might want to consider jumping to 44.1 kHz audio sampling, at the cost of a little more filtering logic in the decoder.
Howdy Pierce, managing partner and Cardinal Peak co-founder, is a “video guy” whose technical background is in multimedia systems, software engineering and operating systems. Read more about Cardinal Peak’s digital video expertise.
Video-Aware Network Elements
Mitch Vine responded to my last post with this thought-provoking question:
When video is streamed across IP networks or across wireless network links, the network links can sometimes be a bottleneck, unable to perform at desired data rates. Any thoughts on how network elements in the path between the encoder and decoder can be better citizens in the streaming process? So for example if a wireless element is essentially a layer 2 bridge, transparent to TCP or UDP, what would you do to minimize its impact on video quality in times when the wireless element has temporarily slowed due to some external condition like RF interference?
I thought this was interesting enough to merit a response in the form of a full blog post. The post Mitch responded to was about RTP video, but the same answer applies to a wide range of streaming video protocols, including RTP, ASF, and MPEG-2.
Approaches that could be applied today
There are two strategies that can be employed today to adapt the data rate of a video stream when needed:
- Send a message back to the video source, and ask it to reduce its transmission rate. I’m not going to talk too much about this, because it’s not a general solution; it doesn’t work well in the case of multicast video, and it doesn’t work well in the case of pre-encoded video delivered from a video-on-demand server.
- Be intelligent about which packets to drop. This is interesting because the subjective disruption caused by losing a packet varies with the importance of that packet. If you can pick the right packets to shoot in the head, you can minimize the glitches seen by the user.
Concentrating on the second approach, we can think about what packets to drop. I’m not certain how complex we are allowed to assume the wireless element is, or how much buffering such a device has. Basically, the more deeply this device is capable of examining the data it is carrying, and the more packets it can look at before selecting which one to drop, the smarter it can be about selecting exactly the right packets to discard.
So here’s a list of strategies in increasing order of complexity:
First, I would always discard UDP packets before TCP packets. The rationale for this is that a discarded TCP packet is just going to cause a retransmission in a very short period of time, so unless you are only trying to reduce the data rate right at this instant, discarding a TCP packet is probably a bad idea if it can be avoided.
So at this point, assume that we have a handful of UDP packets, all carrying video and audio. (I have no way of offering a prioritization between video data and data for other applications. As a video guy, of course, I’d say anything else is less important!)
Within a set of UDP streams, I’d try to eliminate unicast packets before multicast packets, on the theory that multicast video is likely to have more than one person watching, and it’s more likely to be a live transmission.
If you have more than one unicast stream, there are two theories as to how to proceed. You can either democratically spread the pain around by dropping a few packets from each stream, or you can blatantly discriminate by randomly picking one stream and beating the hell out of it while attempting to preserve the other streams. Personally I think the discrimination approach is the best answer; you’ll probably get that one user to give up on her stream altogether, and then you’ve pretty dramatically reduced the overall data rate.
Within any one stream, we’d prefer dropping video packets to audio packets, for two reasons. One, the audio has a much lower data rate, so there’s not that much savings there. And two, for most users and most applications, losing audio (which will manifest as a dropout or garbled period) is subjectively worse than losing a bit of video.
We’re starting to get to the point where the next few tricks require pretty deep introspection into the data packets, so they’re probably not super feasible. But even if this is mostly academic, let’s keep going.
Within a video stream, it is best not to drop packets in a key frame. For instance, in the MPEG codecs, it’s best to eliminate packets from B frames before eliminating packets from reference (I or P) frames. This is because the video glitch resulting from a corrupted B frame will only last one frame-time, or 33 msec if the video in question is 30 frames per second. Many users won’t even notice it. In contrast, the glitch resulting from a corrupted I frame will extend over the entire group of pictures, and that could be anywhere from a half second up to several seconds – definitely noticeable.
Finally, if I get to be really picky, it’s best to drop packets that are towards the end of the frame. Compressed video is transmitted in a top-to-bottom manner, so if you’re going to corrupt a frame of video, you’ll minimize the impact if you drop one of the later packets in the frame, because less of the picture is affected.
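The strategies above boil down to a sort order, which can be sketched in a few lines of Python. Everything here is hypothetical: the `Packet` fields, the `drop_rank` function, and the idea that the bridge already knows each packet's stream, media type, frame type, and position in the frame. A real device would have to dig those out of the headers and payload, which is the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    stream_id: str
    transport: str                     # "udp" or "tcp"
    multicast: bool
    media: str                         # "video" or "audio"
    frame_type: Optional[str] = None   # "I", "P" or "B" for video
    frame_position: float = 0.0        # 0.0 = start of frame, 1.0 = end


def drop_rank(pkt, victim_stream=None):
    """Sort key: packets that sort first are the ones to drop first."""
    return (
        pkt.transport != "udp",          # UDP before TCP
        pkt.multicast,                   # unicast before multicast
        pkt.stream_id != victim_stream,  # pick on one stream, if chosen
        pkt.media != "video",            # video before audio
        pkt.frame_type != "B",           # B frames before I or P frames
        -pkt.frame_position,             # end of frame before start
    )


def choose_packets_to_drop(queue, count, victim_stream=None):
    return sorted(queue, key=lambda p: drop_rank(p, victim_stream))[:count]


queue = [
    Packet("a", "tcp", False, "video", "B", 0.9),
    Packet("b", "udp", True, "video", "B", 0.9),
    Packet("c", "udp", False, "audio"),
    Packet("c", "udp", False, "video", "I", 0.5),
    Packet("c", "udp", False, "video", "B", 0.2),
    Packet("c", "udp", False, "video", "B", 0.8),
]
victims = choose_packets_to_drop(queue, 2)
print(victims)
```

With this queue, the two packets chosen are the unicast B-frame packets from stream "c", with the one nearer the end of the frame first. The tuple ordering mirrors the order of the paragraphs above; reordering the tuple is all it takes to try a different policy.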
There are probably some more strategies, so if you’ve got thoughts please share them in the comments.
Approaches that might work in the future
One conceptually simple approach to solve this problem would be for the source of the video to somehow indicate the relative importance of each packet in a header that is easily interpreted by network elements. This is the basic idea behind QoS, and you can find lots of variations on the Internet. A slightly different approach, used by DiffServ and MPLS, would be to group similar streams together into pre-defined priority classes at an ingress point; this allows interior network nodes to drop lower-priority traffic.
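As one illustration of what marking at the source could look like, here is a hedged Python sketch using DiffServ code points. The AF41/AF42/AF43 values are the standard Assured Forwarding class 4 code points, where a higher drop precedence is discarded first. The mapping from frame type to code point is purely my own assumption, not an established practice, and `mark_socket` relies on the platform exposing `IP_TOS`.

```python
import socket

# Assured Forwarding class 4 code points (low, medium, high drop precedence).
AF41, AF42, AF43 = 34, 36, 38

# Hypothetical mapping: the frames that hurt most to lose get the
# lowest drop precedence.
DSCP_BY_FRAME_TYPE = {"I": AF41, "P": AF42, "B": AF43}


def tos_byte(dscp):
    """The DSCP sits in the upper six bits of the IPv4 TOS byte."""
    return dscp << 2


def mark_socket(sock, dscp):
    """Ask the OS to stamp outgoing IPv4 packets on sock with this DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte(dscp))


for frame_type, dscp in DSCP_BY_FRAME_TYPE.items():
    print(f"{frame_type} frame -> DSCP {dscp}, TOS byte {tos_byte(dscp):#04x}")
```

Because the marking is a per-socket option, a sender would either change it before each send or keep one socket per priority class. Whether any network element along the path honors the marking is a separate question.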
It’s worth mentioning that there is a class of codecs that involve what is known as scalable video compression (here’s an example). The general approach is to produce a base layer plus one or more enhancement layers, where the base layer by itself is decodable and will produce a useful image, and the enhancement layers add resolution or otherwise increase the video quality. If you’re lucky enough to have such a codec, then obviously the strategy would be to drop enhancement layer packets. Unfortunately, I’m not aware of scalable video compression being used in any commercially interesting manner currently.
Finally, another potential approach is to rate-shape or transrate the video stream as needed by dynamically recoding packets, right in the middle of the network, to reduce video quality on the fly. Devices exist that do just this for use in cable head-ends, but I’m going to assume that the computational complexity of rate shaping is too high for your basic Layer 2 bridge today. (Drop me a line if I’m wrong; we’d love to build a little embedded rate-shaper….)
Thanks for the question!








