Archive for the ‘Video’ Category
If only we had better test content…
I just saw this news about research that says you notice compression artifacts less if you like the content of a particular video clip:
Using four studies, Kortum, along with co-author Marc Sullivan of AT&T Labs, showed 100 study participants 180 movie clips encoded at nine different levels, from 550 kilobits per second up to DVD quality. Participants viewed the two-minute clips and then were asked about the video quality of the clips and desirability of the movie content.
Kortum found a strong correlation between the desirability of movie content and subjective ratings of video quality.
(The original paper seems to be here, behind a paywall.)
Makes me wonder about the classic test footage with the calendar and the model train!
Creating Single Frame Movies
My camera (an Olympus SP-570UZ) allows me to optionally record a four-second audio clip with each photo I take. I haven’t used this feature much, because I typically upload my photos to Flickr, and there’s been no good way to associate the audio with the photo. Ideally, I would like an audio player to appear below the photo, but there aren’t really any public audio sharing websites with much longevity. And, in any case, Flickr won’t allow me to embed an audio player in my photo description.
Recently, it occurred to me that since Flickr allows short movies (up to 1:30 long), maybe I could create a single-frame movie with the still picture as the frame and the audio as the sound track. Then the Flickr movie player would serve as the control for the audio, and the audio and the video would stay associated with each other.
I decided to try to use ffmpeg to create the movie, since it seems to be able to do almost anything with video and audio. The command line for ffmpeg is a bit obscure, so this blog post documents about two hours of my time spent getting it to work.
My camera produces 3648×2736 JPEG images, and the audio files are 8 kHz sample rate, mono, 8 bit unsigned PCM samples in WAV file format. I decided my goal would be to create a motion JPEG (MJPEG) encoded AVI file with maximum quality.
I started by searching the web to see if anyone had done this before. By studying those examples and experimenting, I came up with the following ffmpeg command line:
ffmpeg.exe -loop_input -shortest -f image2 -r 0.25 -i P910033.jpg -i P910033.wav -vcodec mjpeg -qscale 1 -t 4 foo.avi
Most of my attempts caused ffmpeg to hang. But eventually, I got the error message below:
Duration: 00:00:04.00, start: 0.000000, bitrate: N/A
Stream #0.0: Video: mjpeg, yuvj422p, 3648x2736, 0.25 tbr, 0.25 tbn, 0.25 tbc
[wav @ 01a80050]Estimating duration from bitrate, this may be inaccurate
Input #1, wav, from 'P6060033.wav':
Duration: 00:00:04.02, bitrate: 64 kb/s
Stream #1.0: Audio: pcm_u8, 8000 Hz, 1 channels, u8, 64 kb/s
[mp2 @ 01ac6310]Sampling rate 8000 is not allowed in mp2
Output #0, avi, to 'foo.avi':
Stream #0.0: Video: mjpeg, yuvj422p, 3648x2736, q=2-31, 200 kb/s, 90k tbn, 0.25 tbc
Stream #0.1: Audio: mp2, 8000 Hz, 1 channels, s16, 64 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Stream #1.0 -> #0.1
Error while opening encoder for output stream #0.1 - maybe incorrect parameters such as bit_rate, rate, width or height
At last I understood the problem: mp2, the audio encoder ffmpeg chose for the AVI output, does not allow an 8 kHz sampling rate, so the audio needed to be sampled at some other rate. So I decided to use Audacity, another open source application, to upsample the sound. However, now Audacity was unhappy with this audio format.
So I used Project->Import Raw Data, and selected my WAV file. I set up the import with the following parameters:
I knew this would work, because the WAV file format consists of a header, followed by PCM data, in this case 8 kHz unsigned samples. So the result in the audio editor would be an audio file with the WAV header as a noisy sound at the start, followed by the data I wanted. The selected (darker) portion of the WAV file below is the header. I used Edit->Cut to remove it.
Finally, I tried to save the audio at a different sample rate. The audio file has a pulldown menu that lets you change the sample rate, but it doesn’t do what I wanted—what it does is play the audio file back at a different rate with aliasing.
Instead, after consulting the Audacity documentation, I discovered you use the menu at the lower left corner of the main Audacity window to set the sample rate.
Change this to 48000, and choose File->Export as WAV to save at the new sample rate. I re-ran ffmpeg, and the resulting AVI file would play in QuickTime and VLC player (although VLC crashes afterwards), but it would not work in Windows Media Player (audio played, no video), divx, realplayer, or Flickr. So, I decided to try encoding to mp4 instead with the following command:
ffmpeg.exe -loop_input -shortest -f image2 -r 0.25 -i P910033.jpg -i P910033.wav bar.mp4
The resulting mp4 file plays in all the media players (although, again, VLC crashes after playing it), and Flickr can read it successfully as well. Here is what it looks like on Flickr:
Using size as a proxy for quality, however, the encoded video is much smaller than the input JPEG file. Can someone suggest additional flags to ffmpeg to improve the encoding quality?
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
The Basics of 3D Image Acquisition
One of our clients is heavily involved in 3D video and has been for several years. However, several others are just now starting to think about it because of the uptick of interest in the consumer electronics world. Enough questions have been posed to us recently that it seemed worthwhile to me to pull together a few basic facts regarding 3D stereopair imaging and stereo disparity.
First, we need a simple model of a lens. Consider the diagram below:
In this picture, the long horizontal line that passes through the center of the lens is called the lens axis. The lens has the property that rays that pass through the center of the lens are undeviated. Therefore, the ray from the top of the tree, at a distance l to the left of the lens, passes straight through the center of the lens. (The tree has a height of h.) The lens also has the property that rays that arrive parallel to the lens axis (that is, perpendicular to the plane of the lens) are refracted to pass through the focal point of the lens. The focal point lies on the lens axis and is a distance f from the center of the lens. The intersection of these two rays shows where the image of the tree will be formed. You can see that the image of the tree is upside down, and has a new height h’. The image is formed a distance d to the right of the focal point.
By using similar triangles we see first that
Using a different pair of similar triangles we also see that
Solving the first equation above for h’, substituting the result into the second equation and simplifying, we derive the following relationship:
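Written out, the three relations can be reconstructed from the geometry described above. This is a sketch that assumes the first pair of triangles comes from the ray through the center of the lens and the second from the ray through the focal point, with h, h’, l, f and d as defined in the text:

```latex
\begin{align}
\frac{h'}{h} &= \frac{f + d}{l} && \text{(ray through the center of the lens)} \\
\frac{h'}{h} &= \frac{d}{f}     && \text{(ray through the focal point)} \\
d &= \frac{f^{2}}{l - f}        && \text{(eliminating } h' \text{ and solving for } d\text{)}
\end{align}
```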
This is the fundamental equation of a simple lens. It shows that as the object gets further and further from the lens, i.e. as l increases, the distance of the image of the object from the focal plane decreases, i.e. d gets smaller. We can assume that the camera’s image sensor is located at a distance f from the lens, is perpendicular to the lens axis, and that all objects more than a certain distance away from the lens will be in focus. In other words, the image of all sufficiently distant objects will appear on the focal plane where the image sensor is located.
In the case of 3D video, two cameras are used to acquire a sequence of stereopair images, one from the left camera and one from the right. Different stereo geometries are possible, but the most common one is to place the two cameras horizontally apart from each other by a distance i, and to keep their focal planes coplanar. The diagram below illustrates this configuration:
The horizontal line at the bottom is the focal plane; it is clear from the diagram that the focal planes are coplanar. The lenses are a distance f from the focal plane and are separated by a distance of i from each other. We assume that a small object (or a point on a larger object) is located a distance l from the lens plane and a distance m to the right of the axis of the right lens. We want to know where the image of that object appears in the left and the right camera. In particular, we want to know if we overlaid the left image on top of the right image, how far apart would the images appear? Mathematically, we want to know the disparity, which we define to be ρ = s1 − s2,
where s1 and s2 are the distances from the image point to the intersection of the lens axis with the focal plane for the left and the right cameras respectively. Note that we are assuming that the object being imaged is far enough away that its image forms on the focal plane.
Using our favorite trick of similar triangles we have the following two equations:
and
Solving the first equation for s1, the second equation for s2, taking the difference and simplifying yields
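The two similar-triangle relations and their difference can be reconstructed from the geometry described above. This sketch assumes the disparity is taken as ρ = s1 − s2 (the sign convention is an assumption) and uses the fact that the object lies a distance i + m to the right of the left lens axis:

```latex
\begin{align}
\frac{s_1}{f} &= \frac{i + m}{l} && \text{(left camera)} \\
\frac{s_2}{f} &= \frac{m}{l}     && \text{(right camera)} \\
\rho = s_1 - s_2 &= \frac{f\,i}{l}
\end{align}
```

Note that m cancels in the difference, which is why the result does not depend on where the object sits horizontally.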
Although this expression was derived for an object to the right of the axis of the right camera, it is easy to show in a similar manner that it is also true for an object between the axes of the two cameras as well as for an object to the left of the axis of the left camera.
So what does this equation tell us? First, it says that for this particular camera geometry, the disparity is only a function of the separation between the two cameras, i, and the distance of the object from the lens plane, l. Second, the equation tells us that the disparity increases as we increase the separation between the cameras. Finally, it tells us that the disparity decreases as the object gets further away from the cameras, approaching zero for objects an infinite distance away. (You can see this when you watch 3D content without wearing the special 3D glasses: The “distant” objects can be seen by the naked eye, whereas the near objects appear blurry to the naked eye, because the value of ρ is greater.)
It should be clear from this equation that if a stereopair is available and corresponding points can be found in the left and right pictures, then the disparity between those points can be measured and the distance to the point can be computed.
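As a minimal sketch of that last step, the two helpers below convert between distance and disparity. They assume the relationship ρ = f·i / l for the coplanar geometry described above; the function names are hypothetical, and f, i, l and ρ must all be expressed in the same length unit.

```c
/* Disparity on the focal plane for a point at distance l from the lens plane,
 * assuming rho = f * i / l (f: focal length, i: camera separation). */
double disparity_from_depth(double f, double i, double l)
{
    return f * i / l;
}

/* Inverse relation: distance to the point given a measured disparity.
 * rho must be non-zero; a disparity of zero corresponds to a point
 * infinitely far away. */
double depth_from_disparity(double f, double i, double rho)
{
    return f * i / rho;
}
```

For instance, with f = 50 and i = 65 (both in millimeters), a point 2000 mm away yields a disparity of 1.625 mm on the sensor, and feeding that disparity back in recovers the 2000 mm distance.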
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Thoughts on 3D after NAB
I just returned from this year’s NAB show, where I was bombarded with 3D demos in virtually every booth. Most of the factors driving this 3D superabundance originate outside of the broadcast industry itself. First, TV manufacturers are hot on 3D as a way to get everyone who just bought an HDTV to upgrade to a new 3D enabled display. Cinema owners like 3D because they can charge more for the tickets. The Blu-ray consortium recently standardized a method for storing 3D video on BD discs, and hopes to enable and piggyback on the efforts of the TV manufacturers as a way to drive sales of 3D Blu-ray players and finally displace DVDs. Hollywood has begun producing more 3D movies too, including the phenomenally successful Avatar (which will be released on 3D Blu-ray very soon). So naturally the broadcast industry needs to be prepared to author and carry 3D signals.
Most of the demos I saw were nothing more than “here, put on these glasses”…in other words, “me too” type demos. (Although after seeing all this 3D interest I did wonder if Philips regretted terminating their autostereoscopic display effort last year!)
Nevertheless, I did see two problems addressed that I found technically interesting. First, real-time 2D to 3D conversion (see here and here), and second, automatic 3D quality monitoring.
I worked a lot on the problem of compressing stereopairs as part of my Ph.D. research, and I also spent time thinking about 3D video quality assessment. However, I had never considered the problem of real-time 2D to 3D conversion, so the show got me thinking. It’s a pretty tricky problem!
Converting a 2D video stream to 3D can be partitioned into two fundamental steps. First, creating a depth map for each video image, and second, using the depth map to construct a second viewpoint. Although both steps are challenging, the first step feels substantially harder to me.
With regards to the first step, a sequence of 2D video images must be analyzed to extract a depth map. Several special cases are worth discussing, but I’ll only mention two. First, consider the case where the camera is stationary and a 3D object moves through the field of view. The closer points on that object will have frame-to-frame pixel displacements that are larger than those for object points that are further away. Therefore, one useful approach for deriving information for a depth map would be the following: a) segment the image into two regions: moving and stationary; b) segment the moving areas into distinct objects using various clues such as color and proximity; c) find distinct matching points on the moving objects in two different frames; d) determine depths for those matching points based on the measured point displacements; e) interpolate the depth map for non-matched moving object points.
As a second special case, consider the situation where nothing is moving in the video sequence for many frames in a row. In this case, occlusion becomes a major depth cue. If one object is in front of another, then it will occlude the background object, and it must be closer. If an image can be segmented into objects, and an occlusion map can be deduced, then different depths can be assigned to different objects based on where they lie in the occlusion map. Other clues that may be algorithmically exploitable could stem from perspective considerations applied to the edges of identified objects.
Many powerful depth clues will be hard to take advantage of algorithmically—although humans can exploit them easily—because they involve recognizing objects. For example, we can easily recognize two humans in a picture, and determine whether or not they are adults or children. We know that if two adult males appear in the picture, and one appears substantially taller than the other (and isn’t holding a basketball), then the shorter one is further away. I suspect that taking advantage of this sort of knowledge is beyond the capability of today’s real-time (and non real-time!) processing. Nevertheless, I was amazed at how well the systems I saw at the show appeared to work.
With regards to the second step, given an image and a depth map, a second view can be created from the first by displacing each pixel of the original image with a disparity value corresponding to its depth. In practice it won’t be that easy. Why? Because after displacing pixels with their appropriate disparities, gaps will appear in the new image. These gaps result from image detail that is visible in one image but not in the other, so the gaps will need to be interpolated or otherwise synthesized in some reasonable way.
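The gap problem is easy to see on a single scanline. The sketch below is only an illustration under simplifying assumptions, not how any of the systems at the show work: disparities are non-negative whole pixels, pixels shift to the left, nearer pixels (larger disparity) win when two land on the same spot, and gaps are filled by repeating a neighboring pixel. The names `warp_scanline`, `fill_holes` and `HOLE` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define HOLE (-1)   /* marks a position that received no pixel */

/* Build one scanline of the second view by shifting each source pixel left
 * by its disparity. dst receives pixel values or HOLE; dst_disparity is a
 * caller-supplied scratch buffer of the same width. */
void warp_scanline(const uint8_t *src, const int *disparity, size_t width,
                   int *dst, int *dst_disparity)
{
    size_t x;

    for (x = 0; x < width; x++) {
        dst[x] = HOLE;
        dst_disparity[x] = -1;
    }
    for (x = 0; x < width; x++) {
        long target = (long)x - disparity[x];
        if (target < 0 || target >= (long)width)
            continue;                       /* falls outside the new view */
        if (disparity[x] > dst_disparity[target]) {
            dst[target] = src[x];           /* nearer pixel occludes farther one */
            dst_disparity[target] = disparity[x];
        }
    }
}

/* Crude gap filling: repeat the nearest filled pixel to the left, and for
 * gaps at the start of the line, the nearest filled pixel to the right. */
void fill_holes(int *dst, size_t width)
{
    size_t x;
    int last = HOLE;

    for (x = 0; x < width; x++) {
        if (dst[x] == HOLE) dst[x] = last; else last = dst[x];
    }
    last = HOLE;
    for (x = width; x-- > 0; ) {
        if (dst[x] == HOLE) dst[x] = last; else last = dst[x];
    }
}
```

With source pixels {10, 20, 30, 40, 50, 60} and disparities {0, 0, 2, 2, 0, 0}, the two foreground pixels move left and cover the background, leaving two gaps where they used to be. Repeating a neighbor is the simplest possible fill; a real converter would need something smarter, such as synthesizing plausible background detail.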
With regards to 3D video quality assessment, I just want to interject a note of caution. I was encouraged to see that several vendors have made progress in developing systems that automatically approximate the “mean opinion scores” that subjective human evaluation tests would assign to various image sequences. However, when dealing with 3D video, the whole is greater than the sum of its parts. If the algorithmic approach implemented is to naively apply 2D image quality assessment to the left and right pictures independently, and then average the scores together, the result is likely to not correspond at all to a human’s subjective viewing experience. For those of you who wear glasses, like me, you can experience this directly if one of your eyes is better than the other. Take off your glasses, look at the world around you, and you will see it with the resolution of your better eye; but you will still have stereoscopic vision. This effect will ultimately need to be taken into account in automatic systems that purport to algorithmically assess the quality of a 3D image sequence.
Encoders Aren’t Commodities
My partner Ben Mesander had a really cool post the other day: An h.264 encoder written in 30 lines of C code.
Ben’s encoder outputs completely valid h.264, but it doesn’t actually compress anything. (What do you expect from 30 lines!) In fact, because of the necessary h.264 headers, the output of Ben’s encoder is larger than the input.
This is a dramatic example of something that I find interesting about the codec marketplace: Decoders are commodities, but encoders are highly differentiated. People often misunderstand this dynamic, however.
A video decoder, if it works, has to follow the relevant specification. There are hundreds of “tricks” that a baseline profile h.264 encoder could use, and so a baseline-profile decoder must be able to handle all of them. So there’s really not room for a lot of differentiation between h.264 decoders. Sure, one decoder might use less CPU than another. But mostly, if you’re looking to buy a decoder, you should shop based on price.
Another way to say the same thing is that a codec specification details how to write a decoder. The spec lays out what a compliant bitstream looks like, and specifies how you turn that bitstream into video or audio.
Encoders, as Ben showed, are completely different beasts. An encoder author can pick which of the tools provided by the standard he or she will use. In the extreme case, as Ben did, he can choose to use almost none of the tools. Therefore, there can be a huge difference in compression efficiency—and thus video quality—between two encoders.
You might think this is obvious, but if so you should walk around the security industry’s ISC West trade show this week. You will find all sorts of vendors claiming that their h.264 DVR is the same as their competitor’s DVR, or claiming that their h.264 IP camera is better than an MPEG-4 IP camera. Maybe so, and maybe not: Just because h.264 is a more modern and complex codec than MPEG-4 part 2, it doesn’t automatically follow that a particular h.264 encoder is better than a particular MPEG-4 encoder.
Ultimately, the only way to compare two encoders is a head-to-head bakeoff, where each encoder is set to the same data rate and fed the same content, and you view decoded video from the two at the same time.
Howdy Pierce is a managing partner of Cardinal Peak with a technical background in multimedia systems, software engineering and operating systems.
World’s Smallest h.264 Encoder
Recently I have been studying the h.264 video codec and reading the ISO spec. h.264 is a much more sophisticated codec than MPEG-2, which means that a well-implemented h.264 encoder has more compression tools at its disposal than the equivalent MPEG-2 encoder. But all that sophistication comes at a price: h.264 also has a big, complicated specification with a plethora of options, many of which are not commonly used, and it takes expertise to understand which parts are important to solve a given problem.
As a bit of a parlor trick, I decided to write the simplest possible h.264 encoder. I was able to do it in about 30 lines of code—although truth in advertising compels me to admit that it doesn’t actually compress the video at all!
While I don’t want to balloon this blog post with a detailed description of h.264, a little background is in order. An h.264 stream contains the encoded video data along with various parameters needed by a decoder in order to decode the video data. To structure this data, the bitstream consists of a sequence of Network Abstraction Layer (NAL) units.
Previous MPEG specifications allowed pictures to be coded as I-frames, P-frames, or B-frames. h.264 is more complex and wonderful. It allows individual frames to be coded as multiple slices, each of which can be of type I, P, or B, or even more esoteric types. This feature can be used in creative ways to achieve different video coding goals. In our encoder we will use one slice per frame for simplicity, and we will use all I-frames.
As with previous MPEG specifications, in h.264 each slice consists of one or more 16×16 macroblocks. Each macroblock in our 4:2:0 sampling scheme contains 16×16 luma samples, and two 8×8 blocks of chroma samples. For this simple encoder, I won’t be compressing the video data at all, so the samples will be directly copied into the h.264 output.
With that background in mind, for our simplest possible encoder, there are three NALs we have to emit:
- Sequence Parameter Set (SPS): Once per stream
- Picture Parameter Set (PPS): Once per stream
- Slice Header: Once per video frame
  - Slice Header information
  - Macroblock Header: Once per macroblock
  - Coded Macroblock Data: The actual coded video for the macroblock
Since the SPS, the PPS, and the slice header are static for this application, I was able to hand-code them and include them in my encoder as a sequence of magic bits.
Putting it all together, I came up with the following code for what I call “hello264”:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* SQCIF */
#define LUMA_WIDTH 128
#define LUMA_HEIGHT 96
#define CHROMA_WIDTH (LUMA_WIDTH / 2)
#define CHROMA_HEIGHT (LUMA_HEIGHT / 2)

/* YUV planar data, as written by ffmpeg */
typedef struct
{
    uint8_t Y[LUMA_HEIGHT][LUMA_WIDTH];
    uint8_t Cb[CHROMA_HEIGHT][CHROMA_WIDTH];
    uint8_t Cr[CHROMA_HEIGHT][CHROMA_WIDTH];
} __attribute__((__packed__)) frame_t;

frame_t frame;

/* H.264 bitstreams */
const uint8_t sps[] = { 0x00, 0x00, 0x00, 0x01, 0x67, 0x42, 0x00,
                        0x0a, 0xf8, 0x41, 0xa2 };
const uint8_t pps[] = { 0x00, 0x00, 0x00, 0x01, 0x68, 0xce,
                        0x38, 0x80 };
const uint8_t slice_header[] = { 0x00, 0x00, 0x00, 0x01, 0x05, 0x88,
                                 0x84, 0x21, 0xa0 };
const uint8_t macroblock_header[] = { 0x0d, 0x00 };

/* Write a macroblock's worth of YUV data in I_PCM mode.
   i is the macroblock row, j is the macroblock column. */
void macroblock(const int i, const int j)
{
    int x, y;

    /* Every macroblock except the first one in the frame gets a header */
    if (! ((i == 0) && (j == 0)))
    {
        fwrite(&macroblock_header, 1, sizeof(macroblock_header),
               stdout);
    }

    /* 16x16 luma samples */
    for (x = i*16; x < (i+1)*16; x++)
        for (y = j*16; y < (j+1)*16; y++)
            fwrite(&frame.Y[x][y], 1, 1, stdout);

    /* 8x8 Cb samples, then 8x8 Cr samples */
    for (x = i*8; x < (i+1)*8; x++)
        for (y = j*8; y < (j+1)*8; y++)
            fwrite(&frame.Cb[x][y], 1, 1, stdout);
    for (x = i*8; x < (i+1)*8; x++)
        for (y = j*8; y < (j+1)*8; y++)
            fwrite(&frame.Cr[x][y], 1, 1, stdout);
}

/* Write out SPS, PPS, and loop over input, writing out I slices */
int main(int argc, char **argv)
{
    int i, j;

    fwrite(sps, 1, sizeof(sps), stdout);
    fwrite(pps, 1, sizeof(pps), stdout);

    /* Stop at the first short read so no partial or repeated frame is written */
    while (fread(&frame, 1, sizeof(frame), stdin) == sizeof(frame))
    {
        fwrite(slice_header, 1, sizeof(slice_header), stdout);
        for (i = 0; i < LUMA_HEIGHT/16; i++)
            for (j = 0; j < LUMA_WIDTH/16; j++)
                macroblock(i, j);
        fputc(0x80, stdout); /* slice stop bit */
    }
    return 0;
}
(This source code is available as a single file here.)
In main(), the encoder writes out the SPS and PPS. Then it reads YUV data from standard input, stores it in a frame buffer, and then writes out an h.264 slice header. It then loops over each macroblock in the frame and calls the macroblock() function to output a macroblock header indicating the macroblock is coded as I_PCM, and inserts the YUV data.
To use the code, you will need some uncompressed video. To generate this, I used the ffmpeg package to convert a QuickTime movie from my Kodak Zi8 video camera from h.264 to SQCIF (128×96) planar YUV format sampled at 4:2:0:
ffmpeg.exe -i angel.mov -s sqcif -pix_fmt yuv420p angel.yuv
I compile the h.264 encoder:
gcc -Wall -ansi hello264.c -o hello264
And run it:
hello264 <angel.yuv >angel.264
Finally, I use ffmpeg to copy the raw h.264 NAL units into an MP4 file:
ffmpeg.exe -f h264 -i angel.264 -vcodec copy angel.mp4
Here is the resulting output:
There you have it—a complete h.264 encoder that uses minimal CPU cycles, with output larger than its input!
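To put a number on "larger than its input", the byte counts can be tallied directly from the listing. The sketch below simply adds up what the listing writes; the function names are hypothetical, and the constants are the lengths of the hand-coded arrays in the listing (11-byte SPS, 8-byte PPS, 9-byte slice header, 2-byte macroblock header, 1 stop byte).

```c
#include <stddef.h>

enum {
    SPS_BYTES = 11,          /* sizeof(sps) in the listing */
    PPS_BYTES = 8,           /* sizeof(pps) */
    SLICE_HEADER_BYTES = 9,  /* sizeof(slice_header) */
    MB_HEADER_BYTES = 2,     /* sizeof(macroblock_header) */
    STOP_BYTES = 1           /* the 0x80 written after each frame */
};

/* Bytes in one 4:2:0 planar input frame: luma plus two quarter-size chroma planes. */
size_t raw_frame_bytes(size_t width, size_t height)
{
    return width * height + 2 * (width / 2) * (height / 2);
}

/* Number of 16x16 macroblocks in a frame. */
size_t macroblocks_per_frame(size_t width, size_t height)
{
    return (width / 16) * (height / 16);
}

/* Bytes written per frame: slice header, a header for every macroblock
 * except the first, the uncompressed samples, and the stop byte. */
size_t encoded_frame_bytes(size_t width, size_t height)
{
    return SLICE_HEADER_BYTES
         + (macroblocks_per_frame(width, height) - 1) * MB_HEADER_BYTES
         + raw_frame_bytes(width, height)
         + STOP_BYTES;
}

/* Bytes written for a whole stream of the given number of frames. */
size_t stream_bytes(size_t width, size_t height, size_t frames)
{
    return SPS_BYTES + PPS_BYTES + frames * encoded_frame_bytes(width, height);
}
```

For SQCIF (128×96) this gives 18,432 input bytes and 18,536 output bytes per frame, an overhead of 104 bytes per frame, plus 19 bytes once per stream for the SPS and PPS.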
The next thing to add to this encoder would be CAVLC coding of macroblocks and intra prediction. The encoder would still be lossless at this point, but there would start to be compression of data. After that, the next logical step would be quantization to allow lossy compression, and then I would add P slices. As a development methodology, I prefer to bring up a simplistic version of an application, get it running, and then add refinements iteratively.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
The Math Behind Analog Video Resolution
The world is moving in the direction of HDTV, but NTSC “standard def” signals are still common for many reasons and will remain so. One important reason is that cameras that output NTSC are widely available and cheap! Many applications, including a lot of security applications, simply don’t require the resolution of HDTV…and don’t want to incur the camera cost and bandwidth hit it requires.
So, what is resolution anyway, especially with regards to an analog NTSC video signal? Analog video cameras, especially in the CCTV industry, are sold using “horizontal TV lines”, or HTVL, as one of their key specifications. Unfortunately, the math behind that concept is not well understood.
To understand resolution we start with the aspect ratio. The aspect ratio of a picture is the ratio of its width to its height. Different aspect ratios are in use today for different applications. HDTV has an aspect ratio of 16:9. Standard definition TV has an aspect ratio of 4:3. In general, resolution is measured in a circle whose diameter is equivalent to a picture’s smallest dimension. The diagram below illustrates the case for NTSC:

In the above diagram, the circle has a diameter of 3 units or 1 “picture height”.
Now, imagine a uniformly spaced sequence of vertical black lines of constant width. The white space between the black lines should be the same width as the black lines themselves. Counting both black and white lines, how many lines can be physically resolved within the above circle? The answer to this question is the horizontal resolution. Before we can go further, we need some facts regarding NTSC:
First, there are 525 scan lines per picture, and the horizontal scanning frequency is approximately 15,734.26573 Hz.
This is one of NTSC’s magic numbers. The horizontal line time is therefore approximately 1 / 15,734.26573, or 63.5556 µsec.
Second, the horizontal blanking period is 10.7 µsec. During this period a horizontal sync pulse is transmitted, as well as a chroma burst (to enable decoders to demodulate the correct color), and a reference black level. The active line time—the period during which information is actually being drawn on the visible screen—is therefore 63.5556 – 10.7, or 52.8556 µsec.
Finally, the highest broadcast luminance frequency is 4.2 MHz.
Based on the above, we can compute the highest horizontal resolution that can be present in an NTSC signal as follows:
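Using the figures quoted above, the calculation can be reconstructed as a product of four factors. The ordering of the factors is an assumption, chosen so that the bandwidth and the active line time sit in the middle:

```latex
\[
2 \times 4.2\ \text{MHz} \times 52.8556\ \mu\text{s} \times \tfrac{3}{4} \approx 333\ \text{lines}
\]
```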
The product of the middle two parameters is the number of complete cycles present in one active line (the MHz and microseconds cancel); the factor of 2 is present because we count both the white and black lines in the horizontal resolution calculation. Multiplying by three-fourths takes into account the circle in which the horizontal resolution is defined. Bear in mind that this pattern would be displayed on an NTSC TV as grey, not as a crisp sequence of black and white lines, due to the rolloff of the various filters used to limit the video bandwidth.
In the vertical direction, resolution is limited by the number of scan lines. There are 480 scan lines in the visible area of a picture, so one would be tempted to assert that the vertical resolution is 480. However, imagine a uniformly spaced sequence of horizontal lines analogous to the vertical line pattern described above. We want this horizontal pattern to be discernible regardless of its relative relationship to the scanning lines. In other words, as the pattern is displaced vertically, the number of lines should still be easily resolved.
Imagine that we have a pattern of 480 horizontal lines, alternating black and white. When these lines are exactly midway between the scanning lines, the resulting picture will be grey. Why? Because the scanning will average the black and white inputs together for each reproduced line. So a pattern of 480 lines would not be discernible: the resolution must be less. The Kell factor measures by how much the vertical resolution is reduced relative to the number of scan lines, and it is usually assumed to be around 0.7 for a stationary pattern. This implies the following vertical resolution:
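With the 0.7 Kell factor and the 480 visible scan lines mentioned above, the arithmetic works out as follows (a reconstruction from those two figures):

```latex
\[
0.7 \times 480 = 336\ \text{lines}
\]
```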
The horizontal and vertical resolutions are therefore approximately equal. This was one of the design goals of NTSC.
Note that the Kell factor is sometimes assumed to be a larger number for a pattern in motion because visual averaging will cause the eye to “ignore” the occasional grey or blurry pattern. Values as high as 0.9 are assumed for moving images.
Note further that the reason the Kell factor does not apply horizontally is that it is possible to put down the dots on a CRT so close that regardless of the horizontal phase of a 4.2 MHz vertical line pattern, it will be resolved. However, the Kell factor does apply in the horizontal direction for a digital display such as a computer monitor or LCD panel.
In a digital SDTV system based on the CCIR 601 digital sampling standard, the luminance information is sampled at 13.5 MHz. The number of samples per active line is therefore given by 13.5 × 52.8556 = 713.56. This number is often rounded up to 720 samples; rounding up provides some headroom on either side of the visible line to hide the edge effects of various digital processing operations such as filtering. Plus, margin is generally good in any design!
Nyquist sampling theory says that sampling at 13.5 MHz would theoretically allow a horizontal frequency as high as 13.5 / 2 = 6.75 MHz to be captured. However, because of the inability to implement perfect anti-aliasing filters, this cannot be achieved in practice. A reduction factor of 0.75 is appropriate, implying that the CCIR-601 sampling standard is good for horizontal frequencies as high as 0.75 × 6.75 = 5.06 MHz. This is substantially better than old-fashioned analog NTSC broadcasts. Therefore, on a good monitor with analog component input, a CCIR-601 signal can achieve the following horizontal resolution:
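Applying the same four-factor calculation used for the analog case, with 5.06 MHz in place of 4.2 MHz (a reconstruction from the figures in the text):

```latex
\[
2 \times 5.06\ \text{MHz} \times 52.8556\ \mu\text{s} \times \tfrac{3}{4} \approx 401\ \text{lines}
\]
```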
For a SIF resolution picture (360×240), we are effectively sampling at 6.75 MHz, not 13.5 MHz, so the highest horizontal frequency that can be reproduced is 0.75 × (6.75/2) = 2.53 MHz, which corresponds to a horizontal resolution of:
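Again using the same four factors, now with 2.53 MHz (a reconstruction from the figures in the text):

```latex
\[
2 \times 2.53\ \text{MHz} \times 52.8556\ \mu\text{s} \times \tfrac{3}{4} \approx 200.6\ \text{lines}
\]
```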
Bear in mind that to achieve this number one needs to do an excellent job of filtering.
In the case of displaying a SIF picture on a computer monitor, we can approximate the horizontal resolution by using a Kell factor in the horizontal direction. This yields the following:
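Assuming 360 pixels per line, a Kell factor of 0.7 and the 3/4 aspect-ratio correction, the estimate works out to:

```latex
\[
360 \times 0.7 \times 0.75 = 189\ \text{lines}
\]
```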
In the above, the 0.7 is the Kell factor and the 0.75 is to account for the aspect ratio (remember, resolution is computed inside the circle). A value of 360 was used for purity’s sake. The MPEG world uses 352 and 704 because they are related by a factor of 2 and both are multiples of 16 (360 is not a multiple of 16). It also means there is a little less data to compress!
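For readers who want to try other bandwidths or sampling rates, the arithmetic in this post can be collected into a few small C helpers. This is only a sketch built from the figures quoted above (63.5556 µsec line time, 10.7 µsec blanking, the 3/4 aspect factor); the function names are hypothetical.

```c
#define LINE_TIME_US   63.5556  /* one horizontal line, in microseconds */
#define BLANKING_US    10.7     /* horizontal blanking, in microseconds */
#define ASPECT_FACTOR  0.75     /* 3/4: resolution is counted per picture height */

/* Active (visible) portion of one line, in microseconds. */
double active_line_us(void)
{
    return LINE_TIME_US - BLANKING_US;
}

/* Horizontal resolution in TV lines for a luminance bandwidth in MHz:
 * cycles per active line, times 2 (black and white lines), times 3/4. */
double tvl_from_bandwidth(double bandwidth_mhz)
{
    return 2.0 * bandwidth_mhz * active_line_us() * ASPECT_FACTOR;
}

/* Highest usable frequency in MHz for a sampling rate in MHz:
 * the Nyquist limit scaled by a filter reduction factor such as 0.75. */
double usable_bandwidth_mhz(double sample_rate_mhz, double filter_factor)
{
    return filter_factor * sample_rate_mhz / 2.0;
}

/* Kell-factor estimate for a pixel-addressed display. */
double tvl_from_pixels(double pixels_per_line, double kell)
{
    return pixels_per_line * kell * ASPECT_FACTOR;
}
```

Calling tvl_from_bandwidth(4.2) gives roughly 333 lines for broadcast NTSC, tvl_from_bandwidth(usable_bandwidth_mhz(13.5, 0.75)) gives roughly 401 lines for CCIR-601, and tvl_from_pixels(360, 0.7) gives 189 lines for SIF on a computer monitor.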
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
On the Importance of Encrypting Video
This morning brought a front-page Wall St. Journal article that’s a bit of a jaw-dropper:
Militants in Iraq have used $26 off-the-shelf software to intercept live video feeds from U.S. Predator drones, potentially providing them with information they need to evade or monitor U.S. military operations.
…
The potential drone vulnerability lies in an unencrypted downlink between the unmanned craft and ground control. The U.S. government has known about the flaw since the U.S. campaign in Bosnia in the 1990s, current and former officials said. But the Pentagon assumed local adversaries wouldn’t know how to exploit it, the officials said.
After the Journal article, the Pentagon quickly let it be known that the problem has been fixed. But I’m stunned that it could have happened in the first place.
At one point, I had heard that video from Predator drones was transmitted as unencrypted analog NTSC video, with geo-spatial metadata encoded into the closed-captioning portion of the data stream following this specification—basically an industrial form of those annoyingly-advertised X10 wireless cameras. But I had assumed that the Pentagon would have long since upgraded the system to digital video with some reasonable form of encryption. I guess someone needs to do a little reading on the pros and cons of security through obscurity.
Howdy Pierce is a managing partner of Cardinal Peak with a technical background in multimedia systems, software engineering and operating systems.
Uploading Kodak Zi8 Videos to Flickr
Recently my mom bought me a Kodak Zi8 pocket HD video camera for my birthday. Thanks, Mom! You know what an engineer likes! I love photography, and I upload my photos to the Flickr photo sharing site. But I think my mom wanted some more home movies of my daughter.

The first day I had it, I took some videos, and went to share them on Flickr. Unfortunately, the videos did not properly transcode to the Flickr Flash-based format, and Flickr displayed the following:

I checked out Flickr’s video uploading FAQ and discovered that the Kodak Zi6 and Zi8 are the only two unsupported cameras of all the video cameras in the universe. Great, so I have this fun new camera, and it won’t work with my favorite sharing site. The Flickr staff hasn’t come up with a solution in over a year. But I don’t give up easy.
I talked to my partner Howdy, and thought about the possible things that might cause problems with Flickr’s transcoding process. The Zi8 produces MOV files, and within the MOV file there is an audio stream and a video stream, each encoded with some codec. The problem could lie with the outer system level MOV container, or one or both of the codecs. I decided to look inside the files and see what codecs were in use. Howdy suggested using tools from mpeg4ip for this, but I decided to be lazy and use the VLC media player, as I already had it installed on my computer.
I opened up one of the movies recorded with the camera, and went to the Tools->Codec Information window in VLC. This showed me that the video codec was avc1 and the audio was mp4a.

Now, “avc1” is just H.264 by another name, and “mp4a” is AAC, the standard MPEG-4 audio codec. Nothing looks too unusual. Well, MOV files are an Apple pseudo-standard that has evolved over time, so maybe the transcoder used by Flickr doesn’t like something about the MOV files produced by the Zi8.
Fortunately, I know I can convert from a MOV format container to an MPEG-4 container without transcoding either the video or the audio. Transcoding normally involves decoding and then encoding, so you want to avoid it for two reasons: One, it introduces an additional generation of lossy compression, so quality will suffer. And two, it’s computationally intensive, and therefore slow.
At the system layer, the MPEG-4 file format, formally specified in ISO/IEC 14496-14, is based on work originally done at Apple. As a result, the MOV and MP4 formats are very similar.
I decided to try ffmpeg, the open-source Swiss Army knife of video conversions. We have used ffmpeg on a number of different projects here at Cardinal Peak, and we’ve found it to be very reliable. It is available for Windows, Mac OS X, and Linux. Because of its flexibility and its command line interface, it can be a little tricky to use, but after some experimentation I figured out that I could change the MOV file to an MP4 file with the following command:
ffmpeg -i input.MOV -f mp4 -vcodec copy -acodec copy output.mp4
In this command line, input.MOV is the file from the Zi8 or Zi6 (e.g., 115_0164.MOV), and output.mp4 is the name of the output file. The -f mp4 option tells ffmpeg to write the output in the MP4 container format, and the -vcodec copy and -acodec copy options tell it to copy the video and audio streams from the MOV container into the MP4 container without transcoding them.
The command executes quickly, because the audio and video formats are not changed. I tried uploading the resulting .mp4 files to Flickr, and they work fine! I’m happy because I can use my new video camera with my favorite sharing site, and hopefully help others out.
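If you have a whole card full of clips, the same command can be wrapped in a small script. Here is a minimal Python sketch, assuming ffmpeg is on the PATH and the clips end in an uppercase .MOV extension; the function names remux_command and remux_directory are made up for illustration, and the ffmpeg options are exactly the ones from the command above.

```python
import subprocess
from pathlib import Path


def remux_command(src):
    """Build the ffmpeg argument list that rewraps one MOV file as MP4."""
    src = Path(src)
    dst = src.with_suffix(".mp4")  # 115_0164.MOV becomes 115_0164.mp4
    return ["ffmpeg", "-i", str(src), "-f", "mp4",
            "-vcodec", "copy", "-acodec", "copy", str(dst)]


def remux_directory(directory):
    """Remux every .MOV file in a directory, stopping on the first failure."""
    for src in sorted(Path(directory).glob("*.MOV")):
        subprocess.run(remux_command(src), check=True)


if __name__ == "__main__":
    remux_directory(".")
```

The output file lands next to the original with the same base name, so the original MOV is left untouched.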

UPDATE 11/12/2009: It looks like Flickr might now be accepting the raw MOV files from the Zi8 directly. They have removed a previous caveat from their video uploading FAQ linked above. Please leave a comment if you have more information.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Video-Aware Network Elements
Mitch Vine responded to my last post with this thought-provoking question:
When video is streamed across IP networks or across wireless network links, the network links can sometimes be a bottleneck, unable to perform at desired data rates. Any thoughts on how network elements in the path between the encoder and decoder can be better citizens in the streaming process? So for example if a wireless element is essentially a layer 2 bridge, transparent to TCP or UDP, what would you do to minimize its impact on video quality in times when the wireless element has temporarily slowed due to some external condition like RF interference.
I thought this was interesting enough to merit a response in the form of a full blog post. The post Mitch responded to was about RTP video, but the same answer applies to a wide range of streaming video protocols, including RTP, ASF, and MPEG-2.
Approaches that could be applied today
There are two strategies that can be employed today to adapt the data rate of a video stream when needed:
- Send a message back to the video source, and ask it to reduce its transmission rate. I’m not going to talk too much about this, because it’s not a general solution; it doesn’t work well in the case of multicast video, and it doesn’t work well in the case of pre-encoded video delivered from a video-on-demand server.
- Be intelligent about which packets to drop. This is interesting because the subjective disruption caused by losing a packet varies with the importance of that packet. If you can pick the right packets to shoot in the head, you can minimize the glitches seen by the user.
Concentrating on the second approach, we can think about what packets to drop. I’m not certain how complex we are allowed to assume the wireless element is, or how much buffering such a device has. Basically, the more deeply this device is capable of examining the data it is carrying–and the more packets it can look at before selecting which one to drop–the smarter it can be about selecting exactly the right packets to discard.
So here’s a list of strategies in increasing order of complexity:
First, I would always discard UDP packets before TCP packets. The rationale for this is that a discarded TCP packet is just going to cause a retransmission in a very short period of time, so unless you are only trying to reduce the data rate right at this instant, discarding a TCP packet is probably a bad idea if it can be avoided.
So at this point, assume that we have a handful of UDP packets, all carrying video and audio. (I have no way of offering a prioritization between video data and data for other applications. As a video guy, of course, I’d say anything else is less important!)
Within a set of UDP streams, I’d try to eliminate unicast packets before multicast packets, on the theory that multicast video is likely to have more than one person watching, and it’s more likely to be a live transmission.
If you have more than one unicast stream, there are two theories as to how to proceed. You can either democratically spread the pain around by dropping a few packets from each stream, or you can blatantly discriminate by randomly picking one stream and beating the hell out of it while attempting to preserve the other streams. Personally I think the discrimination approach is the best answer; you’ll probably get that one user to give up on her stream altogether, and then you’ve pretty dramatically reduced the overall data rate.
Within any one stream, we’d prefer dropping video packets to audio packets, for two reasons. One, the audio has a much lower data rate, so there’s not that much savings there. And two, for most users and most applications, losing audio (which will manifest as a dropout or garbled period) is subjectively worse than losing a bit of video.
We’re starting to get to the point where the next few tricks require pretty deep introspection into the data packets, so they’re probably not super feasible. But even if this is mostly academic, let’s keep going.
Within a video stream, it is best not to drop packets in a key frame. For instance, in the MPEG codecs, it’s best to eliminate packets from B frames before eliminating packets from reference (I or P) frames. This is because the video glitch resulting from a corrupted B frame will only last one frame-time, or 33 msec if the video in question is 30 frames per second. Many users won’t even notice it. In contrast, the glitch resulting from a corrupted I frame will extend over the entire group of pictures, and that could be anywhere from a half second up to several seconds – definitely noticeable.
Finally, if I get to be really picky, it’s best to drop packets that are towards the end of the frame. Compressed video is transmitted in a top-to-bottom manner, so if you’re going to corrupt a frame of video, you’ll minimize the impact if you drop one of the later packets in the frame, which confines the damage to the bottom of the picture.
There are probably some more strategies, so if you’ve got thoughts please share them in the comments.
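To make the ordering concrete, the rules above can be folded into a single sort key. This is only a sketch: the Packet class and its fields are hypothetical stand-ins for whatever a real bridge could extract by inspecting traffic, and a real device would need to classify packets before it could fill them in.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    # Hypothetical fields; a real bridge would have to derive these itself.
    is_tcp: bool = False
    is_multicast: bool = False
    is_victim_stream: bool = False   # the one unicast stream picked to suffer
    is_audio: bool = False
    frame_type: str = "B"            # "I", "P" or "B"; ignored for audio
    position_in_frame: float = 0.0   # 0.0 = first packet of frame, 1.0 = last


def drop_key(p):
    """Sort key: packets that sort first are the first to be discarded."""
    return (
        p.is_tcp,                # UDP before TCP
        p.is_multicast,          # unicast before multicast
        not p.is_victim_stream,  # the sacrificed stream before the others
        p.is_audio,              # video before audio
        p.frame_type != "B",     # B frames before reference (I or P) frames
        -p.position_in_frame,    # packets late in the frame before early ones
    )


def choose_drops(queue, count):
    """Return the `count` packets that the rules above would discard first."""
    return sorted(queue, key=drop_key)[:count]
```

Because Python compares tuples element by element, each rule only breaks ties left by the rules before it, which matches the "increasing order of complexity" framing: a simple device can stop after the first one or two fields.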
Approaches that might work in the future
One conceptually simple approach to solve this problem would be for the source of the video to somehow indicate the relative importance of each packet in a header that is easily interpreted by network elements. This is the basic idea behind QoS, and you can find lots of variations on the Internet. A slightly different approach, used by DiffServ and MPLS, would be to group similar streams together into pre-defined priority classes at an ingress point; this allows interior network nodes to drop lower-priority traffic.
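One existing header field that network elements can read cheaply is the DiffServ code point (DSCP) in the IP header. The Python sketch below shows how a sender might mark a UDP socket with it. The DSCP numbers are the standard EF and AF4x code points, but the mapping from packet importance to code point is purely an illustrative assumption, as are the names DSCP_BY_IMPORTANCE, tos_byte and mark_socket. The socket.IP_TOS option is available on Linux and macOS, and routers along the path are free to ignore or rewrite the marking.

```python
import socket

# Illustrative mapping only; nothing in DiffServ mandates these choices.
DSCP_BY_IMPORTANCE = {
    "audio": 46,             # EF (expedited forwarding)
    "reference_video": 34,   # AF41: class 4, low drop precedence
    "disposable_video": 38,  # AF43: class 4, high drop precedence
}


def tos_byte(dscp):
    """The DSCP occupies the upper six bits of the old IPv4 TOS byte."""
    return dscp << 2


def mark_socket(sock, importance):
    """Ask the OS to stamp every packet sent on `sock` with the chosen DSCP."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS,
                    tos_byte(DSCP_BY_IMPORTANCE[importance]))


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    mark_socket(sock, "disposable_video")
    sock.close()
```

Since the marking applies to a whole socket, a sender using this approach would either open one socket per importance class or change the option between sends.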
It’s worth mentioning that there is a class of codecs that involve what is known as scalable video compression (here’s an example). The general approach is to produce a base layer plus one or more enhancement layers, where the base layer by itself is decodable and will produce a useful image, and the enhancement layers add resolution or otherwise increase the video quality. If you’re lucky enough to have such a codec, then obviously the strategy would be to drop enhancement layer packets. Unfortunately, I’m not aware of scalable video compression being used in any commercially interesting manner currently.
Finally, another potential approach is to rate-shape or transrate the video stream as needed by dynamically recoding packets, right in the middle of the network, to reduce video quality on the fly. Devices exist that do just this for use in cable head-ends, but I’m going to assume that the computational complexity of rate shaping is too high for your basic Layer 2 bridge today. (Drop me a line if I’m wrong; we’d love to build a little embedded rate-shaper….)
Thanks for the question!






