Posts Tagged ‘Video’
Thoughts on 3D after NAB
I just returned from this year’s NAB show, where I was bombarded with 3D demos in virtually every booth. Most of the factors driving this 3D superabundance originate outside of the broadcast industry itself. First, TV manufacturers are hot on 3D as a way to get everyone who just bought an HDTV to upgrade to a new 3D enabled display. Cinema owners like 3D because they can charge more for the tickets. The Blu-ray consortium recently standardized a method for storing 3D video on BD discs, and hopes to enable and piggyback on the efforts of the TV manufacturers as a way to drive sales of 3D Blu-ray players and finally displace DVDs. Hollywood has begun producing more 3D movies too, including the phenomenally successful Avatar (which will be released on 3D Blu-ray very soon). So naturally the broadcast industry needs to be prepared to author and carry 3D signals.
Most of the demos I saw were nothing more than “here, put on these glasses”…in other words, “me too” type demos. (Although after seeing all this 3D interest I did wonder if Philips regretted terminating their autostereoscopic display effort last year!)
Nevertheless, I did see two problems addressed that I found technically interesting. First, real-time 2D to 3D conversion (see here and here), and second, automatic 3D quality monitoring.
I worked a lot on the problem of compressing stereopairs as part of my Ph.D. research, and I also spent time thinking about 3D video quality assessment. However, I had never considered the problem of real-time 2D to 3D conversion, so the show got me thinking. It’s a pretty tricky problem!
Converting a 2D video stream to 3D can be partitioned into two fundamental steps. First, creating a depth map for each video image, and second, using the depth map to construct a second viewpoint. Although both steps are challenging, the first step feels substantially harder to me.
With regard to the first step, a sequence of 2D video images must be analyzed to extract a depth map. Several special cases are worth discussing, but I’ll only mention two. First, consider the case where the camera is stationary and a 3D object moves through the field of view. The closer points on that object will have frame-to-frame pixel displacements that are larger than those for object points that are farther away. Therefore, one useful approach for deriving information for a depth map would be the following:
a) segment the image into two regions: moving and stationary;
b) segment the moving areas into distinct objects using various clues such as color and proximity;
c) find distinct matching points on the moving objects in two different frames;
d) determine depths for those matching points based on the measured point displacements;
e) interpolate the depth map for non-matched moving object points.
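Steps c) and d) can be sketched in a few lines of C. This is only an illustration under strong assumptions that are mine, not taken from any system shown at NAB: motion is purely horizontal and to the right, matching is done on a single scan line with a sum of absolute differences, and depth is treated as inversely proportional to displacement. The names `best_shift` and `relative_depth` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: find the rightward shift (0..max_shift pixels) that
 * best aligns a block of one scan line in the previous frame with the same
 * scan line in the current frame, using the sum of absolute differences. */
int best_shift(const uint8_t *prev, const uint8_t *cur, int width,
               int block_start, int block_len, int max_shift)
{
    int shift, k, best = 0;
    long best_sad = LONG_MAX;

    assert(block_start >= 0 && block_len > 0);
    assert(block_start + block_len <= width);

    for (shift = 0; shift <= max_shift; shift++) {
        long sad = 0;
        if (block_start + block_len + shift > width)
            break;                      /* candidate would leave the line */
        for (k = 0; k < block_len; k++)
            sad += labs((long)prev[block_start + k]
                        - (long)cur[block_start + shift + k]);
        if (sad < best_sad) {
            best_sad = sad;
            best = shift;
        }
    }
    return best;
}

/* Depth relative to a reference point whose displacement is ref_shift,
 * assuming displacement is inversely proportional to distance.
 * Returns -1.0 when there is no motion to measure. */
double relative_depth(int shift, int ref_shift)
{
    if (shift <= 0 || ref_shift <= 0)
        return -1.0;
    return (double)ref_shift / (double)shift;
}
```

A point that moves twice as far between frames as the reference point comes out at half the reference depth. A real converter would still need the segmentation and interpolation steps around this.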
As a second special case, consider the situation where nothing is moving in the video sequence for many frames in a row. In this case, occlusion becomes a major depth cue. If one object is in front of another, then it will occlude the background object, and it must be closer. If an image can be segmented into objects, and an occlusion map can be deduced, then different depths can be assigned to different objects based on where they lie in the occlusion map. Other clues that may be algorithmically exploitable could stem from perspective considerations applied to the edges of identified objects.
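As a rough illustration of turning an occlusion map into depths, the sketch below assigns each object a layer number: 0 for objects that nothing occludes, and larger numbers for objects that sit behind longer chains of occluders. The representation (a small `in_front` matrix) and the name `depth_layers` are assumptions made for the example, and it presumes the occlusion relation contains no cycles.

```c
#include <assert.h>

#define MAX_OBJECTS 16

/* in_front[a][b] != 0 means object a occludes (is in front of) object b.
 * On return, layer[i] is the length of the longest chain of occluders in
 * front of object i, so a larger value means farther away. */
void depth_layers(int n, int in_front[MAX_OBJECTS][MAX_OBJECTS],
                  int layer[MAX_OBJECTS])
{
    int pass, a, b;

    assert(n >= 0 && n <= MAX_OBJECTS);
    for (a = 0; a < n; a++)
        layer[a] = 0;

    /* n - 1 relaxation passes are enough for the longest possible chain */
    for (pass = 0; pass + 1 < n; pass++)
        for (a = 0; a < n; a++)
            for (b = 0; b < n; b++)
                if (in_front[a][b] && layer[b] < layer[a] + 1)
                    layer[b] = layer[a] + 1;
}
```

The layer numbers only give an ordering. Mapping them to actual depth values would need additional cues such as the perspective considerations mentioned above.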
Many powerful depth cues will be hard to take advantage of algorithmically, although humans can exploit them easily, because they involve recognizing objects. For example, we can easily recognize two humans in a picture, and determine whether they are adults or children. We know that if two adult males appear in the picture, and one appears substantially taller than the other (and isn’t holding a basketball), then the shorter one is farther away. I suspect that taking advantage of this sort of knowledge is beyond the capability of today’s real-time (and non real-time!) processing. Nevertheless, I was amazed at how well the systems I saw at the show appeared to work.
With regard to the second step, given an image and a depth map, a second view can be created from the first by displacing each pixel of the original image with a disparity value corresponding to its depth. In practice it won’t be that easy. Why? Because after displacing pixels with their appropriate disparities, gaps will appear in the new image. These gaps result from image detail that is visible in one image but not in the other, so the gaps will need to be interpolated or otherwise synthesized in some reasonable way.
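Here is a minimal sketch of that second step for a single scan line. It makes several assumptions of my own: disparities are non-negative integers, pixels shift to the left by their disparity, a closer pixel wins when two land on the same spot, and each gap is filled by repeating whichever neighbor is farther away (a crude stand-in for real background synthesis). The function name `synthesize_row` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Build one scan line of a second view from src[] and per-pixel disparity[].
 * Returns the number of gap pixels that had to be filled, or -1 if memory
 * could not be allocated. */
int synthesize_row(const uint8_t *src, const int *disparity,
                   uint8_t *dst, int width)
{
    int x, k, gaps = 0;
    int *winner;   /* disparity of the pixel written at each x; -1 = gap */

    assert(src != NULL && disparity != NULL && dst != NULL && width > 0);
    winner = malloc((size_t)width * sizeof *winner);
    if (winner == NULL)
        return -1;
    for (x = 0; x < width; x++)
        winner[x] = -1;

    /* Displace each pixel; on a collision the closer pixel wins */
    for (x = 0; x < width; x++) {
        int tx = x - disparity[x];
        assert(disparity[x] >= 0);
        if (tx >= 0 && tx < width && disparity[x] > winner[tx]) {
            dst[tx] = src[x];
            winner[tx] = disparity[x];
        }
    }

    /* Fill each run of gap pixels from its farther-away neighbor */
    x = 0;
    while (x < width) {
        int start, left, right, from;
        if (winner[x] >= 0) {
            x++;
            continue;
        }
        start = x;
        while (x < width && winner[x] < 0)
            x++;                        /* x is now one past the gap */
        left = start - 1;               /* -1 if the gap touches the edge */
        right = (x < width) ? x : -1;
        if (left >= 0 && (right < 0 || winner[left] <= winner[right]))
            from = left;
        else
            from = right;
        for (k = start; k < x; k++) {
            dst[k] = (from >= 0) ? dst[from] : 0;
            gaps++;
        }
    }
    free(winner);
    return gaps;
}
```

Repeating a neighbor produces visible smearing on real images, which is exactly why gap filling is the hard part of this step.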
With regard to 3D video quality assessment, I just want to interject a note of caution. I was encouraged to see that several vendors have made progress in developing systems that automatically approximate the “mean opinion scores” that subjective human evaluation tests would assign to various image sequences. However, when dealing with 3D video, the whole is greater than the sum of its parts. If the algorithmic approach implemented is to naively apply 2D image quality assessment to the left and right pictures independently, and then average the scores together, the result is likely not to correspond at all to a human’s subjective viewing experience. For those of you who wear glasses, like me, you can experience this directly if one of your eyes is better than the other. Take off your glasses, look at the world around you, and you will see it with the resolution of your better eye; but you will still have stereoscopic vision. This effect will ultimately need to be taken into account in automatic systems that purport to algorithmically assess the quality of a 3D image sequence.
Encoders Aren’t Commodities
My partner Ben Mesander had a really cool post the other day: An h.264 encoder written in 30 lines of C code.
Ben’s encoder outputs completely valid h.264, but it doesn’t actually compress anything. (What do you expect from 30 lines!) In fact, because of the necessary h.264 headers, the output of Ben’s encoder is larger than the input.
This is a dramatic example of something that I find interesting about the codec marketplace: Decoders are commodities, but encoders are highly differentiated. People often misunderstand this dynamic, however.
A video decoder, if it works, has to follow the relevant specification. There are hundreds of “tricks” that a baseline profile h.264 encoder could use, and so a baseline-profile decoder must be able to handle all of them. So there’s really not room for a lot of differentiation between h.264 decoders. Sure, one decoder might use less CPU than another. But mostly, if you’re looking to buy a decoder, you should shop based on price.
Another way to say the same thing is that a codec specification details how to write a decoder. The spec lays out what a compliant bitstream looks like, and specifies how you turn that bitstream into video or audio.
Encoders, as Ben showed, are completely different beasts. An encoder author can pick which of the tools provided by the standard he or she will use. In the extreme case, as Ben did, he can choose to use almost none of the tools. Therefore, there can be a huge difference in compression efficiency—and thus video quality—between two encoders.
You might think this is obvious, but if so you should walk around the security industry’s ISC West trade show this week. You will find all sorts of vendors claiming that their h.264 DVR is the same as their competitor’s DVR, or claiming that their h.264 IP camera is better than an MPEG-4 IP camera. Maybe so, and maybe not: Just because h.264 is a more modern and complex codec than MPEG-4 part 2, it doesn’t automatically follow that a particular h.264 encoder is better than a particular MPEG-4 encoder.
Ultimately, the only way to compare two encoders is a head-to-head bakeoff, where each encoder is set to the same data rate and fed the same content, and you view decoded video from the two at the same time.
Howdy Pierce is a managing partner of Cardinal Peak with a technical background in multimedia systems, software engineering and operating systems.
World’s Smallest h.264 Encoder
Recently I have been studying the h.264 video codec and reading the ISO spec. h.264 is a much more sophisticated codec than MPEG-2, which means that a well-implemented h.264 encoder has more compression tools at its disposal than the equivalent MPEG-2 encoder. But all that sophistication comes at a price: h.264 also has a big, complicated specification with a plethora of options, many of which are not commonly used, and it takes expertise to understand which parts are important to solve a given problem.
As a bit of a parlor trick, I decided to write the simplest possible h.264 encoder. I was able to do it in about 30 lines of code—although truth in advertising compels me to admit that it doesn’t actually compress the video at all!
While I don’t want to balloon this blog post with a detailed description of h.264, a little background is in order. An h.264 stream contains the encoded video data along with various parameters needed by a decoder in order to decode the video data. To structure this data, the bitstream consists of a sequence of Network Abstraction Layer (NAL) units.
Previous MPEG specifications allowed pictures to be coded as I-frames, P-frames, or B-frames. h.264 is more complex and wonderful. It allows individual frames to be coded as multiple slices, each of which can be of type I, P, or B, or even more esoteric types. This feature can be used in creative ways to achieve different video coding goals. In our encoder we will use one slice per frame for simplicity, and we will use all I-frames.
As with previous MPEG specifications, in h.264 each slice consists of one or more 16×16 macroblocks. Each macroblock in our 4:2:0 sampling scheme contains 16×16 luma samples, and two 8×8 blocks of chroma samples. For this simple encoder, I won’t be compressing the video data at all, so the samples will be directly copied into the h.264 output.
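The macroblock arithmetic is easy to check in code. The sketch below counts the samples in one 4:2:0 macroblock and the macroblocks in a frame; the function names are made up for the example, and the 128×96 frame size is the one the encoder later in this post uses.

```c
#include <assert.h>

#define MB_SIZE 16   /* luma macroblock edge, in samples */

/* Samples in one 4:2:0 macroblock: 16x16 luma plus two 8x8 chroma blocks */
int samples_per_macroblock(void)
{
    return MB_SIZE * MB_SIZE + 2 * (MB_SIZE / 2) * (MB_SIZE / 2);
}

/* Macroblocks in a frame whose dimensions are multiples of 16 */
int macroblocks_per_frame(int luma_width, int luma_height)
{
    assert(luma_width > 0 && luma_height > 0);
    assert(luma_width % MB_SIZE == 0 && luma_height % MB_SIZE == 0);
    return (luma_width / MB_SIZE) * (luma_height / MB_SIZE);
}
```

With 8-bit samples that is 384 bytes per macroblock, and a 128×96 frame holds 48 macroblocks, or 18,432 bytes of uncompressed video.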
With that background in mind, for our simplest possible encoder, there are three NALs we have to emit:
- Sequence Parameter Set (SPS): Once per stream
- Picture Parameter Set (PPS): Once per stream
- Slice Header: Once per video frame
  - Slice Header information
  - Macroblock Header: Once per macroblock
  - Coded Macroblock Data: The actual coded video for the macroblock
Since the SPS, the PPS, and the slice header are static for this application, I was able to hand-code them and include them in my encoder as a sequence of magic bits.
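To make the magic bits a little less magic, here is a small sketch that locates a four-byte start code and decodes the one-byte NAL header that follows it. The field layout (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type) and the type numbers 5, 7 and 8 come from the h.264 spec; the function and struct names are my own, and only the three NAL types used in this post are named.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned forbidden_zero_bit;  /* 1 bit, must be 0 */
    unsigned nal_ref_idc;         /* 2 bits */
    unsigned nal_unit_type;       /* 5 bits */
} nal_header_t;

/* Offset of the byte after the first 00 00 00 01 start code, or -1 */
long after_start_code(const uint8_t *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + 4 <= len; i++)
        if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 0 && buf[i + 3] == 1)
            return (long)(i + 4);
    return -1;
}

/* Split the NAL header byte into its three fields */
nal_header_t parse_nal_header(uint8_t byte)
{
    nal_header_t h;

    h.forbidden_zero_bit = (byte >> 7) & 0x01;
    h.nal_ref_idc = (byte >> 5) & 0x03;
    h.nal_unit_type = byte & 0x1f;
    return h;
}

/* Names for the NAL unit types this encoder emits */
const char *nal_type_name(unsigned nal_unit_type)
{
    switch (nal_unit_type) {
    case 5:  return "IDR slice";
    case 7:  return "SPS";
    case 8:  return "PPS";
    default: return "other";
    }
}
```

Applied to the arrays in the listing that follows, the bytes 0x67, 0x68 and 0x05 decode to NAL unit types 7 (SPS), 8 (PPS) and 5 (IDR slice).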
Putting it all together, I came up with the following code for what I call “hello264”:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* SQCIF */
#define LUMA_WIDTH 128
#define LUMA_HEIGHT 96
#define CHROMA_WIDTH (LUMA_WIDTH / 2)
#define CHROMA_HEIGHT (LUMA_HEIGHT / 2)

/* YUV planar data, as written by ffmpeg */
typedef struct
{
    uint8_t Y[LUMA_HEIGHT][LUMA_WIDTH];
    uint8_t Cb[CHROMA_HEIGHT][CHROMA_WIDTH];
    uint8_t Cr[CHROMA_HEIGHT][CHROMA_WIDTH];
} __attribute__((__packed__)) frame_t;

frame_t frame;

/* H.264 bitstreams */
const uint8_t sps[] = { 0x00, 0x00, 0x00, 0x01, 0x67, 0x42, 0x00,
                        0x0a, 0xf8, 0x41, 0xa2 };
const uint8_t pps[] = { 0x00, 0x00, 0x00, 0x01, 0x68, 0xce,
                        0x38, 0x80 };
const uint8_t slice_header[] = { 0x00, 0x00, 0x00, 0x01, 0x05, 0x88,
                                 0x84, 0x21, 0xa0 };
const uint8_t macroblock_header[] = { 0x0d, 0x00 };

/* Write a macroblock's worth of YUV data in I_PCM mode.
   i is the macroblock row and j is the macroblock column. */
void macroblock(const int i, const int j)
{
    int x, y;

    /* Every macroblock except the first one in the slice gets a header */
    if (! ((i == 0) && (j == 0)))
    {
        fwrite(macroblock_header, 1, sizeof(macroblock_header),
               stdout);
    }

    /* x walks rows and y walks columns within the macroblock */
    for (x = i*16; x < (i+1)*16; x++)
        for (y = j*16; y < (j+1)*16; y++)
            fwrite(&frame.Y[x][y], 1, 1, stdout);
    for (x = i*8; x < (i+1)*8; x++)
        for (y = j*8; y < (j+1)*8; y++)
            fwrite(&frame.Cb[x][y], 1, 1, stdout);
    for (x = i*8; x < (i+1)*8; x++)
        for (y = j*8; y < (j+1)*8; y++)
            fwrite(&frame.Cr[x][y], 1, 1, stdout);
}

/* Write out SPS, PPS, and loop over input, writing out I slices */
int main(int argc, char **argv)
{
    int i, j;

    fwrite(sps, 1, sizeof(sps), stdout);
    fwrite(pps, 1, sizeof(pps), stdout);

    /* Stop at the first short read, so that end of input (or a truncated
       final frame) never produces an extra slice */
    while (fread(&frame, 1, sizeof(frame), stdin) == sizeof(frame))
    {
        fwrite(slice_header, 1, sizeof(slice_header), stdout);
        for (i = 0; i < LUMA_HEIGHT/16 ; i++)
            for (j = 0; j < LUMA_WIDTH/16; j++)
                macroblock(i, j);
        fputc(0x80, stdout); /* slice stop bit */
    }
    return 0;
}
(This source code is available as a single file here.)
In main(), the encoder writes out the SPS and PPS. Then it reads YUV data from standard input, stores it in a frame buffer, and then writes out an h.264 slice header. It then loops over each macroblock in the frame and calls the macroblock() function to output a macroblock header indicating the macroblock is coded as I_PCM, and inserts the YUV data.
To use the code, you will need some uncompressed video. To generate this, I used the ffmpeg package to convert a QuickTime movie from my Kodak Zi8 video camera from h.264 to SQCIF (128×96) planar YUV format sampled at 4:2:0:
ffmpeg.exe -i angel.mov -s sqcif -pix_fmt yuv420p angel.yuv
I compile the h.264 encoder:
gcc -Wall -ansi hello264.c -o hello264
And run it:
hello264 <angel.yuv >angel.264
Finally, I use ffmpeg to copy the raw h.264 NAL units into an MP4 file:
ffmpeg.exe -f h264 -i angel.264 -vcodec copy angel.mp4
Here is the resulting output:
There you have it—a complete h.264 encoder that uses minimal CPU cycles, with output larger than its input!
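Just how much larger can be worked out from the sizes of the hand-coded arrays in the listing. The sketch below is a back-of-the-envelope calculation, assuming one slice is written per complete input frame; the macro and function names are invented for the example.

```c
#include <assert.h>

/* Byte counts taken from the hello264 listing */
#define SPS_BYTES           11
#define PPS_BYTES            8
#define SLICE_HEADER_BYTES   9
#define MB_HEADER_BYTES      2
#define MB_PAYLOAD_BYTES   384   /* 16x16 luma + two 8x8 chroma samples */
#define MBS_PER_FRAME       48   /* (128 / 16) * (96 / 16) */

/* Size of the raw YUV input for a given number of frames */
long input_bytes(long frames)
{
    assert(frames >= 0);
    return frames * MBS_PER_FRAME * MB_PAYLOAD_BYTES;
}

/* Size of the h.264 output for the same number of frames */
long output_bytes(long frames)
{
    long per_frame = SLICE_HEADER_BYTES
                   + (MBS_PER_FRAME - 1) * MB_HEADER_BYTES /* first MB has none */
                   + MBS_PER_FRAME * MB_PAYLOAD_BYTES
                   + 1;                                    /* slice stop byte */

    assert(frames >= 0);
    return SPS_BYTES + PPS_BYTES + frames * per_frame;
}
```

Each frame grows from 18,432 to 18,536 bytes, an overhead of 104 bytes (a bit over half a percent), plus 19 bytes once per stream for the SPS and PPS.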
The next thing to add to this encoder would be CAVLC coding of macroblocks and intra prediction. The encoder would still be lossless at this point, but there would start to be compression of data. After that, the next logical step would be quantization to allow lossy compression, and then I would add P slices. As a development methodology, I prefer to bring up a simplistic version of an application, get it running, and then add refinements iteratively.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
The Math Behind Analog Video Resolution
The world is moving in the direction of HDTV, but NTSC “standard def” signals are still common for many reasons and will remain so. One important reason is that cameras that output NTSC are widely available and cheap! Many applications, including a lot of security applications, simply don’t require the resolution of HDTV…and don’t want to incur the camera cost and bandwidth hit it requires.
So, what is resolution anyway, especially with regard to an analog NTSC video signal? Analog video cameras, especially in the CCTV industry, are sold using “horizontal TV lines”, or HTVL, as one of their key specifications. Unfortunately, the math behind that concept is not well understood.
To understand resolution we start with the aspect ratio. The aspect ratio of a picture is the ratio of its width to its height. Different aspect ratios are in use today for different applications. HDTV has an aspect ratio of 16:9. Standard definition TV has an aspect ratio of 4:3. In general, resolution is measured in a circle whose diameter is equivalent to a picture’s smallest dimension. The diagram below illustrates the case for NTSC:

In the above diagram, the circle has a diameter of 3 units or 1 “picture height”.
Now, imagine a uniformly spaced sequence of vertical black lines of constant width. The white space between the black lines should be the same width as the black lines themselves. Counting both black and white lines, how many lines can be physically resolved within the above circle? The answer to this question is the horizontal resolution. Before we can go further, we need some facts regarding NTSC:
First, there are 525 scan lines per picture, and the horizontal scanning frequency is
$$f_H = \frac{4.5\ \text{MHz}}{286} \approx 15{,}734.26573\ \text{Hz}$$
This is one of NTSC’s magic numbers. The horizontal line time is therefore approximately 1 / 15,734.26573, or 63.5556 µsec.
Second, the horizontal blanking period is 10.7 µsec. During this period a horizontal sync pulse is transmitted, as well as a chroma burst (to enable decoders to demodulate the correct color), and a reference black level. The active line time—the period during which information is actually being drawn on the visible screen—is therefore 63.5556 – 10.7, or 52.8556 µsec.
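Finally, the highest broadcast luminance frequency is 4.2 MHz.
Based on the above, we can compute the highest horizontal resolution that can be present in an NTSC signal as follows:
$$2 \times 4.2\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 333\ \text{lines}$$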
The product of the middle two parameters is the number of complete cycles present in one active line (the MHz and microseconds cancel); the factor of 2 is present because we count both the white and black lines in the horizontal resolution calculation. Multiplying by three-fourths takes into account the circle in which the horizontal resolution is defined. Bear in mind that this pattern would be displayed on an NTSC TV as grey, not as a crisp sequence of black and white lines, due to the rolloff of the various filters used to limit the video bandwidth.
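For readers who prefer to check the arithmetic in code, here is a small sketch of the same calculation. The function names are mine; the inputs are the NTSC figures quoted above.

```c
#include <assert.h>

/* Line period in microseconds for a given horizontal scanning frequency */
double line_time_us(double line_freq_hz)
{
    assert(line_freq_hz > 0.0);
    return 1.0e6 / line_freq_hz;
}

/* Horizontal resolution in TV lines per picture height */
double analog_horizontal_tvl(double bandwidth_mhz, double active_line_us)
{
    /* MHz times microseconds gives complete cycles per active line */
    double cycles_per_line = bandwidth_mhz * active_line_us;

    /* Each cycle is one black plus one white line; 3/4 restricts the
       count to a width equal to the picture height */
    return 2.0 * cycles_per_line * 3.0 / 4.0;
}
```

With a bandwidth of 4.2 MHz and an active line time of 52.8556 µsec the result is just under 333 lines.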
In the vertical direction, resolution is limited by the number of scan lines. There are 480 scan lines in the visible area of a picture, so one would be tempted to assert that the vertical resolution is 480. However, imagine a uniformly spaced sequence of horizontal lines analogous to the vertical line pattern described above. We want this horizontal pattern to be discernible regardless of its relative relationship to the scanning lines. In other words, as the pattern is displaced vertically, the number of lines should still be easily resolved.
Imagine that we have a pattern of 480 horizontal lines, alternating black and white. When these lines are exactly midway between the scanning lines, the resulting picture will be grey. Why? Because the scanning will average the black and white inputs together for each reproduced line. So a pattern of 480 lines would not be discernible: the resolution must be less. The Kell factor measures by how much the vertical resolution is reduced relative to the number of scan lines, and it is usually assumed to be around 0.7 for a stationary pattern. This implies the following vertical resolution:
$$0.7 \times 480 = 336\ \text{lines}$$
The horizontal and vertical resolutions are therefore approximately equal. This was one of the design goals of NTSC.
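The grey-out effect behind the Kell factor can be demonstrated with a toy model. The sketch below assumes a deliberately simple scanning model of my own: each scan line reports the plain average of whatever part of the test pattern it covers. The function names are hypothetical.

```c
#include <assert.h>

/* Brightness (0 = black, 1 = white) of line n of a test pattern whose
 * lines alternate black and white and are exactly one scan line tall */
static double pattern_line(int n)
{
    return (n % 2 == 0) ? 0.0 : 1.0;
}

/* Value reproduced by scan line k when the pattern is displaced downward
 * by `phase` of a line height (0 <= phase < 1). The scan line then covers
 * (1 - phase) of pattern line k and phase of pattern line k + 1. */
double scanned_value(int k, double phase)
{
    assert(k >= 0 && phase >= 0.0 && phase < 1.0);
    return (1.0 - phase) * pattern_line(k) + phase * pattern_line(k + 1);
}

/* Contrast between two neighboring reproduced lines (1 = full, 0 = grey) */
double reproduced_contrast(double phase)
{
    double d = scanned_value(0, phase) - scanned_value(1, phase);
    return d < 0.0 ? -d : d;
}
```

With the pattern aligned to the scan lines (phase 0) the contrast is 1; at a quarter-line offset it drops to 0.5; and midway between scan lines (phase 0.5) every line comes out at 0.5, which is uniform grey.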
Note that the Kell factor is sometimes assumed to be a larger number for a pattern in motion because visual averaging will cause the eye to “ignore” the occasional grey or blurry pattern. Values as high as 0.9 are assumed for moving images.
Note further that the reason the Kell factor does not apply horizontally is that it is possible to put down the dots on a CRT so close that regardless of the horizontal phase of a 4.2 MHz vertical line pattern, it will be resolved. However, the Kell factor does apply in the horizontal direction for a digital display such as a computer monitor or LCD panel.
In a digital SDTV system based on the CCIR 601 digital sampling standard, the luminance information is sampled at 13.5 MHz. The number of samples per active line is therefore given by 13.5 × 52.8556 ≈ 713.55. This number is often rounded up to 720 samples; rounding up provides some headroom on either side of the visible line to hide the edge effects of various digital processing operations such as filtering. Plus, margin is generally good in any design!
Nyquist sampling theory says that sampling at 13.5 MHz would theoretically allow a horizontal frequency as high as 13.5 / 2 = 6.75 MHz to be captured. However, because of the inability to implement perfect anti-aliasing filters, this cannot be achieved in practice. A reduction factor of 0.75 is appropriate, implying that the CCIR-601 sampling standard is good for horizontal frequencies as high as 0.75 × 6.75 = 5.06 MHz. This is substantially better than old-fashioned analog NTSC broadcasts. Therefore, on a good monitor with analog component input, a CCIR-601 signal can achieve the following horizontal resolution:
$$2 \times 5.06\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 401\ \text{lines}$$
For a SIF resolution picture (360×240), we are effectively sampling at 6.75 MHz, not 13.5 MHz, so the highest horizontal frequency that can be reproduced is 0.75 × (6.75/2) = 2.53 MHz, which corresponds to a horizontal resolution of:
$$2 \times 2.53\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 201\ \text{lines}$$
Bear in mind that to achieve this number one needs to do an excellent job of filtering.
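The sampled-system numbers follow the same pattern as the analog case, so they are easy to wrap in a couple of helper functions. This is a sketch with names of my own choosing; the 0.75 filter factor and the 52.8556 µsec active line time are the values used in the text.

```c
#include <assert.h>

#define ACTIVE_LINE_US 52.8556   /* NTSC active line time, in microseconds */

/* Luminance samples that fall within one active line */
double samples_per_active_line(double sample_rate_mhz)
{
    assert(sample_rate_mhz > 0.0);
    return sample_rate_mhz * ACTIVE_LINE_US;
}

/* Highest usable frequency: the Nyquist limit derated by a filter factor */
double usable_bandwidth_mhz(double sample_rate_mhz, double filter_factor)
{
    assert(sample_rate_mhz > 0.0);
    assert(filter_factor > 0.0 && filter_factor <= 1.0);
    return filter_factor * sample_rate_mhz / 2.0;
}

/* Horizontal resolution in TV lines per picture height */
double sampled_horizontal_tvl(double sample_rate_mhz, double filter_factor)
{
    double bw = usable_bandwidth_mhz(sample_rate_mhz, filter_factor);
    return 2.0 * bw * ACTIVE_LINE_US * 3.0 / 4.0;
}
```

Because the code carries the unrounded bandwidth (5.0625 MHz rather than 5.06 MHz), its results differ slightly in the decimals from the hand calculation: about 401.4 lines at 13.5 MHz and about 200.7 lines at 6.75 MHz.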
In the case of displaying a SIF picture on a computer monitor, we can approximate the horizontal resolution by using a Kell factor in the horizontal direction. This yields the following:
$$360 \times 0.7 \times 0.75 = 189\ \text{lines}$$
In the above, the 0.7 is the Kell factor and the 0.75 is to account for the aspect ratio (remember, resolution is computed inside the circle). A value of 360 was used for purity’s sake. The MPEG world uses 352 and 704 because they are related by a factor of 2 and both are multiples of 16 (360 is not a multiple of 16). It also means there is a little less data to compress!
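Both points in that paragraph can be checked with two one-line functions. The names are hypothetical, and the 0.7 Kell factor is passed in as a parameter because it is an assumed value rather than a constant of nature.

```c
#include <assert.h>

/* Approximate horizontal resolution of a pixel display, in TV lines per
 * picture height, for a 4:3 picture */
double displayed_horizontal_tvl(int pixels_per_line, double kell_factor)
{
    assert(pixels_per_line > 0);
    assert(kell_factor > 0.0 && kell_factor <= 1.0);
    return pixels_per_line * kell_factor * 3.0 / 4.0;
}

/* Nonzero when a picture width is a whole number of 16-sample macroblocks */
int is_macroblock_aligned(int width)
{
    assert(width > 0);
    return width % 16 == 0;
}
```

A 360-pixel line gives 189 lines and a 352-pixel line gives about 185, so the MPEG-friendly width costs very little resolution. Of the three widths, 352 and 704 are macroblock aligned and 360 is not.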
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.