Thoughts on 3D After NAB

I just returned from this year’s NAB show, where I was bombarded with 3D demos in virtually every booth. Most of the factors driving this 3D superabundance originate outside the broadcast industry itself. First, TV manufacturers are hot on 3D as a way to get everyone who just bought an HDTV to upgrade to a new 3D-enabled display. Cinema owners like 3D because they can charge more for tickets. The Blu-ray consortium recently standardized a method for storing 3D video on BD discs and hopes to piggyback on the TV manufacturers’ efforts to drive sales of 3D Blu-ray players and finally displace DVDs. Hollywood has begun producing more 3D movies too, including the phenomenally successful Avatar (which will be released on 3D Blu-ray very soon). So naturally, the broadcast industry needs to be prepared to author and carry 3D signals.

Most of the demos I saw were nothing more than “here, put on these glasses” — in other words, “me too”-type demos. (Although after seeing all this 3D interest, I did wonder whether Philips regretted terminating their autostereoscopic display effort last year!)

Nevertheless, I did see two problems addressed that I found technically interesting: first, real-time 2D-to-3D conversion (see here and here), and second, automatic 3D quality monitoring.

I worked a lot on the problem of compressing stereo pairs as part of my Ph.D. research, and I also spent time thinking about 3D video quality assessment. However, I had never considered the problem of real-time 2D to 3D conversion, so the show got me thinking. It’s a pretty tricky problem!

Converting a 2D video stream to 3D can be partitioned into two fundamental steps: first, creating a depth map for each video image, and second, using that depth map to construct a second viewpoint. Although both steps are challenging, the first feels substantially harder to me.

With regard to the first step, a sequence of 2D video images must be analyzed to extract a depth map. Several special cases are worth discussing, but I’ll only mention two. First, consider the case where the camera is stationary and a 3D object moves through the field of view. The closer points on that object will have frame-to-frame pixel displacements that are larger than those for object points that are farther away. Therefore, one useful approach for deriving information for a depth map would be the following: a) segment the image into two regions, moving and stationary; b) segment the moving areas into distinct objects using various cues such as color and proximity; c) find distinct matching points on the moving objects in two different frames; d) determine depths for those matching points based on the measured point displacements; e) interpolate the depth map for non-matched moving-object points.
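To make that concrete, here is a rough sketch of how steps a), c), d), and e) might look in code. This is a minimal illustration assuming OpenCV, NumPy, and SciPy are available: simple frame differencing stands in for a real moving/stationary segmentation, step b) (grouping moving pixels into distinct objects) is skipped entirely, and the inverse mapping from displacement to depth is an ad-hoc placeholder rather than a calibrated model.

```python
import cv2
import numpy as np
from scipy.interpolate import griddata

def motion_depth_sketch(prev_gray, curr_gray, diff_thresh=15):
    """Rough depth map from two consecutive frames of a static-camera shot:
    segment moving pixels, match points on the moving regions, turn
    displacement into depth, then interpolate a dense map."""
    h, w = curr_gray.shape

    # (a) Moving vs. stationary: simple frame differencing as a stand-in.
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, moving = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)

    # (c) Matching points: ORB features matched between the two frames.
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return np.zeros((h, w), dtype=np.float32)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    # (d) Displacement -> depth: larger frame-to-frame displacement means closer.
    pts, depths = [], []
    for m in matches:
        x1, y1 = kp1[m.queryIdx].pt
        x2, y2 = kp2[m.trainIdx].pt
        if moving[int(y2), int(x2)]:               # keep moving-object points only
            disp = np.hypot(x2 - x1, y2 - y1)
            pts.append((x2, y2))
            depths.append(1.0 / (disp + 1e-3))     # ad-hoc inverse mapping
    if len(pts) < 3:
        return np.zeros((h, w), dtype=np.float32)

    # (e) Interpolate a dense depth map from the sparse matched points.
    gy, gx = np.mgrid[0:h, 0:w]
    dense = griddata(np.array(pts), np.array(depths), (gx, gy), method="nearest")
    return dense.astype(np.float32)
```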

As a second special case, consider the situation where nothing is moving in the video sequence for many frames in a row. In this case, occlusion becomes a major depth cue. If one object is in front of another, it will occlude the background object, and so it must be closer. If an image can be segmented into objects, and an occlusion map can be deduced, then different depths can be assigned to different objects based on where they lie in the occlusion map. Other cues that may be algorithmically exploitable could stem from perspective considerations applied to the edges of identified objects.
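For the occlusion case, one simple way to turn an occlusion map into relative depths is to treat each “A occludes B” relation as “A is closer than B” and layer the objects by a topological ordering. The sketch below is a toy illustration that assumes the segmentation and the pairwise occlusion relations have already been extracted by some other means.

```python
from collections import defaultdict, deque

def depth_order_from_occlusions(objects, occludes):
    """Assign relative depth ranks to segmented objects from pairwise
    occlusion relations: if A occludes B, A must be closer than B.

    objects  : iterable of object labels (assumed to cover every label used below)
    occludes : iterable of (front, back) pairs deduced from the occlusion map
    """
    indeg = {o: 0 for o in objects}
    ahead = defaultdict(set)          # front -> set of objects it occludes
    for front, back in occludes:
        if back not in ahead[front]:
            ahead[front].add(back)
            indeg[back] += 1

    # Objects occluded by nothing form the closest layer (rank 0);
    # each successive layer is one step farther away.  Inconsistent
    # (cyclic) occlusion evidence would leave some objects unranked.
    rank, layer = {}, 0
    queue = deque(o for o in indeg if indeg[o] == 0)
    while queue:
        next_queue = deque()
        for o in queue:
            rank[o] = layer
            for b in ahead[o]:
                indeg[b] -= 1
                if indeg[b] == 0:
                    next_queue.append(b)
        queue, layer = next_queue, layer + 1
    return rank   # e.g. {"person": 0, "car": 1, "building": 2}
```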

Many powerful depth cues will be hard to take advantage of algorithmically — although humans can exploit them easily — because they involve recognizing objects. For example, we can easily recognize two humans in a picture and determine whether they are adults or children. We know that if two adult males appear in the picture, and one appears substantially taller than the other (and isn’t holding a basketball), then the shorter one is farther away. I suspect that taking advantage of this sort of knowledge is beyond the capability of today’s real-time (and non-real-time!) processing. Nevertheless, I was amazed at how well the systems I saw at the show appeared to work.

With regard to the second step, given an image and a depth map, a second view can be created from the first by displacing each pixel of the original image by a disparity value corresponding to its depth. In practice, it won’t be that easy. Why? Because after displacing pixels by their appropriate disparities, gaps will appear in the new image. These gaps correspond to image detail that is visible in one view but not in the other, so they will need to be interpolated or otherwise synthesized in some reasonable way.
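Here is a minimal sketch of that pixel-shifting-plus-gap-filling step, assuming OpenCV and NumPy, a left image with a depth map normalized to [0, 1] (1 = nearest), and an arbitrary linear depth-to-disparity mapping; the inpainting call at the end is just one crude way of synthesizing the disoccluded gaps.

```python
import numpy as np
import cv2

def synthesize_second_view(left_bgr, depth, max_disparity=32):
    """Render a right-eye view from a left image plus a depth map by shifting
    each pixel horizontally according to its depth, then filling the gaps.
    Assumes depth is normalized to [0, 1] with 1 = nearest to the camera."""
    h, w = depth.shape
    right = np.zeros_like(left_bgr)
    filled = np.zeros((h, w), dtype=np.uint8)

    # Nearer pixels get larger disparities (an arbitrary linear mapping here).
    disparity = (depth * max_disparity).astype(np.int32)

    # Paint far pixels first so nearer pixels overwrite them on collisions.
    for d in range(max_disparity + 1):
        ys, xs = np.nonzero(disparity == d)
        new_xs = xs - d                      # shift left for a right-eye view
        ok = new_xs >= 0
        right[ys[ok], new_xs[ok]] = left_bgr[ys[ok], xs[ok]]
        filled[ys[ok], new_xs[ok]] = 255

    # The unfilled gaps are detail visible in one view but not the other;
    # here they are simply inpainted as one crude form of synthesis.
    holes = cv2.bitwise_not(filled)
    return cv2.inpaint(right, holes, 3, cv2.INPAINT_TELEA)
```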

With regard to 3D video quality assessment, I just want to interject a note of caution. I was encouraged to see that several vendors have made progress in developing systems that automatically approximate the “mean opinion scores” that subjective human evaluation tests would assign to various image sequences. However, when dealing with 3D video, the whole is greater than the sum of its parts. If the algorithm naively applies a 2D image quality assessment to the left and right pictures independently and then averages the two scores, the result is unlikely to correspond to a human’s subjective viewing experience. For those of you who wear glasses, like me, you can experience this directly if one of your eyes is better than the other. Take off your glasses and look at the world around you: you will see it with the resolution of your better eye, but you will still have stereoscopic vision. This effect will ultimately need to be taken into account in automatic systems that purport to algorithmically assess the quality of a 3D image sequence.
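To make the pitfall explicit, here is what that naive approach looks like in a toy sketch using plain PSNR as the 2D metric (purely for illustration; real systems use far more sophisticated models). If one view is heavily degraded while the other is pristine, this average predicts a mediocre picture, whereas a viewer fusing the two views may judge it considerably better, seeing largely with the “better eye.”

```python
import numpy as np

def psnr(ref, test):
    """Plain 2D PSNR between two 8-bit images of the same size."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def naive_stereo_score(ref_l, ref_r, test_l, test_r):
    """The naive approach criticized above: score each eye's view with a
    2D metric and average, ignoring how the two views fuse perceptually."""
    return 0.5 * (psnr(ref_l, test_l) + psnr(ref_r, test_r))
```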