Archive for the ‘Perk’ Category
If only we had better test content…
I just saw this news about research that says you notice compression artifacts less if you like the content of a particular video clip:
Using four studies, Kortum, along with co-author Marc Sullivan of AT&T Labs, showed 100 study participants 180 movie clips encoded at nine different levels, from 550 kilobits per second up to DVD quality. Participants viewed the two-minute clips and then were asked about the video quality of the clips and desirability of the movie content.
Kortum found a strong correlation between the desirability of movie content and subjective ratings of video quality.
(The original paper seems to be here, beyond a pay wall.)
Makes me wonder about the classic test footage with the calendar and the model train!
The Basics of 3D Image Acquisition
One of our clients is heavily involved in 3D video and has been for several years. However, several others are just now starting to think about it because of the uptick of interest in the consumer electronics world. Enough questions have been posed to us recently that it seemed worthwhile to pull together a few basic facts regarding 3D stereopair imaging and stereo disparity.
First, we need a simple model of a lens. Consider the diagram below:
In this picture, the long horizontal line that passes through the center of the lens is called the lens axis. The lens has the property that rays that pass through the center of the lens are undeviated. Therefore, the ray from the top of the tree, at a distance l to the left of the lens, passes straight through the center of the lens. (The tree has a height of h.) The lens also has the property that rays that arrive perpendicular to the lens are refracted to pass through the focal point of the lens. The focal point lies on the lens axis and is a distance f from the center of the lens. The intersection of these two rays shows where the image of the tree will be formed. You can see that the image of the tree is upside down, and has a new height h’. The image is formed a distance d to the right of the focal point.
By using similar triangles we see first that

$$\frac{h}{l} = \frac{h'}{f + d}$$

Using a different pair of similar triangles we also see that

$$\frac{h}{f} = \frac{h'}{d}$$

Solving the first equation above for h’, substituting the result into the second equation and simplifying, we derive the following relationship:

$$d = \frac{f^2}{l - f}$$
This is the fundamental equation of a simple lens. It shows that as the object gets further and further from the lens, i.e. as l increases, the distance of the image of the object from the focal plane decreases, i.e. d gets smaller. We can assume that the camera’s image sensor is located at a distance f from the lens, is perpendicular to the lens axis, and that all objects more than a certain distance away from the lens will be in focus. In other words, the image of all sufficiently distant objects will appear on the focal plane where the image sensor is located.
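To get a feel for how quickly d shrinks, here is a small Python sketch. It assumes the two similar-triangle relations are h/l = h'/(f + d) and h/f = h'/d, which combine to d = f²/(l − f). The function name and the 50 mm focal length are illustrative choices, not values from the discussion above.

```python
def image_offset(f, l):
    """Distance d of the image beyond the focal point for an object at distance l.

    Assumes the simple-lens relation d = f**2 / (l - f), with l > f and
    both lengths expressed in the same unit.
    """
    if l <= f:
        raise ValueError("object must be farther from the lens than f")
    return f * f / (l - f)


# Hypothetical 50 mm lens; object distances in millimetres.
for distance in (1_000, 10_000, 100_000):
    print(distance, image_offset(50.0, distance))
```

The offset falls toward zero as the object distance grows, which is why a sensor placed at the focal plane sees all sufficiently distant objects in focus.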
In the case of 3D video, two cameras are used to acquire a sequence of stereopair images, one from the left camera and one from the right. Different stereo geometries are possible, but the most common one is to place the two cameras horizontally apart from each other by a distance i, and to keep their focal planes coplanar. The diagram below illustrates this configuration:
The horizontal line at the bottom is the focal plane; it is clear from the diagram that the focal planes are coplanar. The lenses are a distance f from the focal plane and are separated by a distance of i from each other. We assume that a small object (or a point on a larger object) is located a distance l from the lens plane and a distance m to the right of the axis of the right lens. We want to know where the image of that object appears in the left and the right camera. In particular, we want to know if we overlaid the left image on top of the right image, how far apart would the images appear? Mathematically, we want to know the disparity, which we define to be

$$\rho = s_1 - s_2$$
where s1 and s2 are the distances from the image point to the intersection of the lens axis with the focal plane for the left and the right cameras respectively. Note that we are assuming that the object being imaged is far enough away that its image forms on the focal plane.
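Using our favorite trick of similar triangles we have the following two equations:

$$\frac{s_1}{f} = \frac{m + i}{l}$$

and

$$\frac{s_2}{f} = \frac{m}{l}$$

Solving the first equation for s1, the second equation for s2, taking the difference and simplifying yields

$$\rho = s_1 - s_2 = \frac{f\,i}{l}$$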
Although this expression was derived for an object to the right of the axis of the right camera, it is easy to show in a similar manner that it is also true for an object between the axes of the two cameras as well as for an object to the left of the axis of the left camera.
So what does this equation tell us? First, it says that for this particular camera geometry, the disparity is only a function of the separation between the two cameras, i, and the distance of the object from the lens plane, l. Second, the equation tells us that the disparity increases as we increase the separation between the cameras. Finally, it tells us that the disparity decreases as the object gets further away from the cameras, approaching zero for objects an infinite distance away. (You can see this when you watch 3D content without wearing the special 3D glasses: the “distant” objects can be seen clearly by the naked eye, whereas the near objects appear blurry, because their left and right images are offset by a larger value of ρ.)
It should be clear from this equation that if a stereopair is available, and corresponding points can be found in the left and right pictures, then the disparity between those points can be measured and the distance to the point can be computed.
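As a rough illustration, the Python sketch below assumes the disparity for this geometry works out to ρ = f·i/l, so that l = f·i/ρ. The function names and the sample numbers (a 50 mm focal length and a 65 mm camera separation) are hypothetical.

```python
def disparity(f, i, l):
    """Disparity on the focal plane for a point at distance l (assumes rho = f*i/l)."""
    return f * i / l


def distance_from_disparity(f, i, rho):
    """Invert the relation above: distance to a point from its measured disparity."""
    if rho <= 0:
        raise ValueError("zero disparity corresponds to a point at infinity")
    return f * i / rho


# Hypothetical rig: f = 50 mm, i = 65 mm; all lengths in millimetres.
rho = disparity(50.0, 65.0, 2_000.0)
print(rho, distance_from_disparity(50.0, 65.0, rho))
```

In practice the hard part is finding the corresponding points; once the disparity is known, the distance is a single division.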
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Thoughts on 3D after NAB
I just returned from this year’s NAB show, where I was bombarded with 3D demos in virtually every booth. Most of the factors driving this 3D superabundance originate outside of the broadcast industry itself. First, TV manufacturers are hot on 3D as a way to get everyone who just bought an HDTV to upgrade to a new 3D enabled display. Cinema owners like 3D because they can charge more for the tickets. The Blu-ray consortium recently standardized a method for storing 3D video on BD discs, and hopes to enable and piggyback on the efforts of the TV manufacturers as a way to drive sales of 3D Blu-ray players and finally displace DVDs. Hollywood has begun producing more 3D movies too, including the phenomenally successful Avatar (which will be released on 3D Blu-ray very soon). So naturally the broadcast industry needs to be prepared to author and carry 3D signals.
Most of the demos I saw were nothing more than “here, put on these glasses”…in other words, “me too” type demos. (Although after seeing all this 3D interest I did wonder if Philips regretted terminating their autostereoscopic display effort last year!)
Nevertheless, I did see two problems addressed that I found technically interesting. First, real-time 2D to 3D conversion (see here and here), and second, automatic 3D quality monitoring.
I worked a lot on the problem of compressing stereopairs as part of my Ph.D. research, and I also spent time thinking about 3D video quality assessment. However, I had never considered the problem of real-time 2D to 3D conversion, so the show got me thinking. It’s a pretty tricky problem!
Converting a 2D video stream to 3D can be partitioned into two fundamental steps. First, creating a depth map for each video image, and second, using the depth map to construct a second viewpoint. Although both steps are challenging, the first step feels substantially harder to me.
With regards to the first step, a sequence of 2D video images must be analyzed to extract a depth map. Several special cases are worth discussing, but I’ll only mention two. First, consider the case where the camera is stationary and a 3D object moves through the field of view. The closer points on that object will have frame-to-frame pixel displacements that are larger than those for object points that are further away. Therefore, one useful approach for deriving information for a depth map would be the following: a) segment the image into two regions: moving and stationary; b) segment the moving areas into distinct objects using various clues such as color and proximity; c) find distinct matching points on the moving objects in two different frames; d) determine depths for those matching points based on the measured point displacements; e) interpolate the depth map for non-matched moving object points.
As a second special case, consider the situation where nothing is moving in the video sequence for many frames in a row. In this case, occlusion becomes a major depth cue. If one object is in front of another, then it will occlude the background object, and it must be closer. If an image can be segmented into objects, and an occlusion map can be deduced, then different depths can be assigned to different objects based on where they lie in the occlusion map. Other clues that may be algorithmically exploitable could stem from perspective considerations applied to the edges of identified objects.
Many powerful depth clues will be hard to take advantage of algorithmically—although humans can exploit them easily—because they involve recognizing objects. For example, we can easily recognize two humans in a picture, and determine whether or not they are adults or children. We know that if two adult males appear in the picture, and one appears substantially taller than the other (and isn’t holding a basketball), then the shorter one is further away. I suspect that taking advantage of this sort of knowledge is beyond the capability of today’s real-time (and non real-time!) processing. Nevertheless, I was amazed at how well the systems I saw at the show appeared to work.
With regards to the second step, given an image and a depth map, a second view can be created from the first by displacing each pixel of the original image with a disparity value corresponding to its depth. In practice it won’t be that easy. Why? Because after displacing pixels with their appropriate disparities, gaps will appear in the new image. These gaps result from image detail that is visible in one image but not in the other, so the gaps will need to be interpolated or otherwise synthesized in some reasonable way.
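The gap problem is easy to see on a single scanline. The Python sketch below is a toy illustration under several simplifying assumptions: integer disparities, a new viewpoint that shifts nearer pixels to the left, and a crude fill that copies the neighboring background pixel. The function names are hypothetical, and real systems would use far more careful synthesis.

```python
def synthesize_view(row, disparity):
    """Shift each pixel of one scanline left by its integer disparity.

    Returns the new scanline, with None marking gaps that no pixel landed on.
    """
    out = [None] * len(row)
    winner = [-1] * len(row)            # disparity of the pixel currently stored
    for x, (pixel, d) in enumerate(zip(row, disparity)):
        target = x - d                  # nearer pixels (larger d) move farther
        if 0 <= target < len(row) and d > winner[target]:
            out[target] = pixel         # a nearer pixel occludes a farther one
            winner[target] = d
    return out


def fill_gaps(row):
    """Fill each gap from the valid pixel on its right, else from its left."""
    filled = list(row)
    for x in range(len(filled) - 2, -1, -1):   # sweep right to left
        if filled[x] is None:
            filled[x] = filled[x + 1]
    for x in range(1, len(filled)):            # leftover gaps at the right edge
        if filled[x] is None:
            filled[x] = filled[x - 1]
    return filled


scanline = [1, 2, 3, 4, 5, 6, 7, 8]
disparities = [0, 0, 0, 2, 2, 0, 0, 0]         # pixels 4 and 5 belong to a near object
shifted = synthesize_view(scanline, disparities)
print(shifted)
print(fill_gaps(shifted))
```

The near object moves two pixels to the left, covering background there and uncovering two positions for which the original image holds no data; those are the gaps that have to be synthesized.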
With regards to 3D video quality assessment, I just want to interject a note of caution. I was encouraged to see that several vendors have made progress in developing systems that automatically approximate the “mean opinion scores” that subjective human evaluation tests would assign to various image sequences. However, when dealing with 3D video, the whole is greater than the sum of its parts. If the algorithmic approach implemented is to naively apply 2D image quality assessment to the left and right pictures independently, and then average the scores together, the result is likely to not correspond at all to a human’s subjective viewing experience. For those of you who wear glasses, like me, you can experience this directly if one of your eyes is better than the other. Take off your glasses, look at the world around you, and you will see it with the resolution of your better eye; but you will still have stereoscopic vision. This effect will ultimately need to be taken into account in automatic systems that purport to algorithmically assess the quality of a 3D image sequence.
Delta Sigma Converters: Filtering, Decimation, and Simulations
In my first post on ΔΣ converters I presented an intuitive way to derive the modulator portion of the converter. Now we need to look at what comes after the modulator—namely, the digital filter and the decimator. The high-level structure of the converter looks like this:
The analog input voltage, v(t), is assumed to be scaled so that its range is in the interval [-1, 1]. The output of the modulator is a sequence of uniform width pulses with a magnitude of either ‑1 or 1. These pulses are generated at an “oversampled” rate—in other words, at a rate greater than the Nyquist rate. The oversampling is by some significant factor, for example, 64. Let the highest frequency present in the signal be B; the Nyquist rate is then 2B, and the output rate of the modulator for an oversampling factor of 64 is 128B.
We wish to reduce the sample rate to the Nyquist rate while extracting the signal v(t) from the pulses. We do this by lowpass filtering the pulse train to a bandwidth of B and then sampling the filter’s output at the rate 2B. Consider an FIR lowpass filter. Because the pulse sequence only has 1s and -1s in it, the FIR filter can be implemented with additions and subtractions only. Furthermore, although the 1s and -1s are shifted into the filter at the oversampled rate, the filter’s output only needs to be computed at the Nyquist rate, which is nice from a computational perspective because we only need to compute the samples we’re going to keep!
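A minimal Python sketch of that idea follows. The function name, the boxcar taps, and the tiny decimation factor are illustrative assumptions, not the filter used in the simulations. The point is only that each ±1 pulse turns a multiply into an add or a subtract, and that outputs are computed only at the instants that are kept.

```python
def decimating_fir(pulses, taps, factor):
    """Filter a +1/-1 pulse train and keep one output per `factor` pulses."""
    outputs = []
    # Step through the pulse train `factor` pulses at a time, so outputs
    # that would be thrown away by the decimator are never computed.
    for end in range(len(taps) - 1, len(pulses), factor):
        acc = 0.0
        for j, h in enumerate(taps):
            if pulses[end - j] > 0:
                acc += h          # pulse is +1: add the tap
            else:
                acc -= h          # pulse is -1: subtract the tap
        outputs.append(acc)
    return outputs


# Hypothetical 4-tap boxcar lowpass filter and decimation by 4.
taps = [0.25, 0.25, 0.25, 0.25]
print(decimating_fir([1, 1, 1, -1, 1, -1, 1, -1], taps, 4))
```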
To make this all clear, let’s trace the flow of a simple input signal through the system in both the time and frequency domain. For this example, I’ll let B be 4KHz, which corresponds to telephone quality audio. Let’s assume an oversampling factor of 64. Then pulses are output from the modulator at the rate 2 × 4000 × 64 = 512 kilobits/sec (bearing in mind that we convert the sequence of -1s and 1s into an equivalent sequence of 0s and 1s, so each pulse is captured as a single bit). As an input signal we’ll use the sum of two cosines, one at the low end of the pass band and one at the high end (730 Hz and 3.765 KHz respectively). Both cosines have the same amplitude, 0.5, so their sum is within the [-1,1] range required. The input signal looks like this:
The spectrum of this signal computed over a 4-second interval is shown below. Here the x-axis is in KHz and the y-axis is in dB relative to the highest power frequency in the signal.
As expected, the input signal has two well-defined peaks.
The sequence of pulses output from the ΔΣ modulator when this signal is input looks like this:
In intervals where the input signal is large, many 1s are output, while in intervals where the input signal is close to zero, 1s and -1s occur with a nearly equal frequency. Since we propose to recover the input signal by lowpass filtering this pulse stream, it is interesting to look at the spectrum of the pulse stream. In the range of 0 to 15 KHz it looks like this:
The original signal spectrum is clearly visible, but there is lots of noise above 4 KHz. In fact, if we plot all the way out to 256 KHz (1/2 the oversampled rate of 512 KHz) we see the following:
The 1-bit sampling has introduced a huge amount of quantization noise, and it increases with frequency. Were we to subsample the signal at an 8 KHz rate without first lowpass filtering, all of this noise would alias into the 4 KHz band we care about! To do the lowpass filtering I used a 1000-tap FIR filter in my simulation; I wanted to show the effect of a very good filter. After filtering, the spectrum looks like this:
Most of the out-of-band quantization noise has now been eliminated. After subsampling the signal by a factor of 64, we are down to the Nyquist rate. Any quantization noise remaining after the LPF has now been aliased into our frequency band of interest. The spectrum of the Nyquist sampled signal over the range 0 to 4 KHz is shown below; the two cosine signals have been converted with very little distortion.
As a final note, it’s worth showing the spectrum of the modulator’s output pulse train when the oversampling factor is only 4 instead of 64. As seen below (the graph covers the range 0 to 15 KHz), there is a lot more in-band noise (i.e. in the range 0 – 4 KHz) than occurs with 64x oversampling. Not surprisingly, higher sampling rates lead to better performance!
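The same trend shows up in a much cruder experiment than the one plotted above. The Python sketch below is a simplification under stated assumptions: a constant input instead of the two cosines, a first-order modulator, and a plain block average standing in for the lowpass filter. The function names are hypothetical.

```python
def modulate_constant(v, num_pulses):
    """First-order delta-sigma modulator driven by a constant input v in [-1, 1]."""
    pulses, y, previous = [], 0.0, 1       # the fed-back pulse starts at 1
    for _ in range(num_pulses):
        y += v - previous                  # accumulate input minus fed-back pulse
        previous = 1 if y > 0 else -1      # 1-bit quantizer
        pulses.append(previous)
    return pulses


def worst_case_error(v, oversampling, num_outputs=200):
    """Largest error after averaging each block of `oversampling` pulses."""
    pulses = modulate_constant(v, oversampling * num_outputs)
    worst = 0.0
    for start in range(0, len(pulses), oversampling):
        block = pulses[start:start + oversampling]
        worst = max(worst, abs(sum(block) / oversampling - v))
    return worst


for factor in (4, 64):
    print(factor, worst_case_error(0.3, factor))
```

With only 4 pulses per output sample, the block average can only land on a multiple of 0.5, so an input of 0.3 is always off by at least 0.2; with 64 pulses per output the average sits much closer to the input.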
Here is the code I used for these simulations.
Delta Sigma Converters: Modulation
The web is filled with introductions to Delta Sigma modulation (also sometimes referred to as Sigma Delta modulation) in the context of Delta Sigma converters. Unfortunately, the ones I’ve looked at fail to intuitively motivate how the modulator works. Therefore, my goal in this post is to show how the structure of a first-order ΔΣ modulator can be simply understood. In particular, I show how its structure can be derived from a trivial 10-line C program that performs the ΔΣ modulator function.
The basic problem solved by a ΔΣ modulator is this: given a fixed input value ν that lies in the range [-1, 1], the modulator outputs a pulse train satisfying the following criteria:
- Each pulse has the same fixed width τ
- Each pulse has an amplitude of -1 or 1 only
- The time average value of the pulse train emitted converges to ν
To derive from first principles a system diagram to perform this task, I find it useful to proceed as follows: First, write down a few equations to make precise what we’re after; next, write a computer program that emits a pulse train satisfying the equations; and finally, draw a block diagram that realizes the computer program.
So let’s get started. First, a waveform f(t) has an average value on the interval [0,T] of

$$\bar{f} = \frac{1}{T}\int_0^T f(t)\,dt$$

Let l(i) be the magnitude of the pulse of duration τ emitted in the interval [iτ, (i + 1)τ]; l(i) must be 1 or -1. The average value of the pulse train waveform f(t) corresponding to a sequence of N pulses is given by

$$\bar{f} = \frac{1}{N\tau}\sum_{i=0}^{N-1} l(i)\,\tau = \frac{1}{N}\sum_{i=0}^{N-1} l(i) = \frac{S(N)}{N}$$
So we don’t need to worry about τ, which is nice! Here we have defined S(N) to be the sum of the l(i) values from 0 to N – 1, so S(1) = l(0), S(2) = l(0) + l(1), and so on.
How can we write a computer program to decide whether the next pulse to output should be a 1 or a -1? I think the first algorithm most of us would think of is the following: “keep track of the average value, S(N) / N, at each instant. If the current average is less than ν, output a 1 next; otherwise output a -1.” I will skip a formal proof that the pulse train emitted by this algorithm converges as desired, but it is intuitively reasonable that it does, because each additional output pulse moves the average up or down by a smaller amount than the previously emitted pulse, and each pulse nudges the average in the desired direction.
It’s straightforward to write a computer program that accomplishes the above. In C-like pseudo code the program looks like this:
    N = 1;
    while (1) {
        S_N = compute_current_S_N();   /* running sum of the pulses emitted so far */
        if ((S_N / N) < v)             /* real-valued average is below the target */
            output(1);
        else
            output(-1);
        N++;
    }
This can be equivalently written as
    N = 1;
    while (1) {
        S_N = compute_current_S_N();   /* running sum of the pulses emitted so far */
        if (N * v - S_N > 0)           /* same test as above, without the division */
            output(1);
        else
            output(-1);
        N++;
    }
Now it’s time to draw a system diagram to implement this algorithm. Clearly it will contain some sort of feedback, because the pulse value to output next depends on the past sequence of output values. There will also be a need to compute a difference, and somehow we will need to accumulate the values S(N) and Nν. Consider the system diagram below:
Let’s analyze this diagram to see if it does the right thing. I’ve labeled the input ν and the feedback line l. With respect to the feedback line, the box with a z in it means “delay by one time instant”; the indexes of l on either side of this box reflect that delay. The box with a plus sign in it is an instantaneous summer, which has been configured in this case to take the difference between its two inputs. The large box is a quantizer, and the graph within it shows the input/output transfer function q( ). The graph means that any input value greater than 0 will result in an output value of 1, while any input value less than 0 will result in an output value of -1. Finally, the box with a capital sigma (Σ) in it is an accumulator, also known as an integrator. Its output is the sum of all its previous inputs.
To see if this system diagram results in the required stream of pulses l(i), we need to analyze the variable y. This is most easily done by creating a table as shown below:
| N | l(N-1) | y(N) | l(N) |
|---|---|---|---|
| 0 | N/A | y(0) is initialized to 0 | l(0) is initialized to 1 |
| 1 | l(0) = 1 | y(1) = ν – l(0) | l(1) = q( y(1) ) |
| 2 | l(1) | y(2) = (ν – l(0)) + (ν – l(1)) = 2ν – l(0) – l(1) | l(2) = q( y(2) ) |
| 3 | l(2) | y(3) = 3ν – l(0) – l(1) – l(2) = 3ν – S(3) | l(3) = q( y(3) ) |
| … | …and in the general case | | |
| N | l(N-1) | y(N) = Nν – S(N) | l(N) = q( y(N) ) |
The quantizer is executing the “if then” portion of the computer program, and the summer and accumulator are computing the running sum needed as the argument for the if function (i.e., as the input to the quantizer). This system will do the trick!
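The table can also be checked numerically. The Python sketch below steps through the block diagram for a constant input, using the same initial values as the table (y(0) = 0 and l(0) = 1). The function name is hypothetical, and treating an accumulator value of exactly 0 as "output -1" is an assumption, since the quantizer graph does not say what happens at 0.

```python
def first_order_modulator(v, num_pulses):
    """Run the block diagram for a constant input v in [-1, 1].

    Returns the pulses l(0..num_pulses-1) and the accumulator
    values y(0..num_pulses-1).
    """
    pulses = [1]      # l(0) is initialized to 1
    ys = [0.0]        # y(0) is initialized to 0
    for _ in range(1, num_pulses):
        y = ys[-1] + (v - pulses[-1])        # summer feeding the accumulator
        ys.append(y)
        pulses.append(1 if y > 0 else -1)    # quantizer q(); 0 maps to -1 here
    return pulses, ys


pulses, ys = first_order_modulator(0.3, 1000)
print(pulses[:12])
print(sum(pulses) / len(pulses))             # time average of the pulse train
```

Because the accumulator value y(N) = Nν – S(N) stays bounded, the average S(N)/N can differ from ν by no more than a bounded amount divided by N, which is why the pulse train converges.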
This diagram is often written in a slightly different form to more closely model its realization in hardware as follows:
All that is going on here is that the quantizer in the first diagram is replaced with the combination of an analog-to-digital converter (ADC) and a digital-to-analog converter (DAC). The ADC outputs a 1 if its input is greater than 0 and a 0 if its input is less than zero, so a bitstream is created at the indicated point. The DAC converts the bitstream back to a sequence of -1 and 1 values. Finally, to indicate the analog nature of this modulator when used as part of a ΔΣ converter, the accumulator is replaced by an integrator:
Now, assume the ΔΣ modulator above is used as part of an analog-to-digital converter. If the input signal changes slowly relative to the rate at which bits are produced, then many bits will be generated before the signal changes significantly, and the average value the bits encode will be close to the true value. In other words, the 1s and -1s the bitstream represents have time to “average out” to an accurate representation of the input signal. Of course, this average still has to be computed numerically after the modulator; more on that later. However, the faster the input signal changes, the fewer bits will be generated before the signal has changed significantly, and the less accurately the bitstream will represent the signal. At an intuitive level, this is why the quantization noise of ΔΣ converters increases as a function of the input frequency of the signal.
When used as part of an analog-to-digital converter, the modulator is followed by a digital filter and a decimator. The digital filter computes the averages encoded in the bitstream; the decimator reduces the sample rate from the high rate of the 1-bit ADC to the Nyquist rate associated with the final multi-bit samples. The binary representation of the decimated samples is used as the AD output bits.
Part II discusses filtering and decimation in greater detail, and presents some simulation results.
The Math Behind Analog Video Resolution
The world is moving in the direction of HDTV, but NTSC “standard def” signals are still common for many reasons and will remain so. One important reason is that cameras that output NTSC are widely available and cheap! Many applications, including a lot of security applications, simply don’t require the resolution of HDTV…and don’t want to incur the camera cost and bandwidth hit it requires.
So, what is resolution anyway, especially with regards to an analog NTSC video signal? Analog video cameras, especially in the CCTV industry, are sold using “horizontal TV lines”, or HTVL, as one of their key specifications. Unfortunately, the math behind that concept is not well understood.
To understand resolution we start with the aspect ratio. The aspect ratio of a picture is the ratio of its width to its height. Different aspect ratios are in use today for different applications. HDTV has an aspect ratio of 16:9. Standard definition TV has an aspect ratio of 4:3. In general, resolution is measured in a circle whose diameter is equivalent to a picture’s smallest dimension. The diagram below illustrates the case for NTSC:

In the above diagram, the circle has a diameter of 3 units or 1 “picture height”.
Now, imagine a uniformly spaced sequence of vertical black lines of constant width. The white space between the black lines should be the same width as the black lines themselves. Counting both black and white lines, how many lines can be physically resolved within the above circle? The answer to this question is the horizontal resolution. Before we can go further, we need some facts regarding NTSC:
First, there are 525 scan lines per picture, and the horizontal scanning frequency is
$$f_H = 525 \times \frac{30000}{1001}\ \text{Hz} \approx 15{,}734.26573\ \text{Hz}$$
This is one of NTSC’s magic numbers. The horizontal line time is therefore approximately 1 / 15,734.26573, or 63.5556 µsec.
Second, the horizontal blanking period is 10.7 µsec. During this period a horizontal sync pulse is transmitted, as well as a chroma burst (to enable decoders to demodulate the correct color), and a reference black level. The active line time—the period during which information is actually being drawn on the visible screen—is therefore 63.5556 – 10.7, or 52.8556 µsec.
Finally, the highest broadcast luminance signal is 4.2 MHz.
Based on the above, we can compute the highest horizontal resolution that can be present in an NTSC signal as follows:

$$2 \times 4.2\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 333\ \text{lines}$$
The product of the middle two parameters is the number of complete cycles present in one active line (the MHz and microseconds cancel); the factor of 2 is present because we count both the white and black lines in the horizontal resolution calculation. Multiplying by three-fourths takes into account the circle in which the horizontal resolution is defined. Bear in mind that this pattern would be displayed on an NTSC TV as grey, not as a crisp sequence of black and white lines, due to the rolloff of the various filters used to limit the video bandwidth.
In the vertical direction, resolution is limited by the number of scan lines. There are 480 scan lines in the visible area of a picture, so one would be tempted to assert that the vertical resolution is 480. However, imagine a uniformly spaced sequence of horizontal lines analogous to the vertical line pattern described above. We want this horizontal pattern to be discernible regardless of its relative relationship to the scanning lines. In other words, as the pattern is displaced vertically, the number of lines should still be easily resolved.
Imagine that we have a pattern of 480 horizontal lines, alternating black and white. When these lines are exactly midway between the scanning lines, the resulting picture will be grey. Why? Because the scanning will average the black and white inputs together for each reproduced line. So a pattern of 480 lines would not be discernible: the resolution must be less. The Kell factor measures by how much the vertical resolution is reduced relative to the number of scan lines, and it is usually assumed to be around 0.7 for a stationary pattern. This implies the following vertical resolution:

$$480 \times 0.7 = 336\ \text{lines}$$
The horizontal and vertical resolutions are therefore approximately equal. This was one of the design goals of NTSC.
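The arithmetic can be checked with a few lines of Python. This sketch assumes the horizontal figure is computed as 2 × bandwidth × active line time × 3/4 and the vertical figure as visible lines × Kell factor, as described above; the function and constant names are illustrative.

```python
LINE_TIME_US = 63.5556                          # horizontal line time, microseconds
BLANKING_US = 10.7                              # horizontal blanking, microseconds
ACTIVE_LINE_US = LINE_TIME_US - BLANKING_US     # visible part of each line


def horizontal_resolution(bandwidth_mhz, active_line_us=ACTIVE_LINE_US):
    """TV lines per picture height for a given luminance bandwidth."""
    cycles_per_line = bandwidth_mhz * active_line_us   # MHz x microseconds = cycles
    return 2 * cycles_per_line * 3 / 4                 # 2 lines per cycle, 4:3 picture


def vertical_resolution(visible_lines=480, kell_factor=0.7):
    """Resolvable horizontal lines after applying the Kell factor."""
    return visible_lines * kell_factor


print(round(horizontal_resolution(4.2)))    # broadcast NTSC luminance bandwidth
print(round(vertical_resolution()))
```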
Note that the Kell factor is sometimes assumed to be a larger number for a pattern in motion because visual averaging will cause the eye to “ignore” the occasional grey or blurry pattern. Values as high as 0.9 are assumed for moving images.
Note further that the reason the Kell factor does not apply horizontally is that it is possible to put down the dots on a CRT so close that regardless of the horizontal phase of a 4.2 MHz vertical line pattern, it will be resolved. However, the Kell factor does apply in the horizontal direction for a digital display such as a computer monitor or LCD panel.
In a digital SDTV system based on the CCIR 601 digital sampling standard, the luminance information is sampled at 13.5 MHz. The number of samples per active line is therefore given by 13.5 × 52.8556 = 713.55. This number is often rounded up to 720 samples; rounding up provides some headroom on either side of the visible line to hide the edge effects of various digital processing operations such as filtering. Plus, margin is generally good in any design!
Nyquist sampling theory says that sampling at 13.5 MHz would theoretically allow a horizontal frequency as high as 13.5 / 2 = 6.75 MHz to be captured. However, because of the inability to implement perfect anti-aliasing filters, this cannot be achieved in practice. A reduction factor of 0.75 is appropriate, implying that the CCIR-601 sampling standard is good for horizontal frequencies as high as 0.75 × 6.75 = 5.06 MHz. This is substantially better than old-fashioned analog NTSC broadcasts. Therefore, on a good monitor with analog component input, a CCIR-601 signal can achieve the following horizontal resolution:

$$2 \times 5.06\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 401\ \text{lines}$$
For a SIF resolution picture (360×240), we are effectively sampling at 6.75 MHz, not 13.5 MHz, so the highest horizontal frequency that can be reproduced is 0.75 × (6.75/2) = 2.53 MHz, which corresponds to a horizontal resolution of:

$$2 \times 2.53\ \text{MHz} \times 52.8556\ \mu\text{s} \times \frac{3}{4} \approx 201\ \text{lines}$$
Bear in mind that to achieve this number one needs to do an excellent job of filtering.
In the case of displaying a SIF picture on a computer monitor, we can approximate the horizontal resolution by using a Kell factor in the horizontal direction. This yields the following:

$$360 \times 0.7 \times 0.75 = 189\ \text{lines}$$
In the above, the 0.7 is the Kell factor and the 0.75 is to account for the aspect ratio (remember, resolution is computed inside the circle). A value of 360 was used for purity’s sake. The MPEG world uses 352 and 704 because they are related by a factor of 2 and both are multiples of 16 (360 is not a multiple of 16). It also means there is a little less data to compress!
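For reference, the digital figures above can be reproduced with a short Python sketch. It assumes the same formula as the analog case (2 × bandwidth × active line time × 3/4) and the 0.75 filter derating described above; the function names are illustrative.

```python
ACTIVE_LINE_US = 63.5556 - 10.7     # active line time, microseconds


def usable_bandwidth_mhz(sample_rate_mhz, filter_factor=0.75):
    """Nyquist frequency derated for imperfect anti-aliasing filters."""
    return filter_factor * sample_rate_mhz / 2


def horizontal_resolution(bandwidth_mhz):
    """TV lines per picture height for a given luminance bandwidth."""
    return 2 * bandwidth_mhz * ACTIVE_LINE_US * 3 / 4


samples_per_line = 13.5 * ACTIVE_LINE_US
ccir601 = horizontal_resolution(usable_bandwidth_mhz(13.5))
sif = horizontal_resolution(usable_bandwidth_mhz(6.75))
sif_on_monitor = 360 * 0.7 * 3 / 4   # Kell factor applied horizontally

print(round(samples_per_line, 2), round(ccir601), round(sif), round(sif_on_monitor))
```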
Noise Floor
As shown in a previous post, for samples taken from a zero mean i.i.d. noise signal, the expected power of the k’th DFT coefficient is given by

$$E[P_k] = \frac{\sigma^2}{N R}$$

where R is the resistance across which the samples are collected.

As discussed in that post, when plotting the power of the k’th coefficient as a function of frequency, the “noise floor” will decrease as N increases…even though the noise power is the same regardless of the value of N chosen.

In situations where it is desirable to avoid this dependency, we can use the DFT to approximate the power spectral density. To see how this is done, consider that the frequency spacing Δf between DFT coefficients depends on the sampling frequency, Fs, and the number of samples N as follows:

$$\Delta f = \frac{F_s}{N}$$

The power of the k’th coefficient represents the power of all the energy in a bin of width Δf. Therefore, the power spectral density in Watts/Hz at the frequency corresponding to the k’th coefficient is given by the following ratio

$$\frac{E[P_k]}{\Delta f} = \frac{\sigma^2 / (N R)}{F_s / N} = \frac{\sigma^2}{R\,F_s}$$

Note that this value is independent of N. Furthermore, the DFT gives a useful representation of the spectrum over the range

$$-\frac{F_s}{2} \le f \le \frac{F_s}{2}$$

…so we can get the total power of the signal by multiplying the bandwidth Fs times the power spectral density to get, as expected,

$$F_s \times \frac{\sigma^2}{R\,F_s} = \frac{\sigma^2}{R}$$

All of this means that defining the noise floor to be

$$\frac{\sigma^2}{R\,F_s}\ \text{Watts/Hz}$$
is a good approach. The noise variance is easily measured in the time domain in most systems by just collecting a large number of samples with no input signal applied and then computing the variance of the samples. The sampling rate and the resistor at which the noise samples are collected are normally fixed.
Furthermore, when analyzing data signals in the frequency domain, it should be clear that a spectrum program that plots Watts/Hz is a good choice. The noise floor will appear constant from plot to plot regardless of how many data samples are transformed, and signals will rise above and fall below the noise floor depending on their strength.
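A few lines of Python make the difference between the two kinds of plot concrete. The sketch assumes the expected per-coefficient power has the form σ²/(N·R) and that the bin width is Fs/N; the function names and the sample numbers are hypothetical.

```python
def expected_bin_power(variance, num_samples, resistance=1.0):
    """Expected power of one DFT coefficient (assumed form: sigma^2 / (N * R))."""
    return variance / (num_samples * resistance)


def bin_width_hz(sample_rate_hz, num_samples):
    """Frequency spacing between adjacent DFT coefficients."""
    return sample_rate_hz / num_samples


def noise_floor_w_per_hz(variance, sample_rate_hz, num_samples, resistance=1.0):
    """Power spectral density: per-bin power divided by the bin width."""
    return (expected_bin_power(variance, num_samples, resistance)
            / bin_width_hz(sample_rate_hz, num_samples))


# Hypothetical setup: variance of 1e-6 V^2 across 50 ohms, sampled at 48 kHz.
for n in (256, 4096, 65536):
    print(n,
          expected_bin_power(1e-6, n, 50.0),
          noise_floor_w_per_hz(1e-6, 48_000.0, n, 50.0))
```

The per-coefficient power falls as N grows, while the Watts/Hz figure is the same for every N.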
Power of a Zero Mean Noise Signal
Part two in a three-part blog series. Read the first entry on power measurements with the DFT.
The concept of the “noise floor” arises repeatedly when looking for signals in the presence of noise. Intuitively, signals disappear when they fall below the noise floor. But Care (yes, that’s a capital C!) must be taken when using this concept to make sure that everyone means the same thing by “noise floor.”
As a stepping stone to a deeper discussion, we need to look at the power of a noise signal in both the time and the frequency domain. Let the samples of a noise signal be Xn (n = 0, …, N−1), and assume the samples are a sequence of i.i.d. zero-mean random variables (RVs) with variance σ². This means that the Xk values (k = 0, …, N−1), the DFT coefficients of the noise samples, are also RVs. What can we say about the distribution of the Xk values?
Recall that by the definition of the DFT,
$$X_k = \sum_{n=0}^{N-1} X_n \, e^{-j 2\pi k n / N}$$
Taking the expected value of both sides, we see that:
$$E[X_k] = \sum_{n=0}^{N-1} E[X_n] \, e^{-j 2\pi k n / N} = 0$$
This is true because the Xn themselves are zero mean by assumption.
Now, recall from a previous post that the power of the k’th coefficient is given by
$$P_k = \frac{|X_k|^2}{N^2}$$
What can we say about $E\left[|X_k|^2\right]$?
Using the fact that the Xn are independent of each other, as well as being zero mean, it is straightforward to show that
$$E\left[|X_k|^2\right] = N\sigma^2$$
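For readers who want the intermediate step, here is a sketch of the calculation. It assumes real-valued samples and the unnormalized DFT convention $X_k = \sum_n X_n e^{-j 2\pi k n / N}$, which is an assumption about the convention in use; the second index $m$ is introduced only to write the squared magnitude as a double sum over the time-domain samples.

```latex
\begin{aligned}
E\left[|X_k|^2\right]
  &= E\left[\sum_{n=0}^{N-1}\sum_{m=0}^{N-1} X_n X_m \, e^{-j 2\pi k (n-m)/N}\right] \\
  &= \sum_{n=0}^{N-1}\sum_{m=0}^{N-1} E[X_n X_m] \, e^{-j 2\pi k (n-m)/N} \\
  &= \sum_{n=0}^{N-1} \sigma^2 \\
  &= N\sigma^2
\end{aligned}
```

The cross terms vanish because, for n ≠ m, independence and zero mean give E[X_n X_m] = E[X_n]E[X_m] = 0, while each of the N diagonal terms contributes E[X_n²] = σ².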
So we see from substitution that
$$E[P_k] = \frac{E\left[|X_k|^2\right]}{N^2} = \frac{\sigma^2}{N}$$
Furthermore, the expected power of the noise signal is given by
$$\sum_{k=0}^{N-1} E[P_k] = \sum_{k=0}^{N-1} \frac{\sigma^2}{N}$$
which simplifies to
$$\sum_{k=0}^{N-1} E[P_k] = \sigma^2$$
This is exactly what we would expect from a time domain calculation of the noise power, which looks like this
$$\frac{1}{N}\sum_{n=0}^{N-1} E\left[X_n^2\right] = \frac{1}{N} \cdot N\sigma^2 = \sigma^2$$
At this point we can clearly see the problem with defining the frequency domain noise floor. Although both the time domain and the frequency domain calculations yield the correct average power, the expected power of the k’th DFT coefficient depends on the number of time domain samples, N.
Consider the following thought experiment. Generate a sequence of noise samples with the required statistics, take their DFT, and plot the power of the k’th coefficient as a function of frequency. A noisy horizontal line will be obtained. As the number of samples increases, the “average” value of this noisy horizontal line will drop. But the noise power of the signal is constant! If the average value of this horizontal line is considered to be the noise floor, then the noise floor is a function of the number of samples. This is actually OK…as long as everyone is aware of what is going on as they analyze the data. However, in several projects we have seen folks refer to this as the noise floor, and then be surprised when the noise floor moves up and down depending on how many points they use in their DFT.
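The thought experiment is easy to run. The sketch below is illustrative only: it assumes Gaussian noise (the argument needs only i.i.d. zero-mean samples), it assumes the per-bin power is |X_k|²/N² from an unnormalized DFT, and the DFT sizes, `SIGMA`, and the helper name `mean_bin_power` are hypothetical choices.

```python
import cmath
import math
import random


def mean_bin_power(samples):
    """Average of |X_k|^2 / N^2 over all bins: the height of the noisy line."""
    n_samples = len(samples)
    total = 0.0
    for k in range(n_samples):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, x in enumerate(samples)
        )
        total += abs(coeff) ** 2 / n_samples ** 2
    return total / n_samples


rng = random.Random(1)
SIGMA = 1.0  # hypothetical noise standard deviation; identical for both runs

floors = {}
for n_samples in (32, 128):
    samples = [rng.gauss(0.0, SIGMA) for _ in range(n_samples)]
    floors[n_samples] = mean_bin_power(samples)
    # Compare the measured line height with the predicted sigma^2 / N.
    print(n_samples, floors[n_samples], SIGMA ** 2 / n_samples)

# Quadrupling N should lower the line by about 10*log10(4), roughly 6 dB.
drop_db = 10.0 * math.log10(floors[32] / floors[128])
print(drop_db)
```

The noise generator is the same in both runs; only the number of points in the DFT changes, and that alone moves the apparent floor.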
Of course there is a natural way to define the noise floor such that it is independent of the number of samples…stay tuned!
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
IACP Product Introduction
I just spent the last three days at the IACP show in Denver—the annual conference and expo for the International Association of Chiefs of Police.
For anyone who was once a 12-year-old boy, IACP is about as cool as it comes, because there is all sorts of cop paraphernalia on display—from Bell helicopters to Sig Sauer firearms to light bars. Even our own CaseCracker Interview Management System was on display in a partner’s booth.
But the most exciting part about the show was that our customer Decatur Electronics unveiled the Digital Responder 4000, a product we’ve spent the last several months building for them. It’s a highly integrated digital video recorder for the in-police-car market, containing two channels of H.264 video compression, two channels of audio recording, integrated radar and GPS. You can read our press release here.
We built the Digital Responder 4000 in about nine months, devoting a team of about seven engineers to the project. I want to publicly thank Mike, Bernard, Ted, Barb, Dave and Wei Ning for all their efforts—great job, guys!
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Working with CUDA
We’ve recently been working with a cool technology that is rapidly penetrating scientific and engineering computing, but seems little known otherwise. It’s called CUDA. In a nutshell, it is an SDK that lets you run parallelizable, compute-intensive applications on your Nvidia graphics card instead of serially on your CPU.
CUDA is one of a number of emerging methods that all more or less enable the same concept. A similar approach is behind OpenCL, which is backed by Apple and purports to be more cross-platform, in that it eventually will allow developers to develop code for a range of GPUs and not just those from Nvidia. However, at the moment OpenCL is limited to Macs running Apple’s new Snow Leopard, so I’m not sure that’s more open than CUDA.
Our use of CUDA involves running a moderately intensive frame-based image-processing algorithm on extremely high-definition images. The images are 3840×1080 monsters, and the goal is to process a total of 24 such images per second, so it’s definitely not something you can accomplish on a CPU alone.
Our client in this case had originally implemented the image-processing algorithm in MATLAB, so our first task was to convert from MATLAB to C. The resulting C code was the basis for using Nvidia’s CUDA SDK to get the algorithm running on the Nvidia board. We also had to tie the entire framework into Microsoft’s DirectShow architecture, because the data is delivered to us as H.264 encoded images.
The parallelization comes through the GPU chip’s many stream processors, also called thread processors. For example, the GeForce 8 GPU has 128 stream processors. Each stream processor comprises 1024 registers and a 32-bit floating-point unit. The stream processors are grouped on the chip into 16 clusters of 8. Each cluster shares a common 16 KB memory and is referred to as a “core.”
From a software perspective, each stream processor executes multiple threads, each of which has its own memory, stack, and register file. For the GeForce 8 GPU, each stream processor can handle 96 concurrent threads (for a total of 12,288) although this number is seldom reached in practice. Fortunately, programmers do not need to write explicitly threaded code since a hardware thread manager handles threading automatically. The biggest challenge for the programmer is to properly decompose the data so that the 128 stream processors stay active.
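To put the figures quoted so far side by side, here is some quick arithmetic. It uses only the numbers given in the text; the even split of pixels across threads is a hypothetical simplification, since the maximum thread count is seldom reached in practice.

```python
# Figures taken from the text above.
STREAM_PROCESSORS = 128
THREADS_PER_STREAM_PROCESSOR = 96
FRAME_WIDTH, FRAME_HEIGHT = 3840, 1080
FRAMES_PER_SECOND = 24

max_concurrent_threads = STREAM_PROCESSORS * THREADS_PER_STREAM_PROCESSOR
pixels_per_frame = FRAME_WIDTH * FRAME_HEIGHT
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND
frame_budget_ms = 1000.0 / FRAMES_PER_SECOND

# Hypothetical even split: pixels each thread would own at full occupancy.
pixels_per_thread = pixels_per_frame / max_concurrent_threads

print(max_concurrent_threads)  # 12288
print(pixels_per_frame)        # 4147200
print(pixels_per_second)       # 99532800
print(frame_budget_ms)         # about 41.7 ms to finish each frame
print(pixels_per_thread)       # 337.5
```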
In an efficient decomposition, the data is subdivided into chunks that can be allocated to a core. Algorithms execute most efficiently when the data for the threads executing on a core can all be stored in the core’s local memory. Data is transferred back and forth between the host CPU and the GPU’s global memory (i.e., graphics device memory) via DMA. Data is then transferred back and forth from device memory to core memory as needed. An efficient implementation will minimize the number of device-to-core data transfers.
In CUDA terminology, a block of threads runs on each core. While one thread block is processing its data, other thread blocks (running on other cores) are performing identical operations in parallel on other data. Thread blocks are required to execute independently and they must be able to be executed in any order. To specify the operations of a thread block, programmers define “C” functions, called kernels. The kernel function is executed by each thread in a thread block. Fortunately, there is specific CUDA support for synchronizing the threads within a block.
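The block and kernel structure can be illustrated without a GPU. What follows is a plain-Python emulation, not CUDA code and not our project’s algorithm: it runs sequentially, the pixel-inversion `kernel` is a stand-in, and `TILE`, `kernel`, and `launch` are hypothetical names. It shows only the decomposition idea: each tile of the image plays the role of one thread block, the same kernel function runs for every thread index in the block, and the blocks can be visited in any order without changing the result.

```python
import random

TILE = 4  # hypothetical tile edge; one tile stands in for one thread block


def kernel(image, out, block_row, block_col, thread_row, thread_col):
    """Per-thread work: invert one pixel (a stand-in for the real algorithm)."""
    row = block_row * TILE + thread_row
    col = block_col * TILE + thread_col
    # Guard threads that fall outside the image in partial edge tiles.
    if row < len(image) and col < len(image[0]):
        out[row][col] = 255 - image[row][col]


def launch(image, block_order_seed):
    """Run the kernel over every block, visiting blocks in a shuffled order."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    blocks = [
        (block_row, block_col)
        for block_row in range((height + TILE - 1) // TILE)
        for block_col in range((width + TILE - 1) // TILE)
    ]
    random.Random(block_order_seed).shuffle(blocks)
    for block_row, block_col in blocks:
        for thread_row in range(TILE):
            for thread_col in range(TILE):
                kernel(image, out, block_row, block_col, thread_row, thread_col)
    return out


# Small synthetic image whose size is not a multiple of the tile size.
image = [[(7 * r + 3 * c) % 256 for c in range(10)] for r in range(6)]
result_a = launch(image, block_order_seed=1)
result_b = launch(image, block_order_seed=2)
expected = [[255 - pixel for pixel in row] for row in image]
print(result_a == expected and result_b == expected)
```

Because no block reads another block’s output, the two shuffled launches agree with each other and with the direct computation, which is the independence property that thread blocks are required to have.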
We have enjoyed this project immensely and are excited by CUDA’s prospects. We look forward to using CUDA on other projects in the future. More information on CUDA can be found on Nvidia’s website. I found this article particularly useful as an introduction.
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.










