Archive for the ‘Theory’ Category
The Basics of 3D Image Acquisition
One of our clients is heavily involved in 3D video and has been for several years. However, several others are just now starting to think about it because of the uptick of interest in the consumer electronics world. Enough questions have been posed to us recently that it seemed worthwhile to pull together a few basic facts regarding 3D stereopair imaging and stereo disparity.
First, we need a simple model of a lens. Consider the diagram below:
In this picture, the long horizontal line that passes through the center of the lens is called the lens axis. The lens has the property that rays that pass through the center of the lens are undeviated. Therefore, the ray from the top of the tree, at a distance l to the left of the lens, passes straight through the center of the lens. (The tree has a height of h.) The lens also has the property that rays that arrive perpendicular to the lens are refracted to pass through the focal point of the lens. The focal point lies on the lens axis and is a distance f from the center of the lens. The intersection of these two rays shows where the image of the tree will be formed. You can see that the image of the tree is upside down, and has a new height h’. The image is formed a distance d to the right of the focal point.
By using similar triangles we see first that

$$\frac{h}{l} = \frac{h'}{f + d}$$

Using a different pair of similar triangles we also see that

$$\frac{h}{f} = \frac{h'}{d}$$

Solving the first equation above for h’, substituting the result into the second equation and simplifying, we derive the following relationship:

$$d = \frac{f^2}{l - f}$$
This is the fundamental equation of a simple lens. It shows that as the object gets further and further from the lens, i.e. as l increases, the distance of the image of the object from the focal plane decreases, i.e. d gets smaller. We can assume that the camera’s image sensor is located at a distance f from the lens, is perpendicular to the lens axis, and that all objects more than a certain distance away from the lens will be in focus. In other words, the image of all sufficiently distant objects will appear on the focal plane where the image sensor is located.
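As a quick numeric check of how fast the image approaches the focal plane, here is a minimal C sketch. It assumes the relation (l - f) · d = f², which follows from the two similar-triangle relations described above. The function name `image_offset_from_focal_point` is hypothetical, and all lengths are assumed to be in the same unit.

```c
#include <assert.h>

/* Distance d of the image beyond the focal point, for an object a distance l
 * in front of a simple lens of focal length f, using (l - f) * d = f * f.
 * Assumes f and l are in the same unit and that l > f. */
double image_offset_from_focal_point(double f, double l)
{
    assert(f > 0.0 && l > f);
    return (f * f) / (l - f);
}
```

For example, with f = 0.05 (a 50 mm lens, in meters) and l = 5.05, the function returns 0.0005: the image forms only half a millimeter beyond the focal plane, which is why treating the sensor as sitting at distance f is a reasonable approximation for distant objects.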
In the case of 3D video, two cameras are used to acquire a sequence of stereopair images, one from the left camera and one from the right. Different stereo geometries are possible, but the most common one is to place the two cameras horizontally apart from each other by a distance i, and to keep their focal planes coplanar. The diagram below illustrates this configuration:
The horizontal line at the bottom is the focal plane; it is clear from the diagram that the focal planes are coplanar. The lenses are a distance f from the focal plane and are separated by a distance of i from each other. We assume that a small object (or a point on a larger object) is located a distance l from the lens plane and a distance m to the right of the axis of the right lens. We want to know where the image of that object appears in the left and the right camera. In particular, we want to know if we overlaid the left image on top of the right image, how far apart would the images appear? Mathematically, we want to know the disparity, which we define to be

$$\rho = s_1 - s_2$$
where s1 and s2 are the distances from the image point to the intersection of the lens axis with the focal plane for the left and the right cameras respectively. Note that we are assuming that the object being imaged is far enough away that its image forms on the focal plane.
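Using our favorite trick of similar triangles we have the following two equations:

$$\frac{s_1}{f} = \frac{i + m}{l}$$

and

$$\frac{s_2}{f} = \frac{m}{l}$$

Solving the first equation for s1, the second equation for s2, taking the difference and simplifying yields

$$\rho = s_1 - s_2 = \frac{f\,i}{l}$$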
Although this expression was derived for an object to the right of the axis of the right camera, it is easy to show in a similar manner that it is also true for an object between the axes of the two cameras as well as for an object to the left of the axis of the left camera.
So what does this equation tell us? First, it says that for this particular camera geometry and a fixed focal length f, the disparity is only a function of the separation between the two cameras, i, and the distance of the object from the lens plane, l. Second, the equation tells us that the disparity increases as we increase the separation between the cameras. Finally, it tells us that the disparity decreases as the object gets further away from the cameras, approaching zero for objects an infinite distance away. (You can see this when you watch 3D content without wearing the special 3D glasses: The “distant” objects can be seen clearly by the naked eye, whereas the near objects appear blurry to the naked eye, because the value of ρ is greater.)
It should be clear from this equation that if a stereopair is available and corresponding points can be found in the left and right pictures, then the disparity between those points can be measured and the distance to the point can be computed.
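The distance computation can be sketched in a few lines of C. This assumes the parallel-camera geometry described above, for which similar triangles give a disparity of f · i / l. The function names are hypothetical. The focal length and the disparity must share one unit (both measured on the sensor), as must the camera separation and the object distance.

```c
#include <assert.h>

/* Disparity rho = f * i / l for two parallel cameras separated by i,
 * viewing a point a distance l from the lens plane. */
double disparity(double f, double i, double l)
{
    assert(l > 0.0);
    return f * i / l;
}

/* Inverse relation: distance to the point from a measured disparity. */
double distance_from_disparity(double f, double i, double rho)
{
    assert(rho > 0.0);
    return f * i / rho;
}
```

With f = 0.004 and i = 0.065 (a 4 mm lens and a 65 mm camera separation, in meters), a point 2 m away has a disparity of 0.00013 m, or 0.13 mm on the sensor. Feeding that disparity back into `distance_from_disparity` recovers the 2 m distance.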
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Delta Sigma Converters: Filtering, Decimation, and Simulations
In my first post on ΔΣ converters I presented an intuitive way to derive the modulator portion of the converter. Now we need to look at what comes after the modulator—namely, the digital filter and the decimator. The high-level structure of the converter looks like this:
The analog input voltage, v(t), is assumed to be scaled so that its range is in the interval [-1, 1]. The output of the modulator is a sequence of uniform width pulses with a magnitude of either ‑1 or 1. These pulses are generated at an “oversampled” rate—in other words, at a rate greater than the Nyquist rate. The oversampling is by some significant factor, for example, 64. Let the highest frequency present in the signal be B; the Nyquist rate is then 2B, and the output rate of the modulator for an oversampling factor of 64 is 128B.
We wish to reduce the sample rate to the Nyquist rate while extracting the signal v(t) from the pulses. We do this by lowpass filtering the pulse train to a bandwidth of B and then sampling the filter’s output at the rate 2B. Consider an FIR lowpass filter. Because the pulse sequence only has 1s and -1s in it, the FIR filter can be implemented with additions and subtractions only. Furthermore, although the 1s and -1s are shifted into the filter at the oversampled rate, the filter’s output only needs to be computed at the Nyquist rate, which is nice from a computational perspective because we only need to compute the samples we’re going to keep!
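The decimating filter described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code used for the simulations: the function name `fir_decimate_pm1`, the `signed char` pulse representation, and the choice to start producing outputs only once the filter is full are all mine.

```c
#include <assert.h>
#include <stddef.h>

/* Filters a stream of +1/-1 pulses with the FIR taps h[0..ntaps-1], keeping
 * only every decim-th output sample. Each input is +1 or -1, so every tap is
 * either added or subtracted and no multiplications are needed.
 * out[] must have room for every kept sample; returns how many were written. */
size_t fir_decimate_pm1(const signed char *pulses, size_t npulses,
                        const double *h, size_t ntaps,
                        size_t decim, double *out)
{
    size_t nout = 0;

    assert(ntaps > 0 && decim > 0);
    /* Begin once the filter is full, then advance decim inputs per output. */
    for (size_t n = ntaps - 1; n < npulses; n += decim) {
        double acc = 0.0;
        for (size_t k = 0; k < ntaps; k++) {
            if (pulses[n - k] > 0)
                acc += h[k];    /* pulse of +1: add the tap */
            else
                acc -= h[k];    /* pulse of -1: subtract the tap */
        }
        out[nout++] = acc;
    }
    return nout;
}
```

With 64x oversampling, `decim` would be 64, so the inner loop runs once for every 64 input pulses rather than once per pulse.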
To make this all clear, let’s trace the flow of a simple input signal through the system in both the time and frequency domain. For this example, I’ll let B be 4 kHz, which corresponds to telephone quality audio. Let’s assume an oversampling factor of 64. Then pulses are output from the modulator at the rate 2 × 4000 × 64 = 512 kilobits/sec (bearing in mind that we convert the sequence of -1s and 1s into an equivalent sequence of 0s and 1s, so each pulse is captured as a single bit). As an input signal we’ll use the sum of two cosines, one at the low end of the pass band and one at the high end (730 Hz and 3.765 kHz respectively). Both cosines have the same amplitude, 0.5, so their sum is within the [-1,1] range required. The input signal looks like this:
The spectrum of this signal computed over a 4-second interval is shown below. Here the x-axis is in kHz and the y-axis is in dB relative to the highest power frequency in the signal.
As expected, the input signal has two well-defined peaks.
The sequence of pulses output from the ΔΣ modulator when this signal is input looks like this:
In intervals where the input signal is large, many 1s are output, while in intervals where the input signal is close to zero, 1s and -1s occur with a nearly equal frequency. Since we propose to recover the input signal by lowpass filtering this pulse stream, it is interesting to look at the spectrum of the pulse stream. In the range of 0 to 15 kHz it looks like this:
The original signal spectrum is clearly visible, but there is lots of noise above 4 kHz. In fact, if we plot all the way out to 256 kHz (1/2 the oversampled rate of 512 kHz) we see the following:
The 1-bit sampling has introduced a huge amount of quantization noise, and it increases with frequency. Were we to subsample the signal at an 8 kHz rate without first lowpass filtering, all of this noise would alias into the 4 kHz band we care about! To do the lowpass filtering I used a 1000-tap FIR filter in my simulation; I wanted to show the effect of a very good filter. After filtering, the spectrum looks like this:
Most of the out-of-band quantization noise has now been eliminated. After subsampling the signal by a factor of 64, we are down to the Nyquist rate. Any quantization noise remaining after the LPF has now been aliased into our frequency band of interest. The spectrum of the Nyquist sampled signal over the range 0 to 4 kHz is shown below; the two cosine signals have been converted with very little distortion.
As a final note, it’s worth showing the spectrum of the modulator’s output pulse train when the oversampling factor is only 4 instead of 64. As seen below (the graph covers the range 0 to 15 kHz), there is a lot more in-band noise (i.e. in the range 0 to 4 kHz) than occurs with 64x oversampling. Not surprisingly, higher sampling rates lead to better performance!
Here is the code I used for these simulations.
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Delta Sigma Converters: Modulation
The web is filled with introductions to Delta Sigma modulation (also sometimes referred to as Sigma Delta modulation) in the context of Delta Sigma converters. Unfortunately, the ones I’ve looked at fail to intuitively motivate how the modulator works. Therefore, my goal in this post is to show how the structure of a first-order ΔΣ modulator can be simply understood. In particular, I show how its structure can be derived from a trivial 10-line C program that performs the ΔΣ modulator function.
The basic problem solved by a ΔΣ modulator is this: given a fixed input value ν that lies in the range [-1, 1], the modulator outputs a pulse train satisfying the following criteria:
- Each pulse has the same fixed width τ
- Each pulse has an amplitude of -1 or 1 only
- The time average value of the pulse train emitted converges to ν
To derive from first principles a system diagram to perform this task, I find it useful to proceed as follows: First, write down a few equations to make precise what we’re after; next, write a computer program that emits a pulse train satisfying the equations; and finally, draw a block diagram that realizes the computer program.
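So let’s get started. First, a waveform f(t) has an average value on the interval [0,T] of

$$\bar{f} = \frac{1}{T}\int_0^T f(t)\,dt$$

Let l(i) be the magnitude of the pulse of duration τ emitted in the interval [iτ, (i + 1)τ]; l(i) must be 1 or ‑1. The average value of the pulse train waveform f(t) corresponding to a sequence of N pulses is given by

$$\bar{f} = \frac{1}{N\tau}\int_0^{N\tau} f(t)\,dt = \frac{1}{N\tau}\sum_{i=0}^{N-1} l(i)\,\tau = \frac{1}{N}\sum_{i=0}^{N-1} l(i) = \frac{S(N)}{N}$$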
So we don’t need to worry about τ, which is nice! Here we have defined S(N) to be the sum of the l(i) values from 0 to N – 1, so S(1) = l(0), S(2) = l(0) + l(1), and so on.
How can we write a computer program to decide whether the next pulse to output should be a 1 or a -1? I think the first algorithm most of us would think of is the following: “keep track of the average value, S(N) / N, at each instant. If the current average is less than ν, output a 1 next; otherwise output a -1.” I will skip a formal proof that the pulse train emitted by this algorithm converges as desired, but it is intuitively reasonable that it does, because each additional output pulse moves the average up or down by a smaller amount than the previously emitted pulse, and each pulse nudges the average in the desired direction.
It’s straightforward to write a computer program that accomplishes the above. In C-like pseudo code the program looks like this:
```c
N = 1;
while (1) {
    S_N = compute_current_S_N();   /* sum of the pulses emitted so far */
    if ((S_N / N) < v)
        output(1);
    else
        output(-1);
    N++;
}
```
This can be equivalently written as
```c
N = 1;
while (1) {
    S_N = compute_current_S_N();   /* sum of the pulses emitted so far */
    if (N * v - S_N > 0)
        output(1);
    else
        output(-1);
    N++;
}
```
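For readers who want to run the algorithm, here is one possible runnable form of the pseudo code above. It is a sketch under a few assumptions of mine: the sum S(N) is kept as a running total rather than recomputed, the first pulse is fixed at 1, the pulses are written to a caller-supplied array instead of being output, and the name `emit_pulses` is hypothetical.

```c
#include <assert.h>

/* Writes npulses pulses (+1 or -1) whose running average tracks v, using the
 * test N*v - S_N > 0 from the program above. Returns the final average
 * S(N)/N over all npulses pulses. */
double emit_pulses(double v, int npulses, signed char *pulses)
{
    long S_N = 1;        /* S(1) = l(0), with the first pulse taken to be 1 */

    assert(v >= -1.0 && v <= 1.0 && npulses > 0);
    pulses[0] = 1;
    for (int N = 1; N < npulses; N++) {
        int l = (N * v - (double)S_N > 0.0) ? 1 : -1;
        pulses[N] = (signed char)l;
        S_N += l;        /* S(N+1) = S(N) + l(N) */
    }
    return (double)S_N / npulses;
}
```

Calling `emit_pulses(0.3, 10000, buf)` returns an average within about 0.0002 of 0.3. The quantity N·v - S(N) never leaves the range [-2, 2], so the error in the average shrinks like 2/N.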
Now it’s time to draw a system diagram to implement this algorithm. Clearly it will contain some sort of feedback, because the pulse value to output next depends on the past sequence of output values. There will also be a need to compute a difference, and somehow we will need to accumulate the values S(N) and Nν. Consider the system diagram below:
Let’s analyze this diagram to see if it does the right thing. I’ve labeled the input ν and the feedback line l. With respect to the feedback line, the box with a z in it means “delay by one time instant”; the indexes of l on either side of this box reflect that delay. The box with a plus sign in it is an instantaneous summer, which has been configured in this case to take the difference between its two inputs. The large box is a quantizer, and the graph within it shows the input/output transfer function q( ). The graph means that any input value greater than 0 will result in an output value of 1, while any input value less than 0 will result in an output value of ‑1. Finally, the box with a capital sigma (Σ) in it is an accumulator, also known as an integrator. Its output is the sum of all its previous inputs.
To see if this system diagram results in the required stream of pulses l(i), we need to analyze the variable y. This is most easily done by creating a table as shown below:
| N | l(N-1) | y(N) | l(N) |
| --- | --- | --- | --- |
| 0 | N/A | y(0) is initialized to 0 | l(0) is initialized to 1 |
| 1 | l(0) = 1 | y(1) = ν - l(0) | l(1) = q( y(1) ) |
| 2 | l(1) | y(2) = (ν - l(0)) + (ν - l(1)) = 2ν - l(0) - l(1) | l(2) = q( y(2) ) |
| 3 | l(2) | y(3) = 3ν - l(0) - l(1) - l(2) = 3ν - S(3) | l(3) = q( y(3) ) |
| … | | and in the general case | |
| N | l(N-1) | y(N) = Nν - S(N) | l(N) = q( y(N) ) |
The quantizer is executing the “if then” portion of the computer program, and the summer and accumulator are computing the running sum needed as the argument for the if function (i.e., as the input to the quantizer). This system will do the trick!
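The table can also be checked numerically. The sketch below steps the loop described above (summer, accumulator, quantizer, and one-step delay) and compares the accumulator output with Nν - S(N) at every instant. The function names are hypothetical, and mapping a quantizer input of exactly 0 to -1 is an assumption, since the description only covers inputs above and below 0.

```c
#include <assert.h>

/* Quantizer q(): +1 for an input above 0, otherwise -1. Sending an input of
 * exactly 0 to -1 is an assumption borrowed from the program's else branch. */
static int quantize(double y)
{
    return (y > 0.0) ? 1 : -1;
}

/* Steps the loop for npulses time instants and returns the largest
 * |y(N) - (N*v - S(N))| observed along the way. */
double modulator_max_mismatch(double v, int npulses)
{
    double y = 0.0;      /* accumulator output, y(0) = 0 */
    int l = 1;           /* delayed feedback value, l(0) = 1 */
    long S = 1;          /* S(1) = l(0) */
    double worst = 0.0;

    assert(v >= -1.0 && v <= 1.0);
    for (int N = 1; N < npulses; N++) {
        y += v - l;                           /* accumulate v - l(N-1) */
        double err = y - (N * v - (double)S);
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
        l = quantize(y);                      /* l(N) = q(y(N)) */
        S += l;                               /* S(N+1) = S(N) + l(N) */
    }
    return worst;
}
```

The returned mismatch stays at the level of floating-point rounding, which is the numeric counterpart of the general-case row y(N) = Nν - S(N).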
This diagram is often written in a slightly different form to more closely model its realization in hardware as follows:
All that is going on here is that the quantizer in the first diagram is replaced with the combination of an analog-to-digital converter (ADC) and a digital-to-analog converter (DAC). The ADC outputs a 1 if its input is greater than 0 and a 0 if its input is less than zero, so a bitstream is created at the indicated point. The DAC converts the bitstream back to a sequence of -1 and 1 values. Finally, to indicate the analog nature of this modulator when used as part of a ΔΣ converter, the accumulator is replaced by an integrator:
Now, assume the ΔΣ modulator above is used as part of an analog-to-digital converter. If the input signal changes slowly relative to the rate at which bits are produced, then many bits will be generated before the signal changes significantly, and the average value the bits encode will be close to the true value. In other words, the 1s and -1s the bitstream represents have time to “average out” to an accurate representation of the input signal. Of course, this average still has to be computed numerically after the modulator; more on that later. However, the faster the input signal changes, the fewer bits will be generated before the signal has changed significantly, and the less accurately the bitstream will represent the signal. At an intuitive level, this is why the quantization noise of ΔΣ converters increases as a function of the input frequency of the signal.
When used as part of an analog-to-digital converter, the modulator is followed by a digital filter and a decimator. The digital filter computes the averages encoded in the bitstream; the decimator reduces the sample rate from the high rate of the 1-bit ADC to the Nyquist rate associated with the final multi-bit samples. The binary representation of the decimated samples is used as the AD output bits.
Part II discusses filtering and decimation in greater detail, and presents some simulation results.
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
The Math Behind Analog Video Resolution
The world is moving in the direction of HDTV, but NTSC “standard def” signals are still common for many reasons and will remain so. One important reason is that cameras that output NTSC are widely available and cheap! Many applications, including a lot of security applications, simply don’t require the resolution of HDTV…and don’t want to incur the camera cost and bandwidth hit it requires.
So, what is resolution anyway, especially with regards to an analog NTSC video signal? Analog video cameras, especially in the CCTV industry, are sold using “horizontal TV lines”, or HTVL, as one of their key specifications. Unfortunately, the math behind that concept is not well understood.
To understand resolution we start with the aspect ratio. The aspect ratio of a picture is the ratio of its width to its height. Different aspect ratios are in use today for different applications. HDTV has an aspect ratio of 16:9. Standard definition TV has an aspect ratio of 4:3. In general, resolution is measured in a circle whose diameter is equivalent to a picture’s smallest dimension. The diagram below illustrates the case for NTSC:

In the above diagram, the circle has a diameter of 3 units or 1 “picture height”.
Now, imagine a uniformly spaced sequence of vertical black lines of constant width. The white space between the black lines should be the same width as the black lines themselves. Counting both black and white lines, how many lines can be physically resolved within the above circle? The answer to this question is the horizontal resolution. Before we can go further, we need some facts regarding NTSC:
First, there are 525 scan lines per picture, and the horizontal scanning frequency is

$$f_H = \frac{4.5\ \text{MHz}}{286} \approx 15{,}734.26573\ \text{Hz}$$
This is one of NTSC’s magic numbers. The horizontal line time is therefore approximately 1 / 15,734.26573, or 63.5556 µsec.
Second, the horizontal blanking period is 10.7 µsec. During this period a horizontal sync pulse is transmitted, as well as a chroma burst (to enable decoders to demodulate the correct color), and a reference black level. The active line time—the period during which information is actually being drawn on the visible screen—is therefore 63.5556 – 10.7, or 52.8556 µsec.
Finally, the highest broadcast luminance signal is 4.2 MHz.
Based on the above, we can compute the highest horizontal resolution that can be present in an NTSC signal as follows:

$$2 \times 4.2 \times 52.8556 \times \frac{3}{4} \approx 333 \text{ lines}$$
The product of the middle two parameters is the number of complete cycles present in one active line (the MHz and microseconds cancel); the factor of 2 is present because we count both the white and black lines in the horizontal resolution calculation. Multiplying by three-fourths takes into account the circle in which the horizontal resolution is defined. Bear in mind that this pattern would be displayed on an NTSC TV as grey, not as a crisp sequence of black and white lines, due to the rolloff of the various filters used to limit the video bandwidth.
In the vertical direction, resolution is limited by the number of scan lines. There are 480 scan lines in the visible area of a picture, so one would be tempted to assert that the vertical resolution is 480. However, imagine a uniformly spaced sequence of horizontal lines analogous to the vertical line pattern described above. We want this horizontal pattern to be discernible regardless of its relative relationship to the scanning lines. In other words, as the pattern is displaced vertically, the number of lines should still be easily resolved.
Imagine that we have a pattern of 480 horizontal lines, alternating black and white. When these lines are exactly midway between the scanning lines, the resulting picture will be grey. Why? Because the scanning will average the black and white inputs together for each reproduced line. So a pattern of 480 lines would not be discernible: the resolution must be less. The Kell factor measures by how much the vertical resolution is reduced relative to the number of scan lines, and it is usually assumed to be around 0.7 for a stationary pattern. This implies the following vertical resolution:

$$0.7 \times 480 = 336 \text{ lines}$$
The horizontal and vertical resolutions are therefore approximately equal. This was one of the design goals of NTSC.
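The arithmetic behind these two figures is easy to reproduce. The following C sketch uses only the numbers quoted above (line rate, blanking period, luminance bandwidth, visible scan lines, and Kell factor). The function names are hypothetical.

```c
#include <assert.h>

/* Active line time in microseconds: the line period minus blanking. */
double active_line_usec(double line_rate_hz, double blanking_usec)
{
    assert(line_rate_hz > 0.0);
    return 1.0e6 / line_rate_hz - blanking_usec;
}

/* Horizontal resolution in lines per picture height: 2 lines per cycle,
 * times cycles per active line (MHz * usec), times height/width. */
double horizontal_tvl(double bandwidth_mhz, double active_usec,
                      double aspect_w, double aspect_h)
{
    assert(aspect_w > 0.0);
    return 2.0 * bandwidth_mhz * active_usec * (aspect_h / aspect_w);
}

/* Vertical resolution: visible scan lines scaled by the Kell factor. */
double vertical_tvl(int visible_lines, double kell)
{
    return kell * visible_lines;
}
```

With a 15,734.26573 Hz line rate and 10.7 µsec of blanking, `active_line_usec` gives about 52.8556. Passing that and 4.2 MHz to `horizontal_tvl` for a 4:3 picture gives roughly 333 lines, and `vertical_tvl(480, 0.7)` gives 336.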
Note that the Kell factor is sometimes assumed to be a larger number for a pattern in motion because visual averaging will cause the eye to “ignore” the occasional grey or blurry pattern. Values as high as 0.9 are assumed for moving images.
Note further that the reason the Kell factor does not apply horizontally is that it is possible to put down the dots on a CRT so close that regardless of the horizontal phase of a 4.2 MHz vertical line pattern, it will be resolved. However, the Kell factor does apply in the horizontal direction for a digital display such as a computer monitor or LCD panel.
In a digital SDTV system based on the CCIR 601 digital sampling standard, the luminance information is sampled at 13.5 MHz. The number of samples per active line is therefore given by 13.5 × 52.8556 = 713.55. This number is often rounded up to 720 samples; rounding up provides some headroom on either side of the visible line to hide the edge effects of various digital processing operations such as filtering. Plus, margin is generally good in any design!
Nyquist sampling theory says that sampling at 13.5 MHz would theoretically allow a horizontal frequency as high as 13.5 / 2 = 6.75 MHz to be captured. However, because of the inability to implement perfect anti-aliasing filters, this cannot be achieved in practice. A reduction factor of 0.75 is appropriate, implying that the CCIR-601 sampling standard is good for horizontal frequencies as high as 0.75 × 6.75 = 5.06 MHz. This is substantially better than old-fashioned analog NTSC broadcasts. Therefore, on a good monitor with analog component input, a CCIR-601 signal can achieve the following horizontal resolution:

$$2 \times 5.06 \times 52.8556 \times \frac{3}{4} \approx 401 \text{ lines}$$

For a SIF resolution picture (360×240), we are effectively sampling at 6.75 MHz, not 13.5 MHz, so the highest horizontal frequency that can be reproduced is 0.75 × (6.75/2) = 2.53 MHz, which corresponds to a horizontal resolution of:

$$2 \times 2.53 \times 52.8556 \times \frac{3}{4} \approx 201 \text{ lines}$$
Bear in mind that to achieve this number one needs to do an excellent job of filtering.
In the case of displaying a SIF picture on a computer monitor, we can approximate the horizontal resolution by using a Kell factor in the horizontal direction. This yields the following:

$$0.7 \times 0.75 \times 360 = 189 \text{ lines}$$
In the above, the 0.7 is the Kell factor and the 0.75 is to account for the aspect ratio (remember, resolution is computed inside the circle). A value of 360 was used for purity’s sake. The MPEG world uses 352 and 704 because they are related by a factor of 2 and both are multiples of 16 (360 is not a multiple of 16). It also means there is a little less data to compress!
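The digital-sampling figures can be reproduced the same way. The sketch below assumes the 52.8556 µsec active line time and the 4:3 aspect ratio used throughout this post. The function names and the `ACTIVE_LINE_USEC` constant are hypothetical.

```c
#include <assert.h>

#define ACTIVE_LINE_USEC 52.8556   /* NTSC active line time quoted in the text */

/* Highest usable horizontal frequency in MHz: the Nyquist limit scaled by a
 * practical anti-aliasing filter factor (0.75 in the text). */
double usable_bandwidth_mhz(double sample_rate_mhz, double filter_factor)
{
    assert(sample_rate_mhz > 0.0);
    return filter_factor * sample_rate_mhz / 2.0;
}

/* Lines per picture height for a given bandwidth on a 4:3 picture. */
double tvl_from_bandwidth(double bandwidth_mhz)
{
    return 2.0 * bandwidth_mhz * ACTIVE_LINE_USEC * 0.75;
}

/* Pixel-display approximation: Kell factor times samples per line times 3/4. */
double tvl_from_pixels(int samples_per_line, double kell)
{
    return kell * samples_per_line * 0.75;
}
```

`tvl_from_bandwidth(usable_bandwidth_mhz(13.5, 0.75))` comes out near 401 lines and the same call with 6.75 near 201 lines, while `tvl_from_pixels(360, 0.7)` gives 189.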
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Noise Floor
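As shown in a previous post, for samples taken from a zero mean i.i.d. noise signal, the expected power of the k’th DFT coefficient is given by

$$E[P_k] = \frac{\sigma^2}{N R}$$

where $\sigma^2$ is the noise variance, $N$ is the number of samples transformed, and $R$ is the resistance across which the samples are measured.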
As discussed in that post, when plotting the power of the k’th coefficient as a function of frequency, the “noise floor” will decrease as N increases…even though the noise power is the same regardless of the value of N chosen.
In situations where it is desirable to avoid this dependency, we can use the DFT to approximate the power spectral density. To see how this is done, consider that the frequency spacing $\Delta f$ between DFT coefficients depends on the sampling frequency, $F_s$, and the number of samples $N$ as follows:

$$\Delta f = \frac{F_s}{N}$$

The power of the k’th coefficient represents the power of all the energy in a bin of width $\Delta f$.

Therefore, the power spectral density in Watts/Hz at the frequency corresponding to the k’th coefficient is given by the following ratio

$$\frac{E[P_k]}{\Delta f} = \frac{\sigma^2 / (N R)}{F_s / N} = \frac{\sigma^2}{R F_s}$$

Note that this value is independent of N. Furthermore, the DFT gives a useful representation of the spectrum over the range

$$0 \le f < F_s$$

…so we can get the total power of the signal by multiplying the bandwidth $F_s$ times the power spectral density to get, as expected,

$$F_s \times \frac{\sigma^2}{R F_s} = \frac{\sigma^2}{R}$$

All of this means that defining the noise floor to be

$$\frac{\sigma^2}{R F_s}\ \text{Watts/Hz}$$
is a good approach. The noise variance is easily measured in the time domain in most systems by just collecting a large number of samples with no input signal applied and then computing the variance of the samples. The sampling rate and the resistance across which the noise samples are collected are normally fixed.
Furthermore, when analyzing data signals in the frequency domain, it should be clear that a spectrum program that plots Watts/Hz is a good choice. The noise floor will appear constant from plot to plot regardless of how many data samples are transformed, and signals will rise above and fall below the noise floor depending on their strength.
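The measurement procedure described above can be sketched in C. The sketch assumes that the noise floor is the variance divided by the product of the resistance and the sampling rate, and that a DFT bin is Fs/N wide. The function names are hypothetical, and the variance is the simple divide-by-n form.

```c
#include <assert.h>
#include <stddef.h>

/* Variance of n samples about their own mean (divide-by-n form). */
double sample_variance(const double *x, size_t n)
{
    double mean = 0.0, var = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        var += (x[i] - mean) * (x[i] - mean);
    return var / (double)n;
}

/* Noise floor in Watts/Hz: sigma^2 / (R * Fs). Does not depend on N. */
double noise_floor_w_per_hz(double sigma2, double r_ohms, double fs_hz)
{
    assert(r_ohms > 0.0 && fs_hz > 0.0);
    return sigma2 / (r_ohms * fs_hz);
}

/* Expected power in one DFT bin: the floor times the bin width Fs/N. */
double expected_bin_power(double sigma2, double r_ohms, double fs_hz, size_t n)
{
    assert(n > 0);
    return noise_floor_w_per_hz(sigma2, r_ohms, fs_hz) * (fs_hz / (double)n);
}
```

For a variance of 1 V², R = 50 ohms and Fs = 8000 Hz, the floor is 2.5 µW/Hz whatever the transform length, whereas `expected_bin_power` changes with every choice of N.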
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Power of a Zero Mean Noise Signal
Part two in a three-part blog series. Read the first entry on power measurements with the DFT.
The concept of the “noise floor” arises repeatedly when looking for signals in the presence of noise. Intuitively, signals disappear when they fall below the noise floor. But Care (yes, that’s a capital C!) must be taken when using this concept to make sure that everyone means the same thing by “noise floor.”
As a stepping stone to a deeper discussion, we need to look at the power of a noise signal in both the time and the frequency domain. Let the samples of a noise signal be $x_n$, and assume the samples are a sequence of i.i.d. zero mean Random Variables (RVs) with variance $\sigma^2$. This means that the $X_k$ values, the DFT coefficients of the noise samples, are also RVs. What can we say about the distribution of the $X_k$ values?
Recall that by the definition of the DFT,

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}$$

Taking the Expected value of both sides we see that:

$$E[X_k] = \sum_{n=0}^{N-1} E[x_n]\, e^{-j 2\pi k n / N} = 0$$

This is true because the $x_n$ themselves are zero mean by assumption.
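Now, recall from a previous post that the power of the k’th coefficient is given by

$$P_k = \frac{|X_k|^2}{N^2 R}$$

What can we say about $E[|X_k|^2]$?

Using the fact that the $x_n$ are independent of each other, as well as being zero mean, it is straightforward to show that

$$E[|X_k|^2] = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} E[x_n x_m]\, e^{-j 2\pi k (n - m) / N} = N \sigma^2$$

So we see from substitution that

$$E[P_k] = \frac{N \sigma^2}{N^2 R} = \frac{\sigma^2}{N R}$$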
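Furthermore, the expected power of the noise signal is given by

$$E[P] = \sum_{k=0}^{N-1} E[P_k] = \sum_{k=0}^{N-1} \frac{\sigma^2}{N R}$$

Which simplifies to

$$E[P] = \frac{\sigma^2}{R}$$

This is exactly what we would expect from a time domain calculation of the noise power, which looks like this

$$E\left[\frac{1}{N R} \sum_{n=0}^{N-1} x_n^2\right] = \frac{1}{N R} \sum_{n=0}^{N-1} E[x_n^2] = \frac{\sigma^2}{R}$$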
At this point we can clearly see the problem with defining the frequency domain noise floor. Although both the time domain and the frequency domain calculations yield the correct average power, the expected power of the k’th DFT coefficient depends on the number of time domain samples, N.
Consider the following thought experiment. Generate a sequence of noise samples with the required statistics, take their DFT, and plot the power of the k’th coefficient as a function of frequency. A noisy horizontal line will be obtained. As the number of samples increases, the “average” value of this noisy horizontal line will drop. But the noise power of the signal is constant! If the average value of this horizontal line is considered to be the noise floor, then the noise floor is a function of the number of samples. This is actually OK…as long as everyone is aware of what is going on as they analyze the data. However, in several projects we have seen folks refer to this as the noise floor, and then be surprised when the noise floor moves up and down depending on how many points they use in their DFT.
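To put a number on the thought experiment, here is a short worked example. It assumes the expected per-bin power falls as 1/N, that is, σ²/(NR) for noise of variance σ² measured across a resistance R. The transform lengths 1024 and 4096 are chosen purely for illustration.

$$10 \log_{10} \frac{\sigma^2 / (1024\,R)}{\sigma^2 / (4096\,R)} = 10 \log_{10} 4 \approx 6.02\ \text{dB}$$

So moving from a 1024-point to a 4096-point DFT lowers the plotted line by about 6 dB, even though the total noise power, the sum over all bins, is unchanged.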
Of course there is a natural way to define the noise floor such that it is independent of the number of samples…stay tuned!
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
Power Measurements with the DFT
Many of the systems we work on here at Cardinal Peak employ analog-to-digital conversion in one form or another. It seems like almost invariably some analog signal needs to be digitized or “sampled.” Sampling is the process of periodically looking at an analog signal at some sample rate, and then converting each measured value into a digital representation. The resulting samples have a resolution of some number of bits per sample; different applications will require different resolutions. Of course, the advantage of a stream of digital samples is that they can be efficiently processed by a DSP or other processor.
When analyzing sample data, the Discrete Fourier Transform is a powerful and handy tool, and we employ it often in the work we do for our clients. A recurring theme is that of power: how can the DFT be used to measure a signal’s power? Many of our clients’ engineers are familiar with the DFT and Parseval’s theorem, but need some reminding on how to apply it to power measurements in a real system. So this post briefly reviews that issue.
First, because the DFT pair can be expressed in different ways, it’s important to state which definition I’m using. I prefer the following commonly-encountered formulation:


In these equations the $x_n$ are the time domain samples of the analog signal, acquired at some sampling rate $F_S$, and the $X_k$ are the DFT coefficients.
Parseval’s theorem relates the time domain samples to the frequency domain samples via the following relationship:

If the signal being sampled is a voltage, we can turn the left side into an approximation of the integral of the voltage squared as follows:

Here we have multiplied both sides by $1 / F_S$ because the samples are spaced by this amount of time from each other. Since we are processing N samples total, the time required to acquire these N samples is $N / F_S$. Dividing both sides by this total duration, we get the average of the integral of the voltage squared over the acquisition period of the N samples.

Finally, the voltage was measured across a resistance of some value R, so the average power of the signal is given approximately by

Furthermore, it is reasonable to speak of the power carried at frequency

as being

Coming soon: noise floor and the DFT.










