Archive for the ‘Ben’ Category
High-School Prior Art
Based on a Slashdot link, I read a blog post today about the Oracle/Google lawsuit over Java virtual machine patents.
Among the list of software patents that Oracle has decided to sue Google with are two that I’m fairly certain that I wrote prior art for. This was in 1983 or 1984, while I was a high school student. This work was published in a book by the University of Oklahoma press, and I believe I still have a copy somewhere.
It seems that the bar for patents is not very high if patents are granted for something a high school student could come up with a decade earlier!
In ’83 or ’84, I wrote a hybrid interpreter/compiler for a Forth-like language.
I had read the book Threaded Interpretive Languages: Their Design and Implementation and decided to write such a language for my TRS-80 model 1. I gave it the snappy name, TIL16U, which stood for Threaded Interpretive Language, 16 bit unsigned ints. The only numeric datatype was the 16-bit unsigned integer, which also nicely represents a pointer datatype on the Z-80 processor.
I took a slightly different approach to writing the language than the book described. Initially I wrote a simple single-pass compiler from the source language to a simple bytecode in TRS-80 Level II BASIC.
Once I could generate bytecode from ASCII source, I wrote an interpreter for it in Z-80 assembly. The interpreter for such a simple stack based language is very small, and it was dwarfed by the inclusion of a text editor (incorporated from an article in 80 Microcomputing with a source listing) and cassette I/O routines to load/save my programs off to tape. I used the initial compiler written in BASIC to compile the runtime library for my interpreter.
My library data structure consisted of a routine name stored as an ASCII string, a length, a flag to indicate whether a given routine was implemented in Z-80 machine code or in bytecode, and then the routine body. While interpreting, the TIL16U interpreter would look at the bytecode for a call routine, and following this would be a routine name. The interpreter would look up the routine name in the library, and either jump to it directly (if it was in machine code), or continue interpreting at the start of the routine (if it was in bytecode).
At the time, I was very interested in writing videogames—my parents wouldn’t allow me to have an Atari 2600—and the 1.77 Mhz processor on the TRS-80 was not particularly powerful. So the lookup step bothered me performance-wise. I knew I could implement a symbol table and a second pass in the compiler to patch in the addresses of routines, but this made the language less dynamic. I wanted to be able to type in code interactively and have it work, and it was also a hassle to store addresses in the code because whenever I built a new version of the interpreter or the runtime library they might have to change.
So I hit upon the expedient of making the interpreter selectively compile bits of the running program to machine code as it went and overwriting the bytecode with machine code that could be directly jumped to, and pushing the address of the interpreter on the stack so that when a Z-80 RET instruction executed, the interpreter continued executing. I thought it was cool that the more you ran a program, the faster it would go.
In particular, the bytecode to invoke a library routine could be replaced by a three-byte Z-80 CALL instruction (a one byte opcode plus two bytes of address).
So when the interpreter found such a bytecode sequence, rather than execute it, it would look up the symbol address, overwrite it with the machine instruction, PUSH the return address of the interpreter on the stack and then jump to it. In the future, the interpreter would look at the opcode byte and know just to jump directly to the code sequence. (As an aside, it is interesting how much of the Z-80 instruction set is still burned into my brain. I can still remember an unconditional jump, or JP, is 0xC3, and the RET instruction is 0xC9, thanks to writing this code.)
I believe the symbolic routine name to numeric address is the mechanism described in one of the patents in the Oracle lawsuit, granted in 1992, and last updated on April 29, 2003.
Additionally, this patent, also referenced in the lawsuit, covers overwriting virtual machine instructions with native code and executing those instead of the bytecode. This patent was originally filed in 1997, and was last updated in 2005.
I’m not claiming that I have perfect prior art for these two patents—I haven’t even studied them very closely—but I believe the techniques I used are at least highly similar to what is patented. Only a patent court can decide if something infringes, anyway.
I’d like to claim this was entirely because of my brilliance, but I think it is more likely that the US Patent Office has been granting software patents that are so obvious that a reasonably bright teenager with a computer can come up with them.
To finish the story, the most successful application of TIL16U did actually turn out to be video games. My parents bought me a ChromaTRS color video board for my TRS-80 (“with 15 vivid colors!”), and I wrote several games for it, one of which I remember was a parachute jump game, where you jumped out of a plane and tried to land on a target. The plane speed and wind speed would vary, so you had to judge how the windsock was looking to get the high score.
Ben Mesander has more than 18 years of experience—not counting high school!—leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Real-Time Ethernet
A recent project for a customer was to implement a transport for a real-time signal processing application over Gigabit Ethernet (GbE). The project was especially interesting, because our customer’s requirement was for extremely low latency: The transport needed to be able to send replies to incoming packets with a custom EtherType within 74.4 μsec. This exceeded the capabilities of existing Ethernet protocols.
There are several approaches possible for real-time networking under Linux, including RTAI with the RTnet hard real-time network stack.
However, the customer was already using Red Hat Linux servers to perform the signal processing, so we decided to try the less intrusive Red Hat MRG kernel to see if it could be modified to incorporate the desired networking protocol. MRG incorporates many kernel patches to improve real-time performance and lower latency.
Since the timing requirements were so tight, I decided that only an in-kernel implementation was likely to succeed. If the code had to switch between kernel mode and user mode, this would require additional overhead, interactions with the Linux scheduler, and moving the data between the kernel and user address spaces.
I started development by building two server-class Linux machines running Red Hat Linux with the MRG kernel, connected back-to-back with a GbE cable. Then I wrote a user-mode test program to send an Ethernet packet of the appropriate type via a raw socket. I now had a way to send a packet from one machine to another. I set up Wireshark to monitor the traffic between the two machines
Initially, I attempted to modify the device-independent portion of the Ethernet stack within the Linux kernel to detect the incoming packet and respond to it. Unfortunately, I had to discard this approach when I found it could not meet the customer’s latency requirement.
So I experimented with modifying the Ethernet drivers for various GbE PCIe Ethernet cards in the MRG kernel. The basic approach I took was to invoke the outgoing packet interrupt service routine from the incoming packet interrupt service routine. The driver has to be in just the right state for this to work correctly, so it required some study of the driver source code. Fortunately most Linux Ethernet drivers share some common framework, so it did not require a complete re-engineering effort to try different cards, but there are definitely some differences.
I initially started with a Realtek card and driver, and had some success. I installed the kernel driver on both machines, so that once I sent a single packet from one machine to another via my user-space program, the two machines would continuously exchange Ethernet frames with each other at the maximum speed achievable.
This made measurement challenging because my modifications to the driver were at a low enough level that the outgoing packets did not go through the device-independent Ethernet stack in Linux, and thus were not captured by Wireshark; and additionally, Wireshark was too slow to keep up and often dropped packets. I was, however, able to use tcpdump to capture packet headers to a file and then look at them later with Wireshark to make measurements. The latency could be estimated by looking at the inter-packet timing from packets sent by the other machine and dividing by two. Using tcpdump mostly got rid of most of the packet losses, but it would still sometimes drop some. I resorted to putting a serial number in each packet which I could then examine to determine if I had lost a packet or not.
I also tried inserting a GbE switch between the two machines and monitoring packets from a third machine. However, I found the switch introduced 20 μsec of additional latency on average, and greatly increased the deviation of the measurements. And still, sometimes packets were lost. So I discarded this approach and used tcpdump running on one of the machines being tested.
I was unable to reliably meet the 74.4 μsec spec under load with the Realtek card, so the next card I tried was an Intel GbE card. This card used the Intel e1000e driver, which is supported directly by Intel rather than the reverse-engineered driver the Realtek card used. Unfortunately I was unable to find a card supported by the version of the e1000e driver contained in the MRG kernel, so I downloaded the latest Intel driver. This driver was significantly more complex than the MRG version of the driver, and I made some measurements.
Finally, I modified the Broadcom bnx2 driver in the MRG kernel. This driver is directly supported by Broadcom. I was able to use the MRG kernel driver with my card, and so I made appropriate modifications.
I found this card, like the Intel card, was also able to keep up with the desired data rate. The Broadcom card ended up having slightly lower latency measurements than the Intel card. I suspect this is due to the MRG kernel driver for bnx2 having less locking overhead and shorter code paths than the stock Intel e1000e driver.
In the end, with the Broadcom bnx2 driver with my modifications, we achieved an average latency measurement of 58 μsec, which was comfortably under the 74.4 μsec requirement. Additionally, we tested continuously over a several day period, monitoring for missing packet serial numbers, and none were detected.
The customer’s initial protocol required sending several smaller packets in either direction, but ultimately an additional speedup could have been realized by using jumbo frames. RTAI/RTnet could also have been used, but I do not think it would have been significantly faster, although it may have reduced the variability of the latency.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Creating Single Frame Movies
My camera (an Olympus SP-570UZ) allows me to optionally record a four-second audio clip with each photo I take. I haven’t used this feature much, because I typically upload my photos to Flickr, and there’s been no good way to associate the audio with the video. Ideally, I would like an audio player to appear below the photo, but there aren’t really any public audio sharing websites with much longevity. And, in any case, Flickr won’t allow me to embed an audio player in my photo description.
Recently, it occurred to me that since Flickr allows short movies (up to 1:30 long), maybe I could create a single-frame movie with the still picture as the frame and the audio as the sound track. Then the Flickr movie player would serve as the control for the audio, and the audio and the video would stay associated with each other.
I decided to try to use ffmpeg to create the movie, since it seems to be able to do almost anything with video and audio. The command line for ffmpeg is a bit obscure, so this blog post documents about two hours of my time spent getting it to work.
My camera produces 3648×2736 JPEG images, and the audio files are 8 kHz sample rate, mono, 8 bit unsigned PCM samples in WAV file format. I decided my goal would be to create a motion JPEG (MJPEG) encoded AVI file with maximum quality.
I started by searching the web to see if anyone had done this before. By studying those examples and experimenting, I came up with the following ffmpeg command line:
ffmpeg.exe -loop_input -shortest -f image2 -r 0.25 -i P910033.jpg -i P910033.wav -vcodec mjpeg -qscale 1 -t 4 foo.avi
Most of my attempts caused ffmpeg to hang. But eventually, I got the error message below:
Duration: 00:00:04.00, start: 0.000000, bitrate: N/A
Stream #0.0: Video: mjpeg, yuvj422p, 3648x2736, 0.25 tbr, 0.25 tbn, 0.25 tbc
[wav @ 01a80050]Estimating duration from bitrate, this may be inaccurate
Input #1, wav, from 'P6060033.wav':
Duration: 00:00:04.02, bitrate: 64 kb/s
Stream #1.0: Audio: pcm_u8, 8000 Hz, 1 channels, u8, 64 kb/s
[mp2 @ 01ac6310]Sampling rate 8000 is not allowed in mp2
Output #0, avi, to 'foo.avi':
Stream #0.0: Video: mjpeg, yuvj422p, 3648x2736, q=2-31, 200 kb/s, 90k tbn, 0
.25 tbc
Stream #0.1: Audio: mp2, 8000 Hz, 1 channels, s16, 64 kb/s
Stream mapping:
Stream #0.0 -> #0.0
Stream #1.0 -> #0.1
Error while opening encoder for output stream #0.1 - maybe incorrect parameters such as bit_rate, rate, width or height
At last I understood the problem: ffmpeg needs the audio sampled at some rate other than 8 kHz. So I decided to use Audacity, another open source application, to upsample the sound. However, now Audacity was unhappy with this audio format.
So I used Project->Import Raw Data, and selected my WAV file. I set up the import with the following parameters:
I knew this would work, because the WAV file format consists of a header, followed by PCM data, in this case 8 kHz unsigned samples. So the result in the audio editor would be an audio file with the WAV header as a noisy sound at the start, followed by the data I wanted. The selected (darker) portion of the WAV file below is the header. I used Edit->Cut to remove it.
Finally, I tried to save the audio at a different sample rate. The audio file has a pulldown menu that lets you change the sample rate, but it doesn’t do what I wanted—what it does is play the audio file back at a different rate with aliasing.
Instead, after consulting the Audacity documentation, I discovered you use the menu in at the lower left corner of the main Audacity window to set the sample rate.
Change this to 48000, and choose File->Export as WAV to save at the new sample rate. I re-ran ffmpeg, and the resulting AVI file would play in QuickTime and VLC player (although VLC crashes afterwards), but it would not work in Windows Media Player (audio played, no video), divx, realplayer, or Flickr. So, I decided to try encoding to mp4 instead with the following command:
ffmpeg.exe -loop_input -shortest -f image2 -r 0.25 -i P910033.jpg -i P910033.wav bar.mp4
The resulting mp4 file plays in all the media players (although, again, VLC crashes after playing it), and Flickr can read it successfully as well. Here is what it looks like on Flickr:
Using size as a proxy for quality, however, the encoded video is much smaller than the input JPEG file. Can someone suggest additional flags to ffmpeg to improve the encoding quality?
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Creating the Orton Effect in Gimp
Recently I decided to learn how to write scripts in the Gimp image editing program to automate certain tasks. The first task I wanted to automate was the Orton effect. This is an effect invented by Michael Orton in the 1990’s, which consists of taking two copies of an image, one blurred, and one sharp, and mixing them to produce an image with a dreamy quality. It is especially well suited to landscape and flower photography.
The Orton effect was originally achieved by taking two photos: a well-focused image that was overexposed by two stops, and an out-of-focus image of the same scene that was overexposed by one stop. These were then printed as slides and sandwiched together to produce the final image.
With digital photography, one way to achieve this effect is to shoot a single raw image of a scene. The raw image can be developed to two JPEGs, one at +1 EV (Exposure Value), and the other at +2. My script blurs the +1 EV image with a two dimensional Gaussian filter with a standard deviation of 40 pixels, loads the second +2 EV image, sharpens it with an unsharp mask, and then overlays the two images. There are a variety of ways the images can be overlaid, but I prefer to multiply them, which enhances the color saturation in light areas. This is done by the Gimp by calculating (blur layer × sharp layer) / 255, which results in the image darkening, and an increase in color saturation.
My Gimp script to do this is available on the Gimp plugin registry.
The soft focus of the colors and the sharpness of the image got me thinking: Is the Orton effect really equivalent to heavily subsampling the chroma channels of the image, and sharpening the luma channel? JPEG and MPEG compression both make use of the fact that the human eye is not as sensitive to chroma (color) as it is to brightness (luma). Typically, both still and video compression uses 4:2:0 chroma subsampling to reduce the number of bits used to represent color information in compressed images without a perceptible quality difference to the human visual system.
I decided to test my theory. It turns out the Gimp has the ability to decompose an image into its YCbCr luma and chroma components used in the JPEG and MPEG compression process.
![]() Y |
![]() Cb |
![]() Cr |
Once I had the image split into its separate components, the Gimp allowed me to apply my Gaussian filter to just the Cb and Cr components, and then regenerate a new color image from the components.
Unfortunately, as you can see, the image is nothing like the image that underwent Orton processing—my intuition was wrong. However, it is interesting to see just how much one can low-pass filter an image without a huge impact on the image. I increased the standard deviation of my Gaussian filter from 40 pixels to 100 with the following result—the image is still recognizable and doesn’t look too bad, although the color bleeds outside the lines. It’s interesting to note that the resulting JPEG is also smaller because the low-pass filtered chroma information is easier to compress.
Additionally, it is interesting to see what happens if we decompose our squid into RGB components instead of YCbCr and filter two of them with a 100-point deviation Gaussian filter.
Yuck. We can clearly see the advantage of chroma subsampling here over RGB subsampling.
World’s Smallest h.264 Encoder
Recently I have been studying the h.264 video codec and reading the ISO spec. h.264 a much more sophisticated codec than MPEG-2, which means that a well-implemented h.264 encoder has more compression tools at its disposal than the equivalent MPEG-2 encoder. But all that sophistication comes at a price: h.264 also has a big, complicated specification with a plethora of options, many of which are not commonly used, and it takes expertise to understand which parts are important to solve a given problem.
As a bit of a parlor trick, I decided to write the simplest possible h.264 encoder. I was able to do it in about 30 lines of code—although truth in advertising compels me to admit that it doesn’t actually compress the video at all!
While I don’t want to balloon this blog post with a detailed description of h.264, a little background is in order. An h.264 stream contains the encoded video data along with various parameters needed by a decoder in order to decode the video data. To structure this data, the bitstream consists of a sequence of Network Abstraction Layer (NAL) units.
Previous MPEG specifications allowed pictures to be coded as I-frames, P-frames, or B-frames. h.264 is more complex and wonderful. It allows individual frames to be coded as multiple slices, each of which can be of type I, P, or B, or even more esoteric types. This feature can be used in creative ways to achieve different video coding goals. In our encoder we will use one slice per frame for simplicity, and we will use all I-frames.
As with previous MPEG specifications, in h.264 each slice consists of one or more 16×16 macroblocks. Each macroblock in our 4:2:0 sampling scheme contains 16×16 luma samples, and two 8×8 blocks of chroma samples. For this simple encoder, I won’t be compressing the video data at all, so the samples will be directly copied into the h.264 output.
With that background in mind, for our simplest possible encoder, there are three NALs we have to emit:
- Sequence Parameter Set (SPS): Once per stream
- Picture Parameter Set (PPS): Once per stream
- Slice Header: Once per video frame
- Slice Header information
- Macroblock Header: Once per macroblock
- Coded Macroblock Data: The actual coded video for the macroblock
Since the SPS, the PPS, and the slice header are static for this application, I was able to hand-code them and include them in my encoder as a sequence of magic bits.
Putting it all together, I came up with the following code for what I call “hello264”:
#include <stdio.h>
#include <stdlib.h>
/* SQCIF */
#define LUMA_WIDTH 128
#define LUMA_HEIGHT 96
#define CHROMA_WIDTH LUMA_WIDTH / 2
#define CHROMA_HEIGHT LUMA_HEIGHT / 2
/* YUV planar data, as written by ffmpeg */
typedef struct
{
uint8_t Y[LUMA_HEIGHT][LUMA_WIDTH];
uint8_t Cb[CHROMA_HEIGHT][CHROMA_WIDTH];
uint8_t Cr[CHROMA_HEIGHT][CHROMA_WIDTH];
} __attribute__((__packed__)) frame_t;
frame_t frame;
/* H.264 bitstreams */
const uint8_t sps[] = { 0x00, 0x00, 0x00, 0x01, 0x67, 0x42, 0x00,
0x0a, 0xf8, 0x41, 0xa2 };
const uint8_t pps[] = { 0x00, 0x00, 0x00, 0x01, 0x68, 0xce,
0x38, 0x80 };
const uint8_t slice_header[] = { 0x00, 0x00, 0x00, 0x01, 0x05, 0x88,
0x84, 0x21, 0xa0 };
const uint8_t macroblock_header[] = { 0x0d, 0x00 };
/* Write a macroblock's worth of YUV data in I_PCM mode */
void macroblock(const int i, const int j)
{
int x, y;
if (! ((i == 0) && (j == 0)))
{
fwrite(¯oblock_header, 1, sizeof(macroblock_header),
stdout);
}
for(x = i*16; x < (i+1)*16; x++)
for (y = j*16; y < (j+1)*16; y++)
fwrite(&frame.Y[x][y], 1, 1, stdout);
for (x = i*8; x < (i+1)*8; x++)
for (y = j*8; y < (j+1)*8; y++)
fwrite(&frame.Cb[x][y], 1, 1, stdout);
for (x = i*8; x < (i+1)*8; x++)
for (y = j*8; y < (j+1)*8; y++)
fwrite(&frame.Cr[x][y], 1, 1, stdout);
}
/* Write out PPS, SPS, and loop over input, writing out I slices */
int main(int argc, char **argv)
{
int i, j;
fwrite(sps, 1, sizeof(sps), stdout);
fwrite(pps, 1, sizeof(pps), stdout);
while (! feof(stdin))
{
fread(&frame, 1, sizeof(frame), stdin);
fwrite(slice_header, 1, sizeof(slice_header), stdout);
for (i = 0; i < LUMA_HEIGHT/16 ; i++)
for (j = 0; j < LUMA_WIDTH/16; j++)
macroblock(i, j);
fputc(0x80, stdout); /* slice stop bit */
}
return 0;
}
(This source code is available as a single file here.)
In main(), the encoder writes out the SPS and PPS. Then it reads YUV data from standard input, stores it in a frame buffer, and then writes out a h.264 slice header. It then loops over each macroblock in the frame and calls the macroblock() function to output a macroblock header indicating the macroblock is coded as I_PCM, and inserts the YUV data.
To use the code, you will need some uncompressed video. To generate this, I used the ffmpeg package to convert a QuickTime movie from my Kodak Zi8 video camera from h.264 to SQCIF (128×96) planar YUV format sampled at 4:2:0:
ffmpeg.exe -i angel.mov -s sqcif -pix_fmt yuv420p angel.yuv
I compile the h.264 encoder:
gcc –Wall –ansi hello264.c –o hello264
And run it:
hello264 <angel.yuv >angel.264
Finally, I use ffmpeg to copy the raw h.264 NAL units into an MP4 file:
ffmpeg.exe -f h264 -i angel.264 -vcodec copy angel.mp4
Here is the resulting output:
There you have it—a complete h.264 encoder that uses minimal CPU cycles, with output larger than its input!
The next thing to add to this encoder would be CAVLC coding of macroblocks and intra prediction. The encoder would still be lossless at this point, but there would start to be compression of data. After that, the next logical step would be quantization to allow lossy compression, and then I would add P slices. As a development methodology, I prefer to bring up a simplistic version of an application, get it running, and then add refinements iteratively.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Writing Defensive Code to Automatically Find Memory Leaks
In an earlier post, I suggested making all your memory allocations go through a single routine, and deletions through another. When you centralize allocation and deallocation like this, you gain a couple of benefits.
- First of all, you make the memory allocation more explicit, which will tend to make programmers more careful.
- Second, you can write “defensive code” to automatically find bugs during the development process.
I first thought hard about the concept of “defensive code” when I read the book Writing Solid Code by Steve Maguire. This book is now slightly dated, but the general concepts in it are excellent. Defensive code refers to the concept that, especially in debug code used during development, you should have validation routines that are called automatically that will call your attention to latent bugs such as memory leaks.
Once you’ve got your memory allocation and deallocations routines centralized, there are various ways you can instrument these routines during development so your code can tell you about memory allocation problems as you’re developing it.
The easiest thing to do is to keep counts of how many chunks of memory have been allocated and deleted, and monitor it over time to see if the number of currently allocated chunks matches your expectation. This is lightweight enough to implement even in the most constrained environments.
A variation is to keep a copy of the pointer in the allocation routine in some kind of registry data structure. Then you can write a routine to print out the currently allocated items. This is especially helpful if a type of item is leaked only occasionally, as it lets you see exactly what items have leaked. (I have also implemented this in conjunction with Boost smart pointers, but be aware this keeps the reference count at 1, so you have to also write code to release this pointer when the reference is 1 in your printout routine or the items are never deallocated.)
In many embedded environments, you may have the source code to the C library, or even the entire operating system available. If something like valgrind is too resource intensive or unavailable for your platform, you can look at some of the malloc debugging packages available on the net. If none meets your needs or is unavailable on your platform, you can simply modify malloc() and free() to record operations. You will want to log the address and size of each allocation and deallocation to a file or trace facility. You will also find it useful to record the addresses on the call stack, such as with GCC’s __builtin_return_address() facility. You can write a program to examine the saved allocation data and delete all the records that have matching allocation / deallocations. The results left over are malloc() calls with no matching free() calls, and double calls to free(). You can then match up the calling addresses with a symbol table, the output of nm, or the gdb list *address command to find out where these calls were made in your program.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includesembedded software, scientific software and enterprise software development environments.
Tools to Help Find Memory Leaks
Previously, I wrote about my “golden rule” for reducing the prevalence of memory leaks in your code. One other easy way to prevent memory leaks is to actively seek them out during the development process, instead of waiting around for them to be reported to you as bugs.
One good way to do this is to add some time to your schedule so that you can apply the available tools for finding memory leaks and other problems. Two that I have used and found valuable are Valgrind and Coverity.
Valgrind performs many checks on your code, but in my opinion the most useful ones are related to pointer usage and memory allocation.
Valgrind analyzes your program at runtime, and requires significant resources. It may be unsuitable for some embedded applications for that reason, but if you can sufficiently beef up a development system’s memory and storage to run it, or execute your code in another more resource-rich environment, you will find it useful.
The weakness of Valgrind is that since it runs at runtime, you have to be able to exercise many code paths to ensure that you are leak free. I have found that even if the usual flow of execution for an application is leak-free, it is likely that memory leaks occur in error paths.
Coverity, on the other hand, performs static analysis on your source code, as described in this paper. So it covers all your error paths. Coverity costs money, but in my opinion it is well worth it. While it can provide false positives in some cases, the value is in the actual problems it uncovers that are difficult to find with a code review. It is worth examining every item it identifies as being a potential problem. (Incidentally, I recommend this recent article, written by the founders of Coverity, which touches on the organizational difficulties in implementing any automated bug-finding tool.)
Another way to solve the memory leak issue is to replace the standard C++ pointer mechanism with a more advanced replacement. I have used both the Boehm Garbage Collector (GC) for C/C++ and the Boost library smart pointers for this. Since so much of the C library and other libraries you might be using depend on standard pointers, this is obviously not a panacea – there will always be places in your code where you will have to use a standard pointer or go around the constraints of the GC you are using. GC based approaches may not be appropriate for resource constrained environments or those where the GC’s requirement that it collect garbage might interfere with timing.
The smart pointers in the Boost library offer reference counting. While in theory reference counting requires more cycles of overhead than a GC based approach, it does offer a predictable use of CPU. Also, the freedom to create C++ objects that are deleted when the last reference to them goes away can allow much simpler designs in some cases, especially when an object is referenced by multiple threads. But be sure to configure the smart pointers to be threadsafe in this case.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Ben’s Golden Rule for Preventing Memory Leaks
As an embedded software engineer, I spent the majority of my time writing code in C and C++. One somewhat-justified knock on these languages is that they both force the programmer to sweat all the details involved with memory management.
One particular bugaboo of the C/C++ engineer is the memory leak: Memory that is dynamically allocated but, because of a programming error, isn’t returned to the system when it is no longer used. On an embedded device, leaked memory will grow over time, consuming more and more of the available resources until there is no choice but to reboot the device. And that’s a Really Bad Thing.
I’m not going to tempt fate by declaring that my code is impervious to memory leaks. Far from it! However, I have developed an approach to dynamically allocated memory that I like to see all my teams follow. With this approach, memory leaks tend to happen much less frequently.
It really boils down to a single, simple idea, which I immodestly call Ben’s Golden Rule for Preventing Memory Leaks:
A pointer variable is either referencing dynamically allocated memory, or it is NULL.
One of the nice things about setting a pointer to NULL is that, if you happen to dereference it by mistake, your program will immediately crash. This is a good thing, because hopefully it means the bug will be rather obvious and thus caught early in the testing cycle.
Here are some simple steps to implement Ben’s golden rule:
- Initialize all pointers to NULL. Don’t ever let a pointer hold random junk. Initialize it to NULL even if you are going to assign it in the very next line of code:
SomeComplexObject *my_object = NULL;
my_object = new SomeComplexObject();
- Reset pointers to NULL immediately upon freeing them. Even at the bottom of a function where your variable is about to go out of scope, you should pedantically reset it to NULL after a call to
free()ordelete. This way, if someone later refactors this code, or extends it, the likelihood of a bug creeping in is much lower.delete my_object;
my_object = NULL;
- Before assigning anything to a pointer, check to see if it is NULL unless the immediate line of code above the assignment sets it to NULL:
my_structure_t *a_struct = NULL;
// ... lots of stuff here, including possibly code that allocates a_struct
if (a_struct) {free(a_struct);
a_struct = NULL;
}
// now we know it’s NULL
a_struct = (my_structure_t *) malloc (sizeof(my_structure_t));
In some cases, tools like Coverity may complain about redundant checks for NULL. So you will have to be flexible here; if your coding standard (or your customer!) demands that your code be free of such warnings from Coverity, you will have to remove some of the checks. However, be aware that bugs may creep in during maintenance or refactoring.
- Be aware that several C library routines may perform “surprise allocations” by calling
malloc()internally. This depends on your C library, but typical calls include things such asstrdupandvasprintf. If you use such functions in your code, you will need to audit all code paths that use them to ensure that memory is freed in each path. - Don’t mix references to dynamically-allocated memory with references to other types of memory within the same pointer. An example of what NOT to do comes from some UI code I found, which I have obfuscated to protect the guilty party.
char *search_key = NULL;
// If MyCheckBox is selected, copy the search_key from the
// MyTextEntry widget; otherwise, set it to "not_used"
if (gtk_toggle_button_get_active(MyCheckBox))
search_key = strdup(gtk_entry_get_text(MyTextEntry));
else
search_key = "not_used";
// Ugh! At this point, the rule about whether or not you should
// call free(search_key) are confusing. Don’t do this!
// ... subsequent code does something with search_key ...
free(search_key); // note, this is a bug half the time
search_key = NULL;
- Initialize C++ member variables that are pointers to NULL in your constructor, and in your destructor, release all member-variable pointers, and then immediately set them to NULL.
Storage::Storage :
mPointerToData(NULL)
{}
Storage::~Storage()
{if (mPointerToData != NULL)
delete mPointerToData;
mPointerToData = NULL;
}
- In C++, use
std::stringinstead ofchar *. Thestd::stringclass handles all memory management internally, and it’s fast and well-optimized. - Make sure you match up your
[]operators.One odd bug unique to C++ programs can cause mysterious memory leaks. It is easy to allocate an array of items withnew [], and then usedelete, rather thandelete []to free the memory. Any time you allocate an array withnew [], you must audit your code to ensure that the deletion is always done withdelete [], rather than justdelete. The compiler will usually let you get away with it, but at runtime you will see memory leaks and allocation arena corruption.
Finally, another helpful design tool for avoiding memory leaks is to very clearly state in the design and code comments where specific items are created, and where they are deleted. Consider making the allocations all go through a single routine, and deletions through another. In C, these can be placed in a single source file, and in C++, you can make them methods of the same object. I’ll have more to say about this in a future post.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Detecting Well-Focused Images
Recently, one of my colleagues mentioned to me that he takes large numbers of pictures and wanted to write a program to automatically determine which was in the best focus, out of a group of pictures that were taken of the same scene. He mentioned that he expected the algorithm to be computationally intensive.
My initial reaction was “it can’t be that hard,” since my digital camera does it, and it has a small, low power processor in it. So I checked out the Wikipedia article on autofocus. Two general algorithms are described, active autofocus and passive autofocus. Since I’m examining images that have already been taken, active autofocus is not an option. Within the category of passive autofocus algorithms, Wikipedia describes two methods, phase detection and contrast measurement. Again, since my images have already been captured, I will use contrast measurement. Since I don’t have a DSLR, this is the method my cameras actually use to autofocus, so that’s kind of cool, too.
So what does an autofocus system using contrast measurement consist of? I imagine a lens system, a sensor, and a processor that reads the sensor which can then adjust either the positions of the optics in the lens system or the position of the sensor relative to the lens system. When the user wants to take a picture, the processor will read the sensor, make a contrast measurement, and repeat this process, sweeping the variable position part of the system through its range. Then it will select the position that gave the best contrast measurement. This matches with my experience of the view through my camera’s viewfinder, especially in low-light conditions: I see the autofocus sweep through a range until the image is sharp, and then the camera takes the picture.
Thinking about this, I imagine a picture with many black and white stripes—if the stripes are in perfect focus, there will be many large transitions in brightness (high contrast), and if the stripes are blurred, then we would expect the transitions in brightness to be less dramatic. So I made some test 100 × 100 pixel test images with horizontal and vertical lines in the GIMP to test my algorithm on.
![]() |
![]() |
| horiz.jpg | vert.jpg |
![]() |
![]() |
| horizblur.jpg | vertblur.jpg |
The blurred images are the result of running Filters->Blur->Blur in the GIMP over the original images.
So how to measure the contrast in a picture in a way that works quickly on a tiny low-power processor? The simplest method that came to mind was to sum the differences in pixel values across the image. This is a nice O(N) algorithm. Other methods such as computing FFTs and looking at the high order bins are more computationally intensive, and they don’t seem likely to be used in a consumer grade camera.
Ideally, we’d want to look at the difference in luminance in each pixel. Conveniently, during the JPEG compression process, you normally need to perform the linear algebra to convert the R, G, and B planes into Y (or luminance), Cr and Cb (chrominance) planes. So if we inserted the autofocus algorithm after this conversion in our low-powered camera, we could run it on just the luma plane. In addition to being less computationally-intensive, that might be more accurate, because the human eye has different sensitivities to different wavelengths of light, and using the luminance values would come closer to what we really want.
However, my algorithm operates in the RGB space, because that’s what my tools support easily. I decided to use GNU Octave, a free clone of MATLAB, to develop my algorithm. While I am mostly a C/C++/Java programmer, I sometimes use Octave to prototype algorithms because it is so concise and doesn’t require recompilation. In this case, I want to concentrate on contrast detection, and not the mechanics of reading and decoding JPEG files. We sometimes get algorithms from customers expressed as MATLAB or Octave code, and translate them to other languages for efficiency’s sake—for instance, we recently took some MATLAB code and implemented it in CUDA so that it would run at 24 frames per second. So here’s my contrast measurement program in Octave:
img=imread(“test.jpg”);
c=0;
for i=img
for j=1:rows(i)-1
c+=uint64(abs(i(j)-i(j+1)));
endfor
endfor
c
As you can see, Octave let me express this algorithm very concisely. First, I read in the image into the variable img. For an M × N image, this results in a 3 dimensional array of M × N × 3 RGB pixel values in the range 0-255.
I then iterate over the image, and each time i contains a vector of pixel values. Octave iteration and subscripting is a bit mysterious to me, and I always have to read the manual, but this is what it does in this case. It turns out that each of these vectors is a column vector of the image.
Then, for each of these column vectors, I accumulate the sum of the absolute value of the first differences in pixel values.
At the very end, I print out the value accumulated over the whole image. I should be able to use this program to evaluate the focus of two images, where the one with the larger result at the end should be more in focus (because it has higher contrast).
My first attempt at this program left out the uint64(), and I got results of 255 for the first images I tried. It turns out that Octave is doing saturation arithmetic, and so I need to cast the values to a wider integer. My camera takes full resolution images at 3648 × 2736, and so if we look at the worst case of a black and white checkerboard, we would accumulate a value of 3648 × 2736 × 3 × 255 = 0×1C71B1C00—a bit too large to hold in 32 bits. If we computed a combined brightness value for each pixel, we could use a 32 bit accumulator instead.
Running it over my test images above, I got the following results:
| Image | Contrast Measurement |
| horiz.jpg | 1248200 |
| horizblur.jpg | 407800 |
| vert.jpg | 0 |
| vertblur.jpg | 0 |
The horizontal stripe image works as one would expect, the in-focus image generating a higher metric than the blurred one. But what’s up with the vertical stripes? Well, it reminded me that my camera’s manual has an interesting section called “subjects that are difficult to focus on.” One of the examples given is “Subject without vertical lines.” I never understood this—until now. My camera must be looking at contrast across rows, so it has the same behavior of my Octave code, albeit in a different direction. This is further confirmation that my digital camera actually uses something very similar to this algorithm to autofocus. Let’s try the program on a real image, and a blurred copy:
![]() |
![]() |
| tree.jpg | treeblur.jpg |
Here are the results:
| Image | Contrast Measurement |
| tree.jpg | 880167 |
| treeblur.jpg | 338791 |
As we expect, the unblurred image gives a higher contrast measurement, so the algorithm is detecting that it is more in focus.
Finally, we should note that in actual digital cameras, depth of field is variable, and thus we might have to focus on just a particular portion of the frame, the rest of the frame might be intentionally be out of focus, in order to draw the viewer’s attention to the subject.

This would involve just running the algorithm over a subset of the image, such as the center of the frame. Many cameras also offer the option of automatically detecting human faces and autofocusing there. Wikipedia has an article on algorithms for detecting human faces in an image as well—but maybe that’s a topic for the future.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Uploading Kodak Zi8 Videos to Flickr
Recently my mom bought me a Kodak Zi8 pocket HD video camera for my birthday. Thanks, Mom! You know what an engineer likes! I love photography, and I upload my photos to the Flickr photo sharing site. But I think my mom wanted some more home movies of my daughter.

The first day I had it, I took some videos, and went to share them on Flickr. Unfortunately, the videos did not properly transcode to the Flickr Flash-based format, and Flickr displayed the following:

I checked out Flickr’s video uploading FAQ and discovered that the Kodak Zi6 and Zi8 are the only two unsupported cameras of all the video cameras in the universe. Great, so I have this fun new camera, and it won’t work with my favorite sharing site. The Flickr staff hasn’t come up with a solution in over a year. But I don’t give up easy.
I talked to my partner Howdy, and thought about the possible things that might cause problems with Flickr’s transcoding process. The Zi8 produces MOV files, and within the MOV file there is an audio stream and a video stream, each encoded with some codec. The problem could lie with the outer system level MOV container, or one or both of the codecs. I decided to look inside the files and see what codecs were in use. Howdy suggested using tools from mpeg4ip for this, but I decided to be lazy and use the VLC media player, as I already had it installed on my computer.
I opened up one of the movies recorded with the camera, and went to the Tools->Codec Information window in VLC. This showed me that the video codec was avc1 and the audio was mp4a.

Now, “avc1” is just h.264 by another name, and “mp4a” is AAC—standard MPEG-4 audio. Nothing looks too unusual. Well, MOV files are an Apple pseudo-standard that has evolved over time, so maybe the transcoder used by Flickr doesn’t like something about the MOV files produced by the Zi8.
Fortunately, I know I can convert from a MOV format container to a MPEG-4 container without transcoding either the video or the audio. Transcoding normally involves decoding and then encoding, so you want to avoid it for two reasons: One, it introduces an additional generation of lossy compression, so quality will suffer. And two, it’s computationally intensive, so therefore slow.
At the system layer, the MPEG-4 file format, formally specified in ISO/IEC 14496-14, is based on work originally done at Apple. As a result, the MOV and MP4 formats are very similar.
I decided to try ffmpeg, the open-source Swiss Army knife of video conversions. We have used ffmpeg on a number of different projects here at Cardinal Peak, and we’ve found it to be very reliable. It is available for Windows, Mac OS/X, and Linux. Because of its flexibility and its command line interface, it can be a little tricky to use, but I figured out after some experimentation that I could change the MOV file to an MP4 file with the following command:
ffmpeg -i input.MOV -f mp4 -vcodec copy -acodec copy output.mp4
In this command line, input.MOV is the file from the Zi8 or Zi6 (e.g., 115_0164.MOV), and output.mp4 is the name of the output file. The -f mp4 tells ffmpeg to convert the input to mp4 format, and the -acodec copy and -vcodec copy tell it to copy the audio and video data from the MOV container to the mp4 container without transcoding it.
The command executes quickly, because the audio and video formats are not changed. I tried uploading the resulting .mp4 files to Flickr, and they work fine! I’m happy because I can use my new video camera with my favorite sharing site, and hopefully help others out.

UPDATE 11/12/2009: It looks like Flickr might now be accepting the raw MOV files from the Zi8 directly. They have removed a previous caveat from their video uploading FAQ linked above. Please leave a comment if you have more information.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.




















