Archive for the ‘Software Development’ Category
High-School Prior Art
Based on a Slashdot link, I read a blog post today about the Oracle/Google lawsuit over Java virtual machine patents.
Among the list of software patents that Oracle has decided to sue Google with are two that I’m fairly certain that I wrote prior art for. This was in 1983 or 1984, while I was a high school student. This work was published in a book by the University of Oklahoma press, and I believe I still have a copy somewhere.
It seems that the bar for patents is not very high if patents are granted for something a high school student could come up with a decade earlier!
In ’83 or ’84, I wrote a hybrid interpreter/compiler for a Forth-like language.
I had read the book Threaded Interpretive Languages: Their Design and Implementation and decided to write such a language for my TRS-80 model 1. I gave it the snappy name, TIL16U, which stood for Threaded Interpretive Language, 16 bit unsigned ints. The only numeric datatype was the 16-bit unsigned integer, which also nicely represents a pointer datatype on the Z-80 processor.
I took a slightly different approach to writing the language than the book described. Initially I wrote a simple single-pass compiler from the source language to a simple bytecode in TRS-80 Level II BASIC.
Once I could generate bytecode from ASCII source, I wrote an interpreter for it in Z-80 assembly. The interpreter for such a simple stack based language is very small, and it was dwarfed by the inclusion of a text editor (incorporated from an article in 80 Microcomputing with a source listing) and cassette I/O routines to load/save my programs off to tape. I used the initial compiler written in BASIC to compile the runtime library for my interpreter.
My library data structure consisted of a routine name stored as an ASCII string, a length, a flag to indicate whether a given routine was implemented in Z-80 machine code or in bytecode, and then the routine body. While interpreting, the TIL16U interpreter would look at the bytecode for a call routine, and following this would be a routine name. The interpreter would look up the routine name in the library, and either jump to it directly (if it was in machine code), or continue interpreting at the start of the routine (if it was in bytecode).
At the time, I was very interested in writing videogames—my parents wouldn’t allow me to have an Atari 2600—and the 1.77 Mhz processor on the TRS-80 was not particularly powerful. So the lookup step bothered me performance-wise. I knew I could implement a symbol table and a second pass in the compiler to patch in the addresses of routines, but this made the language less dynamic. I wanted to be able to type in code interactively and have it work, and it was also a hassle to store addresses in the code because whenever I built a new version of the interpreter or the runtime library they might have to change.
So I hit upon the expedient of making the interpreter selectively compile bits of the running program to machine code as it went and overwriting the bytecode with machine code that could be directly jumped to, and pushing the address of the interpreter on the stack so that when a Z-80 RET instruction executed, the interpreter continued executing. I thought it was cool that the more you ran a program, the faster it would go.
In particular, the bytecode to invoke a library routine could be replaced by a three-byte Z-80 CALL instruction (a one byte opcode plus two bytes of address).
So when the interpreter found such a bytecode sequence, rather than execute it, it would look up the symbol address, overwrite it with the machine instruction, PUSH the return address of the interpreter on the stack and then jump to it. In the future, the interpreter would look at the opcode byte and know just to jump directly to the code sequence. (As an aside, it is interesting how much of the Z-80 instruction set is still burned into my brain. I can still remember an unconditional jump, or JP, is 0xC3, and the RET instruction is 0xC9, thanks to writing this code.)
I believe the symbolic routine name to numeric address is the mechanism described in one of the patents in the Oracle lawsuit, granted in 1992, and last updated on April 29, 2003.
Additionally, this patent, also referenced in the lawsuit, covers overwriting virtual machine instructions with native code and executing those instead of the bytecode. This patent was originally filed in 1997, and was last updated in 2005.
I’m not claiming that I have perfect prior art for these two patents—I haven’t even studied them very closely—but I believe the techniques I used are at least highly similar to what is patented. Only a patent court can decide if something infringes, anyway.
I’d like to claim this was entirely because of my brilliance, but I think it is more likely that the US Patent Office has been granting software patents that are so obvious that a reasonably bright teenager with a computer can come up with them.
To finish the story, the most successful application of TIL16U did actually turn out to be video games. My parents bought me a ChromaTRS color video board for my TRS-80 (“with 15 vivid colors!”), and I wrote several games for it, one of which I remember was a parachute jump game, where you jumped out of a plane and tried to land on a target. The plane speed and wind speed would vary, so you had to judge how the windsock was looking to get the high score.
Ben Mesander has more than 18 years of experience—not counting high school!—leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Real-Time Ethernet
A recent project for a customer was to implement a transport for a real-time signal processing application over Gigabit Ethernet (GbE). The project was especially interesting, because our customer’s requirement was for extremely low latency: The transport needed to be able to send replies to incoming packets with a custom EtherType within 74.4 μsec. This exceeded the capabilities of existing Ethernet protocols.
There are several approaches possible for real-time networking under Linux, including RTAI with the RTnet hard real-time network stack.
However, the customer was already using Red Hat Linux servers to perform the signal processing, so we decided to try the less intrusive Red Hat MRG kernel to see if it could be modified to incorporate the desired networking protocol. MRG incorporates many kernel patches to improve real-time performance and lower latency.
Since the timing requirements were so tight, I decided that only an in-kernel implementation was likely to succeed. If the code had to switch between kernel mode and user mode, this would require additional overhead, interactions with the Linux scheduler, and moving the data between the kernel and user address spaces.
I started development by building two server-class Linux machines running Red Hat Linux with the MRG kernel, connected back-to-back with a GbE cable. Then I wrote a user-mode test program to send an Ethernet packet of the appropriate type via a raw socket. I now had a way to send a packet from one machine to another. I set up Wireshark to monitor the traffic between the two machines
Initially, I attempted to modify the device-independent portion of the Ethernet stack within the Linux kernel to detect the incoming packet and respond to it. Unfortunately, I had to discard this approach when I found it could not meet the customer’s latency requirement.
So I experimented with modifying the Ethernet drivers for various GbE PCIe Ethernet cards in the MRG kernel. The basic approach I took was to invoke the outgoing packet interrupt service routine from the incoming packet interrupt service routine. The driver has to be in just the right state for this to work correctly, so it required some study of the driver source code. Fortunately most Linux Ethernet drivers share some common framework, so it did not require a complete re-engineering effort to try different cards, but there are definitely some differences.
I initially started with a Realtek card and driver, and had some success. I installed the kernel driver on both machines, so that once I sent a single packet from one machine to another via my user-space program, the two machines would continuously exchange Ethernet frames with each other at the maximum speed achievable.
This made measurement challenging because my modifications to the driver were at a low enough level that the outgoing packets did not go through the device-independent Ethernet stack in Linux, and thus were not captured by Wireshark; and additionally, Wireshark was too slow to keep up and often dropped packets. I was, however, able to use tcpdump to capture packet headers to a file and then look at them later with Wireshark to make measurements. The latency could be estimated by looking at the inter-packet timing from packets sent by the other machine and dividing by two. Using tcpdump mostly got rid of most of the packet losses, but it would still sometimes drop some. I resorted to putting a serial number in each packet which I could then examine to determine if I had lost a packet or not.
I also tried inserting a GbE switch between the two machines and monitoring packets from a third machine. However, I found the switch introduced 20 μsec of additional latency on average, and greatly increased the deviation of the measurements. And still, sometimes packets were lost. So I discarded this approach and used tcpdump running on one of the machines being tested.
I was unable to reliably meet the 74.4 μsec spec under load with the Realtek card, so the next card I tried was an Intel GbE card. This card used the Intel e1000e driver, which is supported directly by Intel rather than the reverse-engineered driver the Realtek card used. Unfortunately I was unable to find a card supported by the version of the e1000e driver contained in the MRG kernel, so I downloaded the latest Intel driver. This driver was significantly more complex than the MRG version of the driver, and I made some measurements.
Finally, I modified the Broadcom bnx2 driver in the MRG kernel. This driver is directly supported by Broadcom. I was able to use the MRG kernel driver with my card, and so I made appropriate modifications.
I found this card, like the Intel card, was also able to keep up with the desired data rate. The Broadcom card ended up having slightly lower latency measurements than the Intel card. I suspect this is due to the MRG kernel driver for bnx2 having less locking overhead and shorter code paths than the stock Intel e1000e driver.
In the end, with the Broadcom bnx2 driver with my modifications, we achieved an average latency measurement of 58 μsec, which was comfortably under the 74.4 μsec requirement. Additionally, we tested continuously over a several day period, monitoring for missing packet serial numbers, and none were detected.
The customer’s initial protocol required sending several smaller packets in either direction, but ultimately an additional speedup could have been realized by using jumbo frames. RTAI/RTnet could also have been used, but I do not think it would have been significantly faster, although it may have reduced the variability of the latency.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Writing Defensive Code to Automatically Find Memory Leaks
In an earlier post, I suggested making all your memory allocations go through a single routine, and deletions through another. When you centralize allocation and deallocation like this, you gain a couple of benefits.
- First of all, you make the memory allocation more explicit, which will tend to make programmers more careful.
- Second, you can write “defensive code” to automatically find bugs during the development process.
I first thought hard about the concept of “defensive code” when I read the book Writing Solid Code by Steve Maguire. This book is now slightly dated, but the general concepts in it are excellent. Defensive code refers to the concept that, especially in debug code used during development, you should have validation routines that are called automatically that will call your attention to latent bugs such as memory leaks.
Once you’ve got your memory allocation and deallocations routines centralized, there are various ways you can instrument these routines during development so your code can tell you about memory allocation problems as you’re developing it.
The easiest thing to do is to keep counts of how many chunks of memory have been allocated and deleted, and monitor it over time to see if the number of currently allocated chunks matches your expectation. This is lightweight enough to implement even in the most constrained environments.
A variation is to keep a copy of the pointer in the allocation routine in some kind of registry data structure. Then you can write a routine to print out the currently allocated items. This is especially helpful if a type of item is leaked only occasionally, as it lets you see exactly what items have leaked. (I have also implemented this in conjunction with Boost smart pointers, but be aware this keeps the reference count at 1, so you have to also write code to release this pointer when the reference is 1 in your printout routine or the items are never deallocated.)
In many embedded environments, you may have the source code to the C library, or even the entire operating system available. If something like valgrind is too resource intensive or unavailable for your platform, you can look at some of the malloc debugging packages available on the net. If none meets your needs or is unavailable on your platform, you can simply modify malloc() and free() to record operations. You will want to log the address and size of each allocation and deallocation to a file or trace facility. You will also find it useful to record the addresses on the call stack, such as with GCC’s __builtin_return_address() facility. You can write a program to examine the saved allocation data and delete all the records that have matching allocation / deallocations. The results left over are malloc() calls with no matching free() calls, and double calls to free(). You can then match up the calling addresses with a symbol table, the output of nm, or the gdb list *address command to find out where these calls were made in your program.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includesembedded software, scientific software and enterprise software development environments.
Tools to Help Find Memory Leaks
Previously, I wrote about my “golden rule” for reducing the prevalence of memory leaks in your code. One other easy way to prevent memory leaks is to actively seek them out during the development process, instead of waiting around for them to be reported to you as bugs.
One good way to do this is to add some time to your schedule so that you can apply the available tools for finding memory leaks and other problems. Two that I have used and found valuable are Valgrind and Coverity.
Valgrind performs many checks on your code, but in my opinion the most useful ones are related to pointer usage and memory allocation.
Valgrind analyzes your program at runtime, and requires significant resources. It may be unsuitable for some embedded applications for that reason, but if you can sufficiently beef up a development system’s memory and storage to run it, or execute your code in another more resource-rich environment, you will find it useful.
The weakness of Valgrind is that since it runs at runtime, you have to be able to exercise many code paths to ensure that you are leak free. I have found that even if the usual flow of execution for an application is leak-free, it is likely that memory leaks occur in error paths.
Coverity, on the other hand, performs static analysis on your source code, as described in this paper. So it covers all your error paths. Coverity costs money, but in my opinion it is well worth it. While it can provide false positives in some cases, the value is in the actual problems it uncovers that are difficult to find with a code review. It is worth examining every item it identifies as being a potential problem. (Incidentally, I recommend this recent article, written by the founders of Coverity, which touches on the organizational difficulties in implementing any automated bug-finding tool.)
Another way to solve the memory leak issue is to replace the standard C++ pointer mechanism with a more advanced replacement. I have used both the Boehm Garbage Collector (GC) for C/C++ and the Boost library smart pointers for this. Since so much of the C library and other libraries you might be using depend on standard pointers, this is obviously not a panacea – there will always be places in your code where you will have to use a standard pointer or go around the constraints of the GC you are using. GC based approaches may not be appropriate for resource constrained environments or those where the GC’s requirement that it collect garbage might interfere with timing.
The smart pointers in the Boost library offer reference counting. While in theory reference counting requires more cycles of overhead than a GC based approach, it does offer a predictable use of CPU. Also, the freedom to create C++ objects that are deleted when the last reference to them goes away can allow much simpler designs in some cases, especially when an object is referenced by multiple threads. But be sure to configure the smart pointers to be threadsafe in this case.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Ben’s Golden Rule for Preventing Memory Leaks
As an embedded software engineer, I spent the majority of my time writing code in C and C++. One somewhat-justified knock on these languages is that they both force the programmer to sweat all the details involved with memory management.
One particular bugaboo of the C/C++ engineer is the memory leak: Memory that is dynamically allocated but, because of a programming error, isn’t returned to the system when it is no longer used. On an embedded device, leaked memory will grow over time, consuming more and more of the available resources until there is no choice but to reboot the device. And that’s a Really Bad Thing.
I’m not going to tempt fate by declaring that my code is impervious to memory leaks. Far from it! However, I have developed an approach to dynamically allocated memory that I like to see all my teams follow. With this approach, memory leaks tend to happen much less frequently.
It really boils down to a single, simple idea, which I immodestly call Ben’s Golden Rule for Preventing Memory Leaks:
A pointer variable is either referencing dynamically allocated memory, or it is NULL.
One of the nice things about setting a pointer to NULL is that, if you happen to dereference it by mistake, your program will immediately crash. This is a good thing, because hopefully it means the bug will be rather obvious and thus caught early in the testing cycle.
Here are some simple steps to implement Ben’s golden rule:
- Initialize all pointers to NULL. Don’t ever let a pointer hold random junk. Initialize it to NULL even if you are going to assign it in the very next line of code:
SomeComplexObject *my_object = NULL;
my_object = new SomeComplexObject();
- Reset pointers to NULL immediately upon freeing them. Even at the bottom of a function where your variable is about to go out of scope, you should pedantically reset it to NULL after a call to
free()ordelete. This way, if someone later refactors this code, or extends it, the likelihood of a bug creeping in is much lower.delete my_object;
my_object = NULL;
- Before assigning anything to a pointer, check to see if it is NULL unless the immediate line of code above the assignment sets it to NULL:
my_structure_t *a_struct = NULL;
// ... lots of stuff here, including possibly code that allocates a_struct
if (a_struct) {free(a_struct);
a_struct = NULL;
}
// now we know it’s NULL
a_struct = (my_structure_t *) malloc (sizeof(my_structure_t));
In some cases, tools like Coverity may complain about redundant checks for NULL. So you will have to be flexible here; if your coding standard (or your customer!) demands that your code be free of such warnings from Coverity, you will have to remove some of the checks. However, be aware that bugs may creep in during maintenance or refactoring.
- Be aware that several C library routines may perform “surprise allocations” by calling
malloc()internally. This depends on your C library, but typical calls include things such asstrdupandvasprintf. If you use such functions in your code, you will need to audit all code paths that use them to ensure that memory is freed in each path. - Don’t mix references to dynamically-allocated memory with references to other types of memory within the same pointer. An example of what NOT to do comes from some UI code I found, which I have obfuscated to protect the guilty party.
char *search_key = NULL;
// If MyCheckBox is selected, copy the search_key from the
// MyTextEntry widget; otherwise, set it to "not_used"
if (gtk_toggle_button_get_active(MyCheckBox))
search_key = strdup(gtk_entry_get_text(MyTextEntry));
else
search_key = "not_used";
// Ugh! At this point, the rule about whether or not you should
// call free(search_key) are confusing. Don’t do this!
// ... subsequent code does something with search_key ...
free(search_key); // note, this is a bug half the time
search_key = NULL;
- Initialize C++ member variables that are pointers to NULL in your constructor, and in your destructor, release all member-variable pointers, and then immediately set them to NULL.
Storage::Storage :
mPointerToData(NULL)
{}
Storage::~Storage()
{if (mPointerToData != NULL)
delete mPointerToData;
mPointerToData = NULL;
}
- In C++, use
std::stringinstead ofchar *. Thestd::stringclass handles all memory management internally, and it’s fast and well-optimized. - Make sure you match up your
[]operators.One odd bug unique to C++ programs can cause mysterious memory leaks. It is easy to allocate an array of items withnew [], and then usedelete, rather thandelete []to free the memory. Any time you allocate an array withnew [], you must audit your code to ensure that the deletion is always done withdelete [], rather than justdelete. The compiler will usually let you get away with it, but at runtime you will see memory leaks and allocation arena corruption.
Finally, another helpful design tool for avoiding memory leaks is to very clearly state in the design and code comments where specific items are created, and where they are deleted. Consider making the allocations all go through a single routine, and deletions through another. In C, these can be placed in a single source file, and in C++, you can make them methods of the same object. I’ll have more to say about this in a future post.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
How CUDA can impact your world
I am in San Jose at the GPU Technology Conference, where I spent yesterday listening to some excellent talks about CUDA.
We’ve blogged before about CUDA, which is one of a handful of emerging technologies that allow applications to run compute-intensive tasks on the graphics processing unit of the computer.
Of course, here at the show we are getting the sales pitch from Nvidia, which isn’t entirely a bad thing, since it’s also educational. Something that strikes me is that every card sold by Nvidia since 2006 is CUDA-capable, which means the likelihood of the average desktop computer having a CUDA-capable card in it is becoming pretty high.
I spent 11 years working for a company that produced desktop software for scientific analysis in the oil exploration industry. I was very excited by the idea that every desktop computer now has the ability to perform highly parallel computations in a very affordable package.
In his keynote, Nvidia CEO Jen-Hsun Huang showed some slides that did what he called CEO math. This math showed performance enhancements on code that was conducive to parallel development; the performance gain from moving to CUDA was on the order of 100-200x. Wow—that’s two orders of magnitude.
The drawback to CUDA is still the need for specialized code, but as we’ve gotten more familiar with this technology, we can attest that—for certain applications—there can be a pretty dramatic performance gain from a relatively modest engineering investment.
Pretty exciting stuff.
Chad Scates has significant experience managing software development teams and developing GUI-intensive applications in a cross-platform environment (Windows, Linux, Solaris, Irix).
GCC’s Unhelpful Error Messages
Something must be done about the obtuse error messages issued by the GCC compiler, particularly when using C++ and STL classes. Take this, for example, which is the output I got recently after a one-line change:
src/emmserver.cpp:739: error: conversion from 'std::_Rb_tree_iterator<std::pair<const unsigned int, EMMServer::EMM> >' to non-scalar type 'std::_Rb_tree_iterator<std::pair<const unsigned int, std::map<short unsigned int, EMMServer::EMMSource, std::less<short unsigned int>, std::allocator<std::pair<const short unsigned int, EMMServer::EMMSource> > > > >' requested
src/emmserver.cpp:740: error: no match for 'operator==' in 'j == ((EMMServer*)this)->EMMServer::mEMMSources. std::map<_Key, _Tp, _Compare, _Alloc>::end [with _Key = unsigned int, _Tp = EMMServer::EMM, _Compare = std::less<unsigned int>, _Alloc = std::allocator<std::pair<const unsigned int, EMMServer::EMM> >]()'
/usr/lib/gcc/powerpc-unknown-linux-gnu/4.1.2/include/g++- v4/bits/stl_tree.h:210: note: candidates are: bool std::_Rb_tree_iterator<_Tp>::operator==(const std::_Rb_tree_iterator<_Tp>&) const [with _Tp = std::pair<const unsigned int, std::map<short unsigned int, EMMServer::EMMSource, std::less<short unsigned int>, std::allocator<std::pair<const short unsigned int, EMMServer::EMMSource> > > >]
It went on like this for probably 20 more lines of near-gibberish.
Andy Balaam has a humorous blog post where—almost as an aside—he generates a 1950-byte error message from a source program of only 181 bytes.
There is clearly a problem when someone starts an open source project just to post-process compiler error messages so as to make them readable. I do use STLFilt, and it’s very helpful. But it’s fixing a problem that shouldn’t exist in the first place, because our compiler tools should be better at producing useful error messages that are intelligible by mere mortals.
The problem was brought to the attention of the GCC development team almost exactly five years ago today. Unfortunately, they come down on the side of pedantic correctness instead of producing a tool that might be more useful to their end users.
As soon as I have a break on my current project, I’m hoping to investigate Clang/LLVM. It’s a new compiler technology, and I’m hopeful because they list expressive diagnostics as one of their goals. However, they do not yet support C++.
Ben Mesander has more than 18 years of experience leading software development teams and implementing software. His strengths include Linux, C, C++, numerical methods, control systems and digital signal processing. His experience includes embedded software, scientific software and enterprise software development environments.
Working with CUDA
We’ve recently been working with a cool technology that is rapidly penetrating scientific and engineering computing, but seems little known otherwise. It’s called CUDA. In a nutshell, it is an SDK to allow you to run parallelizable compute-intensive applications on your Nvidia graphics card instead of serially on your CPU.
CUDA is one of a number of emerging methods that all more or less enable the same concept. A similar approach is behind OpenCL, which is backed by Apple and purports to be more cross-platform, in that it eventually will allow developers to develop code for a range of GPUs and not just those from Nvidia. However, at the moment OpenCL is limited to Macs running Apple’s new Snow Leopard, so I’m not sure that’s more open than CUDA.
Our use of CUDA involves running a moderately intensive frame-based image-processing algorithm on extremely high-definition images. The images are 3840×1080 monsters, and the goal is to process a total of 24 such images per second, so it’s definitely not something you can accomplish on a CPU alone.
Our client in this case had originally implemented the image-processing algorithm in MATLAB, so our first task was to convert from MATLAB to C. The resulting C code was the basis for using Nvidia’s CUDA SDK to get the algorithm running on the Nvidia board. We also had to tie the entire framework into Microsoft’s DirectShow architecture, because the data is delivered to us as H.264 encoded images.
The parallelization comes through the GPU chip’s many stream processors—also called thread processors. For example, the GeForce 8 GPU has 128 stream processors. Each stream processor comprises 1024 registers and a 32 bit floating point unit. The stream processors are grouped on the chip into clusters of 16. Each cluster shares a common 16 KB memory and is referred to as a “core.”
From a software perspective, each stream processor executes multiple threads, each of which has its own memory, stack, and register file. For the GeForce 8 GPU, each stream processor can handle 96 concurrent threads (for a total of 12,288) although this number is seldom reached in practice. Fortunately, programmers do not need to write explicitly threaded code since a hardware thread manager handles threading automatically. The biggest challenge for the programmer is to properly decompose the data so that the 128 stream processors stay active.
In an efficient decomposition, the data is subdivided into chunks that can be allocated to a core. Algorithms execute most efficiently when the data for the threads executing on a core can all be stored in the core’s local memory. Data is transferred back and forth between the host CPU and the GPU’s global memory (i.e. graphic device memory) via DMA. Data is then transferred back and forth from device memory to core memory as needed. An efficient implementation will minimize the number of device to core data transfers.
In CUDA terminology, a block of threads runs on each core. While one thread block is processing its data, other thread blocks (running on other cores) are performing identical operations in parallel on other data. Thread blocks are required to execute independently and they must be able to be executed in any order. To specify the operations of a thread block, programmers define “C” functions, called kernels. The kernel function is executed by each thread in a thread block. Fortunately, there is specific CUDA support for synchronizing the threads within a block.
We have enjoyed this project immensely and are excited by CUDA’s prospects. We look forward to using CUDA on other projects in the future. More information on CUDA can be found on Nvidia’s website. I found this article particularly useful as an introduction.
Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.