We’ve discussed machine learning in a few previous blog posts, covering best practices and how to overcome some of the inherent challenges in data set selection and training when using artificial intelligence in the real world. With the hard part out of the way, you’re now ready to hand your work off to the other developers on the project so they can wrap your machine learning algorithms in some software and drop them on a server up in the cloud or embed them in a device. You could spend your time filling out patent applications for all those cool new ideas you came up with in the algorithms, but unforeseen challenges may still exist.
How the Real World Differs from the Lab
What works great in laboratory settings doesn’t always work as well in the real world. Accurate weather modeling has improved but continues to suffer from a lack of real-time data from sensors. Self-driving cars would be vastly improved if all the roads in the country had identical signage, lane markings and overall construction. Unfortunately, the real world just isn’t the same as a controlled lab environment. Artificial intelligence is great at many things, including analyzing reams of data quickly, but the technology has a difficult time deciding what to do during unfamiliar situations. When it comes to the impact of artificial intelligence in everyday life, variability is bad.
Let’s discuss three examples of variability that drive AI crazy and what you can do about them.
Artificial Intelligence Examples Impacted By Human Behavior
It may be obvious that human behavior can drive AI crazy — it drives us all crazy at times. But when a small group of dedicated engineers works on a product for months in a lab by themselves, they “teach” the AI how they alone function, and the machine learning algorithm becomes trained on that small set of human behaviors.
Imagine something as simple as a mobile app that takes a picture of something, and AI analyzes the image. For this example, let’s say the something is a mole on your skin, and the app is used to determine the type of mole and if it poses a danger to your health. As a developer, you’ve taken hundreds of pictures of moles on many unique people using a number of different phone cameras. You’ve created a massive data set of these images and matched each image with the correct pathology results. You’ve trained the algorithm and tested it against a portion of the images in the set. Everything is good until you hand it over to people to test during a trial.
During the initial uncontrolled trial, the app fails often. Why? When taking a picture of their mole, most of the people may have to contort themselves into weird pretzel shapes to get the picture of the mole on their back. Or they take the picture using the reflected image in a mirror. These tactics produce images that are not the same as the images you took in the lab. They have distinct shadows and odd reflections that confuse the algorithms. Some images may be out of focus, overexposed or blurred. Even taking the picture in direct sunlight may be a very different environment from the overhead lights in your lab environment. Worse yet, some of the moles may be on sunburned skin, which will confuse even the best algorithms that were trained with pictures of moles on “indoor” skin of many colors and varieties.
So what can you do to address this concern? Your data set must include these instances, or you need to prevent their occurrence. You could give an early device to novice users, see what their submitted images look like and include many of them in your machine learning training. Or you can reduce the likelihood of these irregular images being submitted by providing detailed instructions on how to take the picture. Provide suggestions on lighting, backgrounds, angles, etc. to produce more consistent images that the algorithms are familiar with. In short, it’s crucial to consider how people will use the device outside of the lab.
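One way to reduce irregular submissions is to screen each photo before it reaches the classifier and ask the user to retake it if it fails. The following is a minimal Python sketch of that idea, not a production implementation. It assumes the photo has already been converted to a grayscale grid of 0-255 values, and the function names and threshold values are hypothetical placeholders that would need tuning against images from real users.

```python
from statistics import mean, pvariance


def laplacian_variance(pixels):
    """Variance of a 4-neighbor Laplacian; low values suggest a blurred image.

    pixels: list of rows of grayscale values, at least 3x3.
    """
    height, width = len(pixels), len(pixels[0])
    responses = [
        pixels[y - 1][x] + pixels[y + 1][x] + pixels[y][x - 1] + pixels[y][x + 1]
        - 4 * pixels[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def quality_issues(pixels, min_sharpness=50.0, min_brightness=40.0,
                   max_brightness=215.0):
    """Return a list of reasons to ask for a retake (empty if none).

    All three thresholds are illustrative assumptions, not tuned values.
    """
    issues = []
    brightness = mean(value for row in pixels for value in row)
    if brightness < min_brightness:
        issues.append("too dark")
    if brightness > max_brightness:
        issues.append("overexposed")
    if laplacian_variance(pixels) < min_sharpness:
        issues.append("blurry")
    return issues
```

In an app, a non-empty result could trigger a prompt such as "move to better light and try again" instead of sending the image for analysis.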
How Human Physiology Affects Artificial Intelligence in the Real World
If human behavior isn’t variable enough on its own, consider the differences between individual humans, the differences as people age and the differences that can be influenced by illness and medications. Designing algorithms for use in a medical device is a long process, and no small part of it is testing the device on many different people under all sorts of circumstances. Each of us is unique, and we change as we age. Our illnesses and the medications we take have many noticeable and invisible effects on how our bodies function.
Consider the design of a clip-on pulse oximeter, a device used to noninvasively determine the pulse rate and oxygen saturation level in your blood. The device functions by shining multiple frequencies of light through your skin and blood and sensing the output light levels. You’ve included different skin colors in the data collection testing and accounted for this in training the algorithms used to determine pulse rate and oxygen saturation. You have even anticipated that the device will be used on people who fidget and on babies who move constantly. The algorithms used to detect motion and compensate for it can be difficult to design, and they require a great deal of experimentation to collect the data necessary for algorithm training.
You then begin a trial in a hospital, and the data collected shows that your device and your algorithm are performing well when compared to drawn-blood sample testing. Near the end of the trial, you test your device on patients that are very ill. They fall in and out of consciousness, experience long periods of trembling and shivering and are on multiple medications that constrict blood flow throughout their body. On these sick patients, your device does not perform well at all. For the people most in need, your device and the algorithms fall short of the desired result.
The data you collect on these patients is invaluable and must be incorporated into your overall training plan. More importantly, you may need to design new algorithms to detect some of these situations, like excessive periodic motion, and devise new algorithms to provide accurate results under these conditions. You may have to go back to the drawing board and start again. Worst case, you may need to restrict the use of your device and provide instructions informing the user that the device should not be used on small infants or patients experiencing excessive motion. Fortunately, these issues have been studied and resolved for pulse oximetry, but your new device will undoubtedly encounter its own set of unexpected situations to overcome. Consequently, it is important to be prepared to test many different people and incorporate the test results into algorithm design and training.
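As one illustration of detecting excessive periodic motion, the sketch below flags rhythmic movement such as shivering in a window of motion-sensor samples using autocorrelation. It assumes the device has an accelerometer, which the example above does not state, and the function name, frequency band and thresholds are hypothetical values chosen for illustration rather than clinically validated numbers.

```python
import math


def has_periodic_motion(samples, sample_rate_hz, min_hz=3.0, max_hz=12.0,
                        min_rms=0.05, min_correlation=0.6):
    """Return True if the window contains strong, rhythmic motion.

    samples: accelerometer magnitudes (hypothetical units of g) for one window.
    The frequency band and both thresholds are illustrative assumptions.
    """
    n = len(samples)
    if n == 0:
        return False
    average = sum(samples) / n
    centered = [s - average for s in samples]
    energy = sum(c * c for c in centered)

    # Ignore windows where the sensor is essentially still.
    if energy == 0 or math.sqrt(energy / n) < min_rms:
        return False

    # A repeating signal correlates strongly with itself when shifted by one period.
    shortest_lag = max(1, int(sample_rate_hz / max_hz))
    longest_lag = min(n - 1, int(sample_rate_hz / min_hz))
    for lag in range(shortest_lag, longest_lag + 1):
        overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if overlap / energy >= min_correlation:
            return True
    return False
```

A window that returns True could be routed to a motion-tolerant algorithm, or the reading could be withheld with a message to the clinician, depending on what the trial data supports.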
Device Variability and its Impact on Machine Learning in the Real World
You’ve been testing for months and your algorithms can handle any situation imaginable. You’ve tested with multiple different testers, under different circumstances and everything appears to be working great. If you’ve been able to perform your data collection and testing using a few hundred devices, then you’re lucky. But what if the device is expensive and you only have 10-20 hand-built devices for data collection, testing and final verification/validation of results? You’ve just encountered another source of variability: the device itself.
A few devices built in the lab by engineers with advanced degrees are very different from the hundreds of devices that pour off the assembly line after purchasing has applied their vendor selection preferences. Initial trials often start with the hand-built devices because that is all you have: manufacturing has not been engaged at this point, or the cost of the device is too high to justify burning through a lot of money on devices that cannot be sold because they were used in the test program. You’ve collected a lot of data using the first 10 devices built, and this data was used to train the algorithm. When you load the final algorithm into the same devices, they work very well and give you the expected results. Your boss is happy, the CEO is happy and the investors are less nervous.
But then you get to the final trial, a larger trial that requires more devices. The FDA provides guidance to help you determine the number of devices and patients required to conduct an appropriate trial. So additional devices are built in a new product realization (NPR) mode in the corner of the manufacturing shop. These are built using engineering specifications and drawings and often using the same supplier of parts as the original hand-built units. All the devices are within specifications when they are released for the trial. When these devices are used in testing, however, they may produce failing results or results outside of the expected range. It seems the algorithms have developed a “preference” for the original devices used to collect the initial test data. The algorithm designers may go so far as to advise the device engineers not to use the newly built devices and to stick to the original devices for the testing. The problem is obvious: if the algorithms only work on the original units, manufacturing will never be able to produce a device that works correctly.
This is a significant issue, often left undiscovered until the end of a project. To overcome this, you need to involve manufacturing and purchasing early in the development process. The specifications must be tight, realizable and testable to ensure compliance. Single-source vendors must be identified, if necessary, to minimize component variability. Manufacturing must create processes that will produce the same results over time, and the algorithm developers must understand that device variability is inevitable, even when all the devices are operating within desired specifications. Train the machine learning algorithm with as many devices as possible. Use devices that can be forced to operate at the edge of the specifications. Identify those parameters that must be tightly controlled so they don’t push the algorithms too far out of their familiar comfort zone. Creating great algorithms that are accurate on manufacturable devices is essential.
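A simple way to check whether an algorithm has developed a preference for particular units is to hold out every sample from one device, train on the rest and measure the error on the device the model has never seen. The sketch below shows that leave-one-device-out split in Python. The sample layout (`device_id`, `reading`, `reference`) and the toy offset-correction model are hypothetical stand-ins for a real data set and model.

```python
from collections import defaultdict
from statistics import mean


def leave_one_device_out(samples):
    """Yield (held_out_id, training_samples, test_samples) for each device."""
    by_device = defaultdict(list)
    for sample in samples:
        by_device[sample["device_id"]].append(sample)
    for held_out in sorted(by_device):
        training = [s for device, group in by_device.items()
                    if device != held_out for s in group]
        yield held_out, training, by_device[held_out]


def held_out_errors(samples, fit, error):
    """Train on all other devices, then score each device the model never saw."""
    return {device: error(fit(training), test)
            for device, training, test in leave_one_device_out(samples)}


# Toy model (assumption): learn one constant offset between reading and reference.
def fit_offset(training):
    return mean(s["reading"] - s["reference"] for s in training)


def mean_abs_error(offset, test):
    return mean(abs(s["reading"] - offset - s["reference"]) for s in test)


def make_samples(device_id, offset):
    """Two made-up samples for a device whose sensor reads high by `offset`."""
    return [{"device_id": device_id, "reading": ref + offset, "reference": ref}
            for ref in (10.0, 20.0)]


# Hypothetical units: A and B behave alike, C sits elsewhere within the spec.
samples = make_samples("A", 1.0) + make_samples("B", 1.0) + make_samples("C", 3.0)
errors = held_out_errors(samples, fit_offset, mean_abs_error)
```

A device whose held-out error stands well above the others, like the hypothetical unit C here, points to variability the training data did not cover.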
Overcoming Real-World AI and Machine Learning Challenges
These problems can be overcome if they are anticipated upfront. Replicate as much of the real world as possible during data collection and training, then test the final algorithms on representative devices in the real world with real people. Reduce variability as best you can and design algorithms to deal with the variability you can tolerate. Finally, make sure the user understands their part in variability reduction, and set limits of use for the variability that cannot yet be diminished or resolved.
The real world can be a nasty place, but that is where good artificial intelligence and machine learning are needed most.
From Bluetooth headphones with active noise cancellation and mobile apps to video streaming and medical test diagnosis, we uniquely understand machine learning and have employed it across a number of projects. If you’re looking to leverage machine learning for your next project, contact Cardinal Peak to let us know how we can help!