I’d like to take a deeper look at this bug and the process that allowed it to be created. Catastrophic failures are often caused by a chain of bad decisions or errors that finally result in a systems collapse — and that’s the case here. Contrary to what’s been reported, I don’t see this as a single, simple bug.
The feature in question is a “heartbeat” function used for one host to check to see if another host is alive. I’m personally not a fan of keep-alive probes. I always go back to the late W. Richard Steven’s TCP/IP Illustrated Volume 1, chapter 23, where he mentions the controversy about the 2-hour keep-alive timer in TCP. I understand that half-open TCP connections can be difficult to detect in some cases, but the fact is, even if a keep-alive succeeds, the next packet sent may fail as the network can go down at any time. The keep-alive doesn’t give you real assurance that the link between two hosts is up at any moment other than the actual keep-alive. So it doesn’t really do what people think it does.
RFC 1122, the Host Requirements RFC, section 126.96.36.199 states that keep-alives are optional behavior in TCP, and further states:
A TCP keep-alive mechanism should only be invoked in server applications that might otherwise hang indefinitely and consume resources unnecessarily if a client crashes or aborts a connection during a network failure.
My take on this is if you are designing such a server, you have already made a mistake that keep-alives won’t protect you from. Servers with this type of design simply will not scale. Servers need to accept that clients on the other end of connections come and go, and ideally should not keep any per-client state. While this ideal cannot always be realized, server side state should be entirely minimal. Keep-alives themselves also have scaling problems. Imagine if every idle connection on the Internet sent a keep-alive packet, say, every 30 seconds. The result of this thought experiment is every core router on the Internet would melt down into slag in minutes. If you’re designing protocols that scale, you don’t use heartbeats.
So my first thought was, what a misfeature — what is it even doing in SSL/TLS? TCP already has a keep-alive feature; why is one necessary at the encryption layer above it? The answer can be found in the TLS & DTLS Heartbeat Extension RFC. There it is explained as allowing for keep-alive functionality and for path maximum transmission unit (PMTU) discovery in DTLS (the Datagram version of TLS). OK, you get the idea that I think keep-alives are usually not a good idea, but let’s discuss PMTU discovery.
One evil of the modern Internet is the fact that PMTU (Path Maximum Transmission Unit) discovery is totally broken. The idea behind PMTU discovery is that two hosts dynamically discover the largest packet that can be sent between them without fragmentation, and then communicate using that packet size. Originally, PMTU discovery was done with ICMP packets. However, unsophisticated network administrators in the late 90s started to block all ICMP packets in firewalls due to bugs such as the so-called “ping-of-death“, thus breaking this elegant mechanism. Blocking all ICMP rather than just problematic ones actually does not confer any security benefit — it just renders some hosts unable to communicate with others. It has been known to be a problem for a long time, but educating firewall administrators proved fruitless.
So, to work around the widespread blocking of ICMP packets, RFC 4821 was developed, which uses regular protocol packets to probe the MTU size. These regular packets can’t easily be filtered at the firewall, so path MTU discovery works via an alternate mechanism now.
So the designer of the TLS heartbeat extension intended for it to be usable to hack around firewall misconfiguration caused by bugs in old stacks from the 90s:
DTLS performs path MTU discovery as described in Section 188.8.131.52 of . A detailed description how to perform path MTU discovery is given in . The necessary probe packets are the HeartbeatRequest messages.
And it is also intended to check aliveness more rapidly than the TCP keep-alive feature (again, a bad idea in my opinion):
Sending HeartbeatRequest messages allows the sender to make sure that it can reach the peer and the peer is alive. Even in case of TLS/TCP this allows a check at a much higher rate than the TCP keep-alive feature would allow.
The first use case requires a payload of variable length to be sent. The second does not. It appears the author decided to not treat TCP and UDP transports separately, and provide the same functionality on both. Some justification here could be provided by realizing that TLS/DTLS could theoretically be run on transports other than TCP and UDP, although I suspect such usages are uncommon and after all, the only reason for this feature is because of deficiencies in the current state of TCP and UDP!
So we have a hack here, and one of the things about hacks is they often have unintended consequences. What about the 64K payload length of the keep-alive packet? How can it be justified? Well, let’s look at an IPv4 packet going through a socks5 proxy — the IP header has a 16-bit length field. UDP packets have a redundant length field in them, also 16 bits. Then the DTLS heartbeat packet has a 16-bit length field within it. All these lengths redundantly provide the same information to the IP stack. One of the things I try to do while programming is to absolutely minimize redundant information — I have a strong belief that redundantly storing the same information in multiple places in a program’s data structures leads to nothing but bugs due to the potential for the information getting out of sync. Even if the program is perfectly coded initially, over time, as it undergoes maintenance, these values tend to not be reliably maintained in sync with each other. The relational database folks understood this in the 1970s: They called this important design principle — Third Normal Form. I’m sure the justification for this redundancy is the layering concept that different transports can be plugged in and out under TLS/DTLS. Still, just like TCP/IP stacks must be very careful to ensure that the redundant UDP packet length and the IP packet length are in concordance, the heartbeat extension authors should have ensured the same check was done at the higher protocol layers, and considered that protocols that don’t have underlying length information in them are few and far between before redundantly storing the same information a third time in a single data structure.
A typical IPv4 header is 20 bytes. With all sorts of options turned on, it can be up to 60 bytes in length. A UDP datagram header is 8 bytes in length. The heartbeat header is 3 bytes, and it requires at least 16 bytes of trailing padding. So the safe maximum length for a heartbeat payload would be 65535 – 60 – 8 – 3 – 16 = 65448 bytes. In fact, the heartbeat spec says the maximum payload length allowed is 2^14 = 16384, although since they allow 0 length payload, I’m going to assume this was the classic off-by-one programmer error and the spec was actually meant to be (2^14)-1 =16383 bytes maximum payload length. So another bug in the spec here.
Which is fine. The problem I have with this is we’re storing a quantity that is supposed to be a 14-bit integer in a 16-bit field. The top two bits should have been marked as reserved and set to zeroes, and even if nonzero, should not have been defined or used as part of the length field. This is a further design bug.
Finally, we come to the implementation in OpenSSL. It has been pointed out by a programmer with impeccable security credentials that the homegrown malloc/free implementation is a key part of what made this bug so dangerous — another design bug. But even if that were not the case, the buffer overrun part of the bug would still have potentially leaked memory on systems that don’t scrub the heap after freeing memory for efficiency or implementation reasons.
My first thought upon seeing the source was “Coverity would have caught this bug.” Coverity is a static C/C++ checker that examines your code for bugs like this. There are other tools that do the same thing — Coverity is just the one I’m most familiar with. I know Coverity often offers free scans to open source developers and I wondered if they had scanned OpenSSL. I uncovered this damning exchange on an OpenSSL mailing list.
Basically, because Coverity discovers false positives (and it does, you need to either adjust your coding standards to make things clear to Coverity or configure it not to complain about such things), the OpenSSL developers said they wouldn’t use it to find bugs until the Coverity developers fixed all their bugs. This is an epic case of hubris. I remember the first time I used Coverity it found hundreds of non-errors in my code, which was disappointing, and it found a few that I believe we would have caught in code review, but it definitely found one serious concurrency bug that I am absolutely confident I would have shipped with if it was not for the tool. I came away very impressed and I feel tools like these are worthwhile for what they do catch despite the potential for a large number of false positives.
Also note that because of the replacement of the system malloc/free in OpenSSL, other tools that do runtime analysis, such as valgrind, would not find the bug. This is another reason that feature of OpenSSL was a bad idea.
So, as is typical of large disasters, there isn’t just one simple root cause for the heartbleed bug. Depending on how you count them, perhaps eight or nine bugs worked together over a decade to get us where we are today:
- Indiscriminate blocking of ICMP packets in firewalls largely due to 90s TCP/IP stack bugs in Microsoft Windows;
- Adding controversial features to a protocol;
- Adding a feature (PMTU discovery) for all protocols because it was necessary for one protocol (UDP);
- Redundantly specifying the same information multiple times in a data structure;
- An off-by-one error;
- A length field allowing more space than necessary;
- Failure to validate user input;
- Failure to use tools available to look for bugs; and
- Replacing core OS functionality such as heap management at the secure network layer.
All played a part in the heartbleed bug. While the “fix” this time is a single line of code in OpenSSL, I am quite sure many more bugs lurk given the chain of events that got us here.