Hash Function Integrity

DeWitt Clinton
August 2004

As mentioned on Slashdot yesterday, there are reports that the SHA-0 hashing function has been demonstrated to be cryptographically insecure. In the particular case of SHA-0, this is not a surprise -- SHA-0 has long been known to be vulnerable. In fact, the National Security Agency itself stopped the roll-out of SHA-0 as a standard, and instead proposed the changes that led to SHA-1. The NSA never revealed the weaknesses however, so the announcement that Antoine Joux uncovered the flaw is academically interesting news.

However, it's not that important in the real world, as few, if any, people are using SHA-0. What is far more interesting is the claim that a vulnerability in MD5 may also be announced in the near future. And potentially in SHA-1 as well.

The latter two announcements are very relevant, as the integrity of the one-way relationship of those cryptographic hash functions is a central component in numerous password authentication and online security models.

For those that are unfamiliar with the concept of a one-way hash, it is essentially this: Given any arbitrary piece of data, be it a text document, a binary file (such as a JPG or a software application), or a password string, there can exist a cryptographic function that encodes the original data such that the encoded form is a) the same every time the function is applied, b) very hard to work backward from to find the original data, c) very hard to find another piece of data that would create the same hash, and d) collision-resistant insofar as that it is unlikely that two pieces of data will create the same hash.

For example, say I took the string "The quick brown fox jumps over the lazy dog." and ran the MD5 function over it. The result would be "0d7006cd055e94cf614587e1d2ae0c8e". That is a 128 bit string comprised of 32 characters, each with 4 bits of information represented by the base-16 number 0-9a-f. Now, working from that result string, it would be computationally difficult to come up with the original string. So difficult, in fact, that barring any shortcuts, there isn't enough time in the universe to do it. (In the real world, of course, there are shortcuts -- simply by guessing that the original string is in English reduces the problem size.) Moreover, it is (almost) equally difficult to come up with any other string that produces the same result. (Actually, since the difficultly is the square root of 2^128, i.e., 2^64, people feel that MD5 is vulnerable to a brute force attack -- but that's a lot of brute force, and just beyond conventional means today.)

SHA-1 goes even further, and produces the string "9c04cd6372077e9b11f70ca111c9807dc7137e4b" from the same input. That's 40 4-bit characters, for a total of 160 bits of data. This is large enough that, if SHA-1 is sound, we really don't have to worry about brute-force attacks in the foreseeable future. (The attack on SHA-0, which also hashes to 2^160 bits, worked because they reduced the problem to a "mere" 2^51, which they were able to brute force using 80,000 hours of CPU time on a 256-node supercomputer.)

The one-way nature of this relationship of these hash functions makes it possible to improve computer security (and credit card security, and online transaction security, etc.) by avoiding the weakness in storing passwords. The problem with storing plain-text passwords is that if anyone can read the password file, they have access to the data it is protecting. And typically, all of those passwords would have to appear in the same place, for all accounts on that machine.

However, if you only store the hash of the password, then access can be denied unless the end-user can come up with a string (i.e., the password itself) that produces the same hash. Since the chances of collision are minimal and since the hash function is one way, it is very difficult to crack. (Though again, in the real world, some obvious assumptions about what people will chose for a password, such as short English words, will vastly reduce the problem space.) And moreover, discovering the hash itself reveals nothing about either the original password nor any other password used on that machine.

The original /etc/password file on Unix machines used a weaker 56-bit encryption scheme called Data Encryption Standard (DES) which was adequate in the 1970's but gradually fell to brute force attacks as computers got faster. These days a stronger 128-bit password hash is used in almost all systems.

So these announcements may have consequences -- though not particular severe ones if the problem is reduced in complexity only to a size of around 2^64. This is still brute-forcible on a supercomputer or with a distributed network, but not on a case-by-case basis. However, if the attacks prove to be able to reduce the problem to 2^48 or so, then all of our existing security models need to be re-evaluated ASAP.

Tomorrow I will write up the potential impact on the two token security model that I described a few years ago (and most secure e-commerce sites use at least a variant of), and also discuss the impact on the data caching library that I developed that a fair number of online sites rely on.