Hashing and Caching

DeWitt Clinton
August 2004

Following up on the earlier discussion of the SHA-0 crack, and the potential that MD5 and SHA-1 may be compromised as well, this addresses the implications on the Perl Cache library.

Th Perl Cache library uses a one-way hashing function in the FileCache implementation. For any given piece of data, the serialized image of that data is persisted to disk at a location identified by an absolute path (e.g., "/tmp/cache/"), an namespace (e.g., "myapplication"), and the application-specified key (e.g., "foo"). The last element is then hashed via a crytographic hash function into a hexadecimal ASCII string to provide a unique filename. This filename is examined for the first n characters (defaulting to 3), which are used to build a directory tree. The random distribution of the hashing function provides an evenly balanced distribution among cache elements across the filesystems.

For example, if the key is "foo", then the hashed unique key (via SHA-1) is f1d2d2f924e986ac86fdf7b36c94bcdf32beec15. This unique key is concatenated with the root filepath, the namespace identifier, and the first three characters of the unique key to create the absolute file path /tmp/cache/myapplication/f/1/d/f1d2d2f924e986ac86fdf7b36c94bcdf32beec15.

The only to arrive at that unique filename would be to know the root directory, the namespace, and the key for a particular piece of data. But once you knew that, you could easily and unambiguously locate the cached data.

In most applications there is no particular concern about whether or not the location of the file remains secret, insofar as the cache files are accessible only through the application layer.

However, imagine a scenario in which an application is using Cache::FileCache to store sensitive information, such as a banking statement. In this scenario, it takes a long time to lookup the customer records, those records are stored behind a key built from the customer's username ( e.g., "dclinton" ). After the records have been cached with $cache->set( "dclinton", $record ) they can easily be retrieved by calling $record = $cache->get( "dclinton" ). In this case, the banking record would be stored at /tmp/cache/bankrecords/7/c/3/7c3c15bfe853891d66d4bb900bccd5e3ea9229b2.

Now, the banking records would usually be safe from prying eyes, as the application itself would be supplying the username, and the security of the application is designed such that it is difficult to log in as someone else's username. (If it isn't difficult to do so, then the application has bigger problems.)

Moreover, since the only (statistically probable) way to get back the unique key 7c3c15bfe853891d66d4bb900bccd5e3ea9229b2 is to enter the input dclinton, and the string dclinton is (presumably) impossible to spoof, then that data is available only to the intended user.

However, what if you could pick another string out of the range of possible strings that also gave you the unique key 7c3c15bfe853891d66d4bb900bccd5e3ea9229b2? In fact, you can. However, since you have no way of knowing in advance what other string will generate that unique key, you'd have to try 2^80 different strings before you got a match (the square root of the size of the keyspace due to the birthday paradox).

Well, how long would that take? If you could hash and compare 10,000 strings a second on one machine (the 256-node cluster that broke SHA-0 claims to have used 80,000 CPU hours to search a keyspace of 2^51 -- showing roughly 30,000 keys per second.), and you had 1000 machines at your disposal, then it would take... oh, about one trillion times the age of the universe. Just a bit too long to wait.

But what if there was a crack in the cryptographic integrity of the hashing algorithm? What if that crack enabled you to reduce the searchable keyspace from 2^80 down to 2^48? Suddenly, generating an arbitrary key that matches 7c3c15bfe853891d66d4bb900bccd5e3ea9229b2 takes maybe half a day (10 hours) on a 1000-node supercomputer. And in a few years, it would take only half a day on a desktop computer.

Now imagine if this application let you choose your own username. So you search the keyspace to find another username that generates the same hash as dclinton, and log in via that hash-equivalent matching username to fool the application into giving you dclinton's banking records. (Though I can assure you that this account alone not worth getting the time on the supercomputer.)

Granted, this example is highly contrived. The chances of a matching key being found that also meets the requirements of being a valid username (i.e., less than 8 characters) are 1 in 10^36. The odds of knowing enough about the application to crack it like this (seeing as how the hash is hidden from the user, though see tomorrow's follow up on two-token security for where it does matter), are minimal, and would require inside information. (Unfortunately the odds of someone writing your unencrypted banking information to disk are probably pretty high.)

Cache::Cache currently uses SHA-1. I switched it off of MD5 a few years ago because of a bug with the Digest::MD5 implementation, but was happy to get the extra 32 bits of keyspace. However, since the chances of accidentally collisions is no lower than before (i.e., astronomically low), and it's unlikely that SHA-1 has been reduced to problem anywhere near the order of 2^48 (we'll wait and see), I see no reason to replace the current implementation at this time. There are other applications in which this will likely have an impact, but generating the cache key is not one of them.

However, if you're a user of Cache::Cache and you disagree with this assessment, please let me know and we can come up with a work-around.