27 August 2007

More fun with md5sums: collisions

During my experiments with file duplication in the Debian archive, it occurred to me that having a list of files with identical MD5 hashes was a good starting point for finding MD5 collisions in the archive. If any of those files were different but had the same hash as computed by MD5, I'd have a collision.

Unfortunately, checking if the files differ involves extracting the data tarball of each affected package and computing another hash for the files (I used SHA-1), which takes a while. 3h32m and 47GB of extracted files later, I now have the results and there is no MD5 collision in the Debian archive. The chances were slim if not null, but at least now I'm sure.

(In fact I did find one occurrence, but it turned out that the file's MD5 hash in DEBIAN/md5sums was incorrect, for some reason.)

3 comments:

Anonymous said...

If there were a billion files, then the chance of a collision would be roughly one in 2**128/2**30, or 2**98.

Romain Francoise said...

Never tell me the odds!

Anonymous said...

are you sure you have not found a pair of files whos md5 and sha1 collide?