A study of file duplication in the Debian archive

It all started about a week ago when I decided to find out how many files were unique in the whole Debian archive, and how many were duplicated in the same package or in other packages. I had done some work on duplication detection before, and I knew that the process involves getting some kind of hash value of every document, then finding duplicate values in the list of hashes.
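The hash-then-compare idea can be sketched in a few lines. This is only an illustration of the general technique, not the code used for this study; the function name `find_duplicates` and the sample documents are made up for the example.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group document names by the MD5 digest of their content.

    `documents` maps a name to its content as bytes. The result maps
    each digest that occurs more than once to the sorted list of names
    sharing it.
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        by_digest[hashlib.md5(content).hexdigest()].append(name)
    # Keep only digests shared by at least two documents.
    return {d: sorted(names) for d, names in by_digest.items() if len(names) > 1}


# Hypothetical sample data: "a" and "b" have identical content.
sample = {"a": b"hello", "b": b"hello", "c": b"world"}
duplicates = find_duplicates(sample)
```

Here `duplicates` has a single entry whose value is `["a", "b"]`; "c" is unique and does not appear.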

I already had a local Debian mirror so the raw data was all there. I basically had two options: either compute the hash value of every file myself from the packages in my mirror, or find some other (ideally faster) way to determine if a file is unique. I quickly decided that the best option was to use the md5sums embedded in most Debian packages (in the DEBIAN/md5sums control file). That would give me MD5 hashes of all regular files, excluding conffiles.
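For reference, here is a minimal sketch of reading such a file, assuming each line holds a 32-character hexadecimal digest, some whitespace, and then the path (the format the `md5sum` tool produces). The function name `parse_md5sums` and the sample paths are hypothetical, and this is not python-debian's parser.

```python
import string


def parse_md5sums(text):
    """Parse the text of an md5sums control file into (digest, path) pairs.

    Splitting only on the first run of whitespace keeps any spaces
    inside the path intact.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("malformed md5sums line: %r" % line)
        digest, path = parts
        if len(digest) != 32 or any(c not in string.hexdigits for c in digest):
            raise ValueError("not an MD5 digest: %r" % digest)
        entries.append((digest.lower(), path))
    return entries


# Hypothetical sample content, including a path with embedded spaces.
sample = (
    "d41d8cd98f00b204e9800998ecf8427e  usr/share/doc/foo/empty\n"
    "0123456789abcdef0123456789abcdef  usr/share/foo/file with spaces.txt\n"
)
entries = parse_md5sums(sample)
```

The second sample line is there on purpose: a parser that splits on every space would cut that path into three pieces.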

So the first step was to check how many packages have embedded md5sums, and a simple script showed that fewer than 3% lack them. This first check exposed a bug in python-debian, which was duly reported. Along the way, my post prompted a discussion on debian-devel about the state of md5sums, and I set up a daily check to keep track of things.

The next step was to make sure that all md5sum-enabled packages had usable information. It turned out that python-debian choked on 103 packages because of embedded spaces in filenames, and that it also had a blocker bug when used with Python 2.4. I filed two more bugs.

All that remained was the easiest part: write the program to find duplicate files. I did that, and I now have all kinds of funny statistics:

If you're still reading, thanks! You can see the report file, and the program itself. If you want to run it, note that it's optimized for speed rather than memory efficiency; it runs in under 3 minutes but uses up to 1.5 GB of memory.
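The speed-for-memory trade-off comes naturally from keeping every digest in one big in-memory table. The sketch below shows that general approach under stated assumptions; it is not the program linked above, and the function name `duplicate_stats`, the statistics it returns, and the sample packages are all invented for illustration.

```python
from collections import Counter, defaultdict


def duplicate_stats(packages):
    """Summarize duplication across packages.

    `packages` maps a package name to a list of (digest, path) pairs.
    Everything is held in one dictionary, which is fast but memory-hungry.
    """
    owners = defaultdict(list)  # digest -> [(package, path), ...]
    for pkg, entries in packages.items():
        for digest, path in entries:
            owners[digest].append((pkg, path))

    within = 0  # digests repeated inside a single package
    across = 0  # digests present in more than one package
    for locations in owners.values():
        per_pkg = Counter(pkg for pkg, _ in locations)
        if any(n > 1 for n in per_pkg.values()):
            within += 1
        if len(per_pkg) > 1:
            across += 1

    top_digest, top_locations = max(owners.items(), key=lambda kv: len(kv[1]))
    return {
        "files": sum(len(v) for v in owners.values()),
        "unique_files": sum(1 for v in owners.values() if len(v) == 1),
        "dup_within_package": within,
        "dup_across_packages": across,
        "top_duplicate": (top_digest, len(top_locations)),
    }


# Hypothetical input: digest "aaa" appears twice in foo and once in bar.
sample = {
    "foo": [("aaa", "usr/share/foo/a"), ("aaa", "usr/share/foo/b"),
            ("bbb", "usr/bin/foo")],
    "bar": [("aaa", "usr/share/bar/a"), ("ccc", "usr/bin/bar")],
}
stats = duplicate_stats(sample)
```

With the sample input, five files are counted, two of them unique, and the top duplicate is "aaa" with three copies. A lower-memory variant could write the digests to disk and sort them instead, at the cost of speed.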

The grand conclusion is that all things considered, there is very little file duplication in the archive: the #1 duplicate represents less than 0.4% of the two million analyzed files, and it doesn't actually use any space since it's an empty file...
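An empty file is easy to recognize in a list of hashes because the MD5 digest of zero bytes is a fixed, well-known value. The snippet below computes it and shows the arithmetic behind the percentage: 0.4% of two million files is 8,000 copies, so the top duplicate appears fewer times than that. The helper name `share_percent` is made up for this example.

```python
import hashlib

# MD5 digest of zero bytes, i.e. of any empty file.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def share_percent(count, total):
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


# Upper bound implied by "less than 0.4% of two million files".
bound = share_percent(8000, 2000000)
```

`EMPTY_MD5` is `d41d8cd98f00b204e9800998ecf8427e`, and `bound` works out to 0.4.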

Posted August 21, 2007
Previously: Debian packages without md5sums