[Long post, sorry. If you're short on time skip to the end for the juicy parts.]
It all started about a week ago when I decided to find out how many files were unique in the whole Debian archive, and how many were duplicated in the same package or in other packages. I had done some work on duplication detection before, and I knew that the process involves getting some kind of hash value of every document, then finding duplicate values in the list of hashes.
I already had a local Debian mirror so the raw data was all there. I basically had two options: either compute the hash value of every file myself from the packages in my mirror, or find some other (ideally faster) way to determine if a file is unique. I quickly decided that the best option was to use the md5sums embedded in most Debian packages (in the DEBIAN/md5sums control file). That would give me MD5 hashes of all regular files, excluding conffiles.
So the first step was to check how many packages have embedded md5sums, and a simple script showed that less than 3% don't have them. This first check exposed a bug in python-debian, which was duly reported. Along the way, my post prompted a discussion on debian-devel about the state of md5sums, and I set up a daily check to keep track of things.
The next step was to make sure that all md5sum-enabled packages had usable information. It turned out that python-debian choked on 103 packages because of embedded spaces in filenames, and that it also had a blocker bug when used in Python 2.4. I filed two more bugs.
All that remained was the easiest part: write the program to find duplicate files. I did that, and I now have all kinds of funny statistics:
- The dataset consists of 20170 packages with md5sums, shipping a total of 2069830 files. That gives an average ratio of 102 files per package, excluding conffiles.
- There are 113732 duplicates in the archive. 1556 files are duplicated more than 10 times, and 14 files are duplicated more than 300 times.
- The empty file is present 8325 times in the archive, spread over 874 packages. This isn't surprising since it's used for all kinds of purposes like Python's init.py files, Perl's .bs files, etc. I also learned (among other oddities) that the python2.4-doc package ships a few zero-byte .png files. Uh?
- Also popular is the file with just one newline character in it: 343 occurrences. In the same vein, we have 461 occurrences of the "deny from all\n" file.
- Most of the hits are for Doxygen images in -dev and -doc packages, namely doxygen.png, tab_b.gif, tab_l.gif, etc (about 350 hits each). In the same category, gjdoc CSS files (149 hits).
- The partlibrary package is our worst offender. It ships a total of 9680 non-empty files, and only 4833 of them are unique. 6 files are duplicated more than 400 times each in the package.
If you're still reading so far, thanks! You can see the
report file, and the
program itself. If you want to run it, note that it's optimized for speed rather than memory efficiency; it runs in under 3 minutes but uses up to 1.5GB of memory (my home desktop has 4GB).
The grand conclusion is that all things considered, there is very little file duplication in the archive: the #1 duplicate represents less than 0.4% of the two million analyzed files, and it doesn't actually use any space since it's an empty file... :)