21 August 2007

A study of file duplication in the Debian archive

[Long post, sorry. If you're short on time skip to the end for the juicy parts.]

It all started about a week ago when I decided to find out how many files were unique in the whole Debian archive, and how many were duplicated in the same package or in other packages. I had done some work on duplication detection before, and I knew that the process involves getting some kind of hash value of every document, then finding duplicate values in the list of hashes.

I already had a local Debian mirror so the raw data was all there. I basically had two options: either compute the hash value of every file myself from the packages in my mirror, or find some other (ideally faster) way to determine if a file is unique. I quickly decided that the best option was to use the md5sums embedded in most Debian packages (in the DEBIAN/md5sums control file). That would give me MD5 hashes of all regular files, excluding conffiles.

So the first step was to check how many packages have embedded md5sums, and a simple script showed that less than 3% don't have them. This first check exposed a bug in python-debian, which was duly reported. Along the way, my post prompted a discussion on debian-devel about the state of md5sums, and I set up a daily check to keep track of things.

The next step was to make sure that all md5sum-enabled packages had usable information. It turned out that python-debian choked on 103 packages because of embedded spaces in filenames, and that it also had a blocker bug when used in Python 2.4. I filed two more bugs.

All that remained was the easiest part: write the program to find duplicate files. I did that, and I now have all kinds of funny statistics:

  • The dataset consists of 20170 packages with md5sums, shipping a total of 2069830 files. That gives an average ratio of 102 files per package, excluding conffiles.
  • There are 113732 duplicates in the archive. 1556 files are duplicated more than 10 times, and 14 files are duplicated more than 300 times.
  • The empty file is present 8325 times in the archive, spread over 874 packages. This isn't surprising since it's used for all kinds of purposes like Python's init.py files, Perl's .bs files, etc. I also learned (among other oddities) that the python2.4-doc package ships a few zero-byte .png files. Uh?
  • Also popular is the file with just one newline character in it: 343 occurrences. In the same vein, we have 461 occurrences of the "deny from all\n" file.
  • Most of the hits are for Doxygen images in -dev and -doc packages, namely doxygen.png, tab_b.gif, tab_l.gif, etc (about 350 hits each). In the same category, gjdoc CSS files (149 hits).
  • The partlibrary package is our worst offender. It ships a total of 9680 non-empty files, and only 4833 of them are unique. 6 files are duplicated more than 400 times each in the package.
If you're still reading so far, thanks! You can see the report file, and the program itself. If you want to run it, note that it's optimized for speed rather than memory efficiency; it runs in under 3 minutes but uses up to 1.5GB of memory (my home desktop has 4GB).

The grand conclusion is that all things considered, there is very little file duplication in the archive: the #1 duplicate represents less than 0.4% of the two million analyzed files, and it doesn't actually use any space since it's an empty file... :)

4 comments:

Anonymous said...

Well it certainly uses inodes ;)

Anonymous said...

I like all the statistics, but you left out the one I was most interested in seeing: the amount of wasted space due to duplication, both as an absolute value and as a % of the total archive. Although from your description of the the duplicated files, I'm guessing it will be low.

Romain Francoise said...

It would be easy to do, but it requires getting the sizes from the data tarball, which has a significant CPU cost (it must be uncompressed first).

sc0ttbeardsley said...

You might be interested in a similar project called CDR (checksums done right) that I'm involved with. It's basically tripwire using vendor checksums (but since most debs don't have sha256 we generate them ourselves). We use virtualization to verify a system while it is online.

We currently have support for deb and rpm and we're looking at windows support as well. Ubuntu mainly but it could easily be extended to Debian.

A few stats:

* Architectures i386 and amd64 only
* Centos 4,5 and Ubuntu dapper-gutsy
* 7,033,495 unique files (via sha256)
* 134,134 unique packages

We haven't done much with tracking maintainers but that's a good idea and we'll likely look into some sort of trust system based on the maintainer. More for revocation reasons that actual trust. If you'd like to contribute to collaborate email me (scott at cse ucdavis education).

Post a Comment