31 August 2007

Shared links for 2007-08-31

30 August 2007

Random fact of the day

Over the last five months I have orphaned or given away a total of 12 packages. It feels good.

27 August 2007

More fun with md5sums: collisions

During my experiments with file duplication in the Debian archive, it occurred to me that having a list of files with identical MD5 hashes was a good starting point for finding MD5 collisions in the archive. If any of those files were different but had the same hash as computed by MD5, I'd have a collision.

Unfortunately, checking whether the files differ involves extracting the data tarball of each affected package and computing another hash for the files (I used SHA-1), which takes a while. 3h32m and 47GB of extracted files later, I have the results: there is no MD5 collision in the Debian archive. The chances were slim to none, but at least now I'm sure.

(In fact I did find one occurrence, but it turned out that the file's MD5 hash in DEBIAN/md5sums was incorrect, for some reason.)
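
The check described above can be sketched in Python. This is a minimal illustration, not the script that was actually run: the names `md5_hex`, `sha1_hex` and `find_collisions` are made up, and the input is assumed to be a mapping from file name to contents already read into memory. Files are grouped by MD5, then SHA-1 tells true duplicates apart from collisions.

```python
import hashlib
from collections import defaultdict


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def sha1_hex(data):
    return hashlib.sha1(data).hexdigest()


def find_collisions(files, weak=md5_hex, strong=sha1_hex):
    """Return (weak_digest, names) groups whose contents differ.

    `files` maps a name to its contents as bytes (hypothetical input shape;
    a real run would read the extracted package files from disk).
    """
    by_weak = defaultdict(list)
    for name, data in files.items():
        by_weak[weak(data)].append(name)

    collisions = []
    for digest, names in by_weak.items():
        if len(names) < 2:
            continue  # digest seen only once, nothing to compare
        # Identical contents give identical strong digests, so more than one
        # distinct strong digest in a group means the contents differ.
        if len({strong(files[n]) for n in names}) > 1:
            collisions.append((digest, sorted(names)))
    return collisions
```

An empty result means every group of files sharing an MD5 really is a group of identical files, at least as far as SHA-1 can tell.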

Shared links for 2007-08-27

23 August 2007

Unravel

Today was a beautiful day.

22 August 2007

emacs-snapshot 20070822-1

emacs-snapshot is now taken from the CVS trunk rather than from the Emacs 22 release branch. You may want to put your current version on hold, or use the emacs22 packages from Debian. However, if you feel adventurous and choose to upgrade, you will have:

  • SVG support,
  • Tramp 2.1,
  • isearch in minibuffer history (addictive),
  • M-x kill-matching-buffers (but you should use Ibuffer anyway),
  • mb-depth.el,
  • other minor things.

Shared links for 2007-08-22

21 August 2007

A study of file duplication in the Debian archive

[Long post, sorry. If you're short on time skip to the end for the juicy parts.]

It all started about a week ago when I decided to find out how many files were unique in the whole Debian archive, and how many were duplicated in the same package or in other packages. I had done some work on duplication detection before, and I knew that the process involves getting some kind of hash value of every document, then finding duplicate values in the list of hashes.

I already had a local Debian mirror so the raw data was all there. I basically had two options: either compute the hash value of every file myself from the packages in my mirror, or find some other (ideally faster) way to determine if a file is unique. I quickly decided that the best option was to use the md5sums embedded in most Debian packages (in the DEBIAN/md5sums control file). That would give me MD5 hashes of all regular files, excluding conffiles.

So the first step was to check how many packages have embedded md5sums, and a simple script showed that fewer than 3% lack them. This first check exposed a bug in python-debian, which was duly reported. Along the way, my post prompted a discussion on debian-devel about the state of md5sums, and I set up a daily check to keep track of things.

The next step was to make sure that all md5sum-enabled packages had usable information. It turned out that python-debian choked on 103 packages because of embedded spaces in filenames, and that it also had a blocker bug when used in Python 2.4. I filed two more bugs.

All that remained was the easiest part: write the program to find duplicate files. I did that, and I now have all kinds of funny statistics:

  • The dataset consists of 20170 packages with md5sums, shipping a total of 2069830 files. That gives an average of about 102.6 files per package, excluding conffiles.
  • There are 113732 duplicates in the archive. 1556 files are duplicated more than 10 times, and 14 files are duplicated more than 300 times.
  • The empty file is present 8325 times in the archive, spread over 874 packages. This isn't surprising since it's used for all kinds of purposes like Python's __init__.py files, Perl's .bs files, etc. I also learned (among other oddities) that the python2.4-doc package ships a few zero-byte .png files. Uh?
  • Also popular is the file with just one newline character in it: 343 occurrences. In the same vein, we have 461 occurrences of the "deny from all\n" file.
  • Most of the hits are for Doxygen images in -dev and -doc packages, namely doxygen.png, tab_b.gif, tab_l.gif, etc. (about 350 hits each). In the same category, gjdoc CSS files (149 hits).
  • The partlibrary package is our worst offender. It ships a total of 9680 non-empty files, and only 4833 of them are unique. 6 files are duplicated more than 400 times each in the package.

If you're still reading, thanks! You can see the report file, and the program itself. If you want to run it, note that it's optimized for speed rather than memory efficiency: it runs in under 3 minutes but uses up to 1.5GB of memory (my home desktop has 4GB).
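
As a rough illustration of the duplicate-finding step, here is a minimal Python sketch. It is not the program mentioned above; `parse_md5sums` and `count_duplicates` are names invented for this example. It assumes each md5sums line is a 32-character hex digest, two spaces, then the path, so slicing at fixed offsets keeps filenames with embedded spaces intact.

```python
from collections import Counter


def parse_md5sums(text):
    """Yield (digest, path) pairs from the text of a DEBIAN/md5sums file."""
    for line in text.split("\n"):
        if len(line) < 35:
            continue  # too short to hold a digest, separator and path
        yield line[:32], line[34:]


def count_duplicates(packages):
    """Map each digest seen more than once to its number of occurrences.

    `packages` maps a package name to its md5sums text (hypothetical input;
    a real run would extract that text from each .deb in the mirror).
    """
    counts = Counter()
    for text in packages.values():
        for digest, _path in parse_md5sums(text):
            counts[digest] += 1
    return {digest: n for digest, n in counts.items() if n > 1}
```

Keeping one counter entry per distinct digest is the simple, memory-hungry approach: fast, but the whole table lives in RAM.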

The grand conclusion is that, all things considered, there is very little file duplication in the archive: the #1 duplicate represents about 0.4% of the two million analyzed files, and it doesn't actually use any space since it's an empty file... :)

20 August 2007

Cornflakes Heroes: "Lifeline" video

Cornflakes Heroes (my sister's band) have released the video for their new single Lifeline, check it out:



19 August 2007

bzip2 compression in debs

During my previous adventures with the Debian archive, I found that two packages in the archive use bzip2 compression inside the .deb instead of the traditional gzip compression, so I decided to try it out on emacs-snapshot (one of my larger packages). The combined size of the deb files goes from 36764KB to 33880KB, a 2884KB (7.8%) difference. It also makes both lintian and linda unhappy. The former gives me the following error:

    E: emacs-snapshot-nox: deb-data-member-wrongly-compressed
    N:
    N: The binary package contains a data member not compressed with gzip.
    N: From dpkg-dev 1.11 on, you can configure the way the data tarball is
    N: compressed. Though this is possible, you are not allowed to use it
    N: before dpkg 1.11 (or later) enters stable.

and linda just bombs:

    E: emacs-snapshot-common; Package uses a newer feature of dpkg.
    This package uses a data.tar, or data.tar.bz2 member of the .deb. This
    was introduced in dpkg 1.11, but is not allowed to be used until dpkg
    1.11 or later hits stable.
    File ...3_all.deb failed to process: Level 2 unpacking failed:
    Could not unpack data tarball

Etch was released with dpkg 1.13.25, so bzip2 compression is probably allowed now. But is a 7.8% saving worth the incompatibility price? I'm not sure.
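
To see which compression a given .deb uses without any Debian tooling, one can list the members of its ar container. The following Python sketch assumes the common ar layout (an 8-byte magic string followed by 60-byte member headers, with member data padded to an even length). `ar_member_names` and `data_compression` are illustrative names, not part of dpkg, lintian or linda.

```python
import io

AR_MAGIC = b"!<arch>\n"


def ar_member_names(stream):
    """Yield member names from a seekable binary stream holding an ar archive."""
    if stream.read(len(AR_MAGIC)) != AR_MAGIC:
        raise ValueError("not an ar archive")
    while True:
        header = stream.read(60)
        if len(header) < 60:
            return  # end of archive
        # Bytes 0-15 hold the name (space padded, sometimes with a trailing
        # slash); bytes 48-57 hold the data size as a decimal string.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        yield name
        stream.seek(size + (size % 2), io.SEEK_CUR)  # skip data and padding


def data_compression(names):
    """Guess the data tarball compression from the member names."""
    for name in names:
        if name == "data.tar":
            return "none"
        if name.startswith("data.tar."):
            suffix = name[len("data.tar."):]
            return {"gz": "gzip", "bz2": "bzip2"}.get(suffix, suffix)
    return None  # no data member found
```

With a package opened in binary mode (say `open("some-package.deb", "rb")`, a placeholder file name), `data_compression(ar_member_names(f))` reports how the data member is compressed.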

16 August 2007

Debian packages without md5sums

Random testing of my local Debian mirror shows that 644 binary packages out of 20774 (3.1%) are missing the DEBIAN/md5sums control file. This file (used by debsums) is very useful to check for disk corruption, and even though debsums can generate the data on the fly when a package is installed, it's better to have this information computed at build time.

So if your packages use debhelper, make sure that you have the proper dh_md5sums call in debian/rules. If you don't use debhelper, you'll have to generate the DEBIAN/md5sums control file manually (see dh_md5sums's source for inspiration). If you use a high-level build tool like CDBS, you probably don't have to do anything.
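
For packages built without debhelper, the manual route can be sketched in Python. This is an illustration under stated assumptions rather than a port of dh_md5sums: `build_md5sums` is a made-up name, the package tree is assumed to be staged under a single root directory with its control files in a DEBIAN subdirectory, and conffiles are passed in explicitly so they can be skipped.

```python
import hashlib
import os


def build_md5sums(root, conffiles=()):
    """Return md5sums text for the regular files staged under `root`.

    Skips the top-level DEBIAN directory, symlinks, and any path listed in
    `conffiles` (relative to `root`, e.g. "etc/foo.conf").
    """
    skip = set(conffiles)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root and "DEBIAN" in dirnames:
            dirnames.remove("DEBIAN")  # do not descend into control files
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if rel in skip or os.path.islink(full) or not os.path.isfile(full):
                continue
            # Reads each file whole; fine for a sketch, wasteful for big files.
            with open(full, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            entries.append((rel, digest))
    # One "digest, two spaces, path" line per file, sorted by path.
    return "".join("%s  %s\n" % (digest, rel) for rel, digest in sorted(entries))
```

The returned text would then be written to DEBIAN/md5sums inside the staged tree before the package is built.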

Should I file bugs against the 446 affected source packages? A few maintainers apparently exclude some binary packages on purpose; for example the zsh source package generates md5sums for zsh, zsh-static and zsh-doc, but not for zsh-dev and zsh-dbg...

Update: updated statistics with a run on a full amd64+all mirror.

15 August 2007

Redesign

Back to black and white, font sizes should be more rational, and there's a new blogroll. I got rid of the flickr badge, the hierarchical blogger archive and the blog tagline, although I may add the latter back since the blog name in itself isn't very descriptive.

I'm still partial to text-align: justify;, but I don't know for how long.

14 August 2007

Shared links for 2007-08-14

  • Scenes
    So true.
  • Increasing Virtualization Insanity
    "For sysadmin types this means: do what you have to do with Xen for now. But keep the investments small. For developers this means: don't let yourself be tied to a platform."

13 August 2007

Shared links for 2007-08-13

12 August 2007

I need to put you in your plaaaaaace

Productive week-end on the Debian front: 10 uploads including 3 new upstream versions, 3 newly orphaned packages and 2 packages converted to quilt.

Apart from that, one of my favorite movies. And sukiyaki.

11 August 2007

Awesome



Davey Dance Blog - 28 - NYC MTA - The Sunshine Underground - "Put You In Your Place" from Pheasant Plucker and Vimeo.

10 August 2007

Shared links for 2007-08-10

03 August 2007

Shared links for 2007-08-03

01 August 2007

Converting Debian packages from dpatch to quilt

I've been using quilt a lot at work recently (in a non-Debian environment), and I've been enjoying it very much. So much so that I've decided to convert my Debian packages to it, from dpatch. Once you're used to quilt, using dpatch is almost literally painful. :)

There doesn't seem to be a ready-made guide on how to convert packages from dpatch to quilt, so here's how to do it painlessly in five easy steps:

  1. Install the quilt package (duh) and make sure that the QUILT_PATCHES variable is set to debian/patches in your shell environment. Or you can set it in your ~/.quiltrc file instead (see this post for other interesting settings).
  2. Delete the build dependency on dpatch from your debian/control file, replacing it with a dependency on quilt (>= 0.40).
  3. In your debian/rules file, include /usr/share/quilt/quilt.make instead of /usr/share/dpatch/dpatch.make.
  4. Convert all your dpatch files by using the following command from your package's top level:
    # Import each dpatch file as a quilt patch, then apply it.
    for p in $(dpatch list-all); do
        quilt import -P "$p.diff" "debian/patches/$p.dpatch"
        quilt push
    done
  5. Delete dpatch files:
    rm -f debian/patches/00list debian/patches/*.dpatch

And that's pretty much it! You'll probably want to clean up the headers of your debian/patches/*.diff files since they'll still contain some dpatch markers. You can now build your package as usual. To edit a patch, run quilt push <patch>, then quilt edit <file1> <file2>, and finish with quilt refresh to save the patch.

For more information on how to use quilt, read the tutorial in /usr/share/doc/quilt. You won't regret switching!