- Python 3000 alpha 1 Released!
And Debian still has good ol' Python 2.4...
- Thoughts after one year on the road
31 August 2007
Shared links for 2007-08-31
18:12 in links, planet -- 0 comments
30 August 2007
Random fact of the day
Over the last five months I have orphaned or given away a total of 12 packages. It feels good.
18:56 in debian, planet, tech -- 0 comments
27 August 2007
More fun with md5sums: collisions
During my experiments with file duplication in the Debian archive, it occurred to me that having a list of files with identical MD5 hashes was a good starting point for finding MD5 collisions in the archive. If any of those files were different but had the same hash as computed by MD5, I'd have a collision.
Unfortunately, checking whether the files differ involves extracting the data tarball of each affected package and computing another hash for the files (I used SHA-1), which takes a while. 3h32m and 47GB of extracted files later, I now have the results and there is no MD5 collision in the Debian archive. The chances were slim to none, but at least now I'm sure.
(In fact I did find one occurrence, but it turned out that the file's MD5 hash in DEBIAN/md5sums was incorrect, for some reason.)
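The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the run: the function names `file_digest` and `find_md5_collisions` are made up for this example, and the sketch recomputes MD5 from the extracted files instead of reading it from DEBIAN/md5sums.

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_md5_collisions(paths):
    """Return (md5, [paths]) groups whose files share an MD5 but differ in SHA-1."""
    # Group every file by its MD5 hash (hypothetical helper, not the original script).
    by_md5 = defaultdict(list)
    for path in paths:
        by_md5[file_digest(path, "md5")].append(path)

    collisions = []
    for md5, group in by_md5.items():
        if len(group) < 2:
            continue  # a hash seen once cannot collide with anything
        # Identical files have identical SHA-1 hashes; more than one
        # distinct SHA-1 in a group means different content, same MD5.
        sha1s = {file_digest(p, "sha1") for p in group}
        if len(sha1s) > 1:
            collisions.append((md5, sorted(str(p) for p in group)))
    return collisions
```

An empty result means that every group of files with the same MD5 also agrees on SHA-1, which is the outcome reported above for the archive.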
17:39 in debian, planet, tech -- 3 comments
23 August 2007
22 August 2007
emacs-snapshot 20070822-1
emacs-snapshot is now taken from the CVS trunk rather than from the Emacs 22 release branch. You may want to put your current version on hold, or use the emacs22 packages from Debian. However, if you feel adventurous and choose to upgrade, you will have:
- SVG support,
- Tramp 2.1,
- isearch in minibuffer history (addictive),
- M-x kill-matching-buffers (but you should use Ibuffer anyway),
- mb-depth.el,
- other minor things.
19:16 in planet, tech -- 6 comments
21 August 2007
A study of file duplication in the Debian archive
[Long post, sorry. If you're short on time skip to the end for the juicy parts.]
It all started about a week ago when I decided to find out how many files were unique in the whole Debian archive, and how many were duplicated in the same package or in other packages. I had done some work on duplication detection before, and I knew that the process involves getting some kind of hash value of every document, then finding duplicate values in the list of hashes.
I already had a local Debian mirror so the raw data was all there. I basically had two options: either compute the hash value of every file myself from the packages in my mirror, or find some other (ideally faster) way to determine if a file is unique. I quickly decided that the best option was to use the md5sums embedded in most Debian packages (in the DEBIAN/md5sums control file). That would give me MD5 hashes of all regular files, excluding conffiles.
So the first step was to check how many packages have embedded md5sums, and a simple script showed that less than 3% don't have them. This first check exposed a bug in python-debian, which was duly reported. Along the way, my post prompted a discussion on debian-devel about the state of md5sums, and I set up a daily check to keep track of things.
The next step was to make sure that all md5sum-enabled packages had usable information. It turned out that python-debian choked on 103 packages because of embedded spaces in filenames, and that it also had a blocker bug when used with Python 2.4. I filed two more bugs.
All that remained was the easiest part: write the program to find duplicate files. I did that, and I now have all kinds of funny statistics:
- The dataset consists of 20170 packages with md5sums, shipping a total of 2069830 files. That gives an average of about 103 files per package, excluding conffiles.
- There are 113732 duplicates in the archive. 1556 files are duplicated more than 10 times, and 14 files are duplicated more than 300 times.
- The empty file is present 8325 times in the archive, spread over 874 packages. This isn't surprising since it's used for all kinds of purposes like Python's __init__.py files, Perl's .bs files, etc. I also learned (among other oddities) that the python2.4-doc package ships a few zero-byte .png files. Uh?
- Also popular is the file with just one newline character in it: 343 occurrences. In the same vein, we have 461 occurrences of the "deny from all\n" file.
- Most of the hits are for Doxygen images in -dev and -doc packages, namely doxygen.png, tab_b.gif, tab_l.gif, etc. (about 350 hits each). In the same category are the gjdoc CSS files (149 hits).
- The partlibrary package is our worst offender. It ships a total of 9680 non-empty files, and only 4833 of them are unique. 6 files are duplicated more than 400 times each in the package.
The grand conclusion is that all things considered, there is very little file duplication in the archive: the #1 duplicate represents about 0.4% of the two million analyzed files, and it doesn't actually use any space since it's an empty file... :)
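For readers who want to try something similar, here is a minimal sketch of the duplicate count, assuming the contents of each package's DEBIAN/md5sums file have already been read into a dictionary keyed by package name. The names `parse_md5sums` and `duplicate_counts` are invented for this example; it does not use python-debian and is not the program that produced the statistics above.

```python
from collections import Counter


def parse_md5sums(text):
    """Yield (md5, filename) pairs from md5sum-style lines.

    Splitting only on the first run of whitespace keeps filenames
    that contain spaces intact.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        md5, filename = line.split(None, 1)
        yield md5, filename


def duplicate_counts(md5sums_by_package):
    """Map each MD5 seen more than once to its number of occurrences."""
    counts = Counter()
    for text in md5sums_by_package.values():
        counts.update(md5 for md5, _ in parse_md5sums(text))
    return {md5: n for md5, n in counts.items() if n > 1}
```

Counting hashes rather than comparing files pairwise keeps the work linear in the number of files, which matters with about two million entries.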
19:09 in debian, planet, tech -- 4 comments
20 August 2007
Cornflakes Heroes: "Lifeline" video
Cornflakes Heroes (my sister's band) have released the video for their new single Lifeline, check it out:
(click here if the embedded player doesn't work)
16:38 in life, planet -- 0 comments
19 August 2007
bzip2 compression in debs
During my previous adventures with the Debian archive, I found that two packages in the archive use bzip2 compression inside the .deb instead of the traditional gzip compression, so I decided to try it out on emacs-snapshot (one of my larger packages). The combined size of the deb files goes from 36764KB to 33880KB, a 2884KB (7.8%) difference. It also makes both lintian and linda unhappy, the former gives me the following error:
E: emacs-snapshot-nox: deb-data-member-wrongly-compressed
N:
N: The binary package contains a data member not compressed with gzip.
N: From dpkg-dev 1.11 on, you can configure the way the data tarball is
N: compressed. Though this is possible, you are not allowed to use it
N: before dpkg 1.11 (or later) enters stable.
and linda just bombs:
E: emacs-snapshot-common; Package uses a newer feature of dpkg.
This package uses a data.tar, or data.tar.bz2 member of the .deb. This
was introduced in dpkg 1.11, but is not allowed to be used until dpkg
1.11 or later hits stable.
File ...3_all.deb failed to process: Level 2 unpacking failed:
Could not unpack data tarball
Etch was released with dpkg 1.13.25, so bzip2 compression is probably allowed now. But is a 7.8% saving worth the incompatibility price? I'm not sure.
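A .deb file is an ar archive, and the name of its data member (data.tar, data.tar.gz or data.tar.bz2) shows which compression was used. The sketch below reads the ar member headers directly to find that name. The function names `deb_member_names` and `data_compression` are hypothetical, and the parser assumes the common ar layout of an 8-byte magic string followed by 60-byte member headers with short names only.

```python
AR_MAGIC = b"!<arch>\n"
AR_HEADER_SIZE = 60


def deb_member_names(data):
    """Return the member names of an ar archive given as bytes."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    names = []
    offset = len(AR_MAGIC)
    while offset + AR_HEADER_SIZE <= len(data):
        header = data[offset:offset + AR_HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("bad ar member header")
        # Bytes 0-15 hold the name, bytes 48-57 the size in decimal.
        name = header[0:16].decode("ascii").rstrip(" ").rstrip("/")
        size = int(header[48:58].decode("ascii").strip())
        names.append(name)
        # Member data is padded to an even length.
        offset += AR_HEADER_SIZE + size + (size % 2)
    return names


def data_compression(names):
    """Guess the data tarball compression from the member names."""
    known = {"": "none", ".gz": "gzip", ".bz2": "bzip2"}
    for name in names:
        if name.startswith("data.tar"):
            return known.get(name[len("data.tar"):], "unknown")
    return None  # no data member found
```

Calling `data_compression(deb_member_names(open(path, "rb").read()))` on a package file would then report "gzip" for a traditional .deb and "bzip2" for one built like the experiment above.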
17:15 in debian, planet, tech -- 5 comments
16 August 2007
Debian packages without md5sums
Random testing of my local Debian mirror shows that 644 binary packages out of 20774 (3.1%) are missing the DEBIAN/md5sums control file. This file (used by debsums) is very useful to check for disk corruption, and even though debsums can generate the data on the fly when a package is installed, it's better to have this information computed at build time.
So if your packages use debhelper, make sure that you have the proper dh_md5sums call in debian/rules. If you don't use debhelper, you'll have to generate the DEBIAN/md5sums control file manually (see dh_md5sums's source for inspiration). If you use a high-level build tool like CDBS, you probably don't have to do anything.
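For the manual case, the following is a rough sketch of what generating the file involves. It is not a port of dh_md5sums: the function name `build_md5sums` is invented here, and the sketch assumes the usual md5sum line format (hash, two spaces, path relative to the package root), with the DEBIAN directory, symlinks and conffiles left out.

```python
import hashlib
import os


def build_md5sums(package_root):
    """Return md5sums-style lines for regular files under package_root."""
    # Paths listed in DEBIAN/conffiles are skipped (assumption: one
    # absolute path per line).
    conffiles = set()
    conffiles_path = os.path.join(package_root, "DEBIAN", "conffiles")
    if os.path.exists(conffiles_path):
        with open(conffiles_path) as f:
            conffiles = {line.strip().lstrip("/") for line in f if line.strip()}

    entries = []
    for dirpath, dirnames, filenames in os.walk(package_root):
        if dirpath == package_root and "DEBIAN" in dirnames:
            dirnames.remove("DEBIAN")  # do not descend into control files
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # only regular files are hashed
            rel = os.path.relpath(full, package_root)
            if rel in conffiles:
                continue
            with open(full, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            entries.append((rel, digest))
    # Sort by path so the output is stable between builds.
    return "".join("%s  %s\n" % (digest, rel) for rel, digest in sorted(entries))
```

The returned string would be written to DEBIAN/md5sums in the package build directory before the .deb is assembled.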
Should I file bugs against the 446 affected source packages? A few maintainers apparently exclude some binary packages on purpose; for example the zsh source package generates md5sums for zsh, zsh-static and zsh-doc, but not for zsh-dev and zsh-dbg...
Update: updated statistics with a run on a full amd64+all mirror.
21:54 in debian, planet, tech -- 6 comments
15 August 2007
Redesign
Back to black and white, font sizes should be more rational, and there's a new blogroll. I got rid of the flickr badge, the hierarchical blogger archive and the blog tagline, although I may add the latter back since the blog name in itself isn't very descriptive.
I'm still partial to text-align: justify;, but I don't know for how long.
14:16 in meta -- 0 comments
14 August 2007
Shared links for 2007-08-14
- Scenes
So true.
- Increasing Virtualization Insanity
"For sysadmin types this means: do what you have to do with Xen for now. But keep the investments small. For developers this means: don't let yourself be tied to a platform."
22:46 in links, planet -- 0 comments
13 August 2007
Shared links for 2007-08-13
- The Process Process
I wish I didn't know what this comic is about, but unfortunately, I do.
- Geeky Matryoshka Dolls: From Bit to Terabyte
20:58 in links, planet -- 0 comments
12 August 2007
I need to put you in your plaaaaaace
Productive weekend on the Debian front: 10 uploads including 3 new upstream versions, 3 newly orphaned packages and 2 packages converted to quilt.
Apart from that, one of my favorite movies. And sukiyaki.
19:42 in life, planet, tech -- 0 comments
11 August 2007
Awesome
Davey Dance Blog - 28 - NYC MTA - The Sunshine Underground - "Put You In Your Place" from Pheasant Plucker and Vimeo.
21:56 in life -- 1 comment
10 August 2007
03 August 2007
Shared links for 2007-08-03
- Waiting for Omer
Every day Iraq sinks a little bit deeper into horror...
- On the Other Hand: The Flip Side of Entrepreneurship
"In the early days, start-ups focus on how great it’s going to be when they succeed; but the moment they do, they start talking about how great it was before they did."
- elevator camera obscura
Neat, check out the video.
- Mozilla Says “Ten Fucking Days”
- Your browser is a tcp/ip relay
- Packet Geeks Gone WWWild
Lots of interesting stuff going on at Blackhat 2007, wish I had been there.
20:22 in links, planet -- 0 comments
01 August 2007
Converting Debian packages from dpatch to quilt
I've been using quilt a lot at work recently (in a non-Debian environment), and I've been enjoying it very much. So much that I've decided to convert my Debian packages to it, from dpatch. Once you're used to quilt, using dpatch is almost literally painful. :)
There doesn't seem to be a ready-made guide on how to convert packages from dpatch to quilt, so here's how to do it painlessly in five easy steps:
- Install the quilt package (duh) and make sure that the QUILT_PATCHES variable is set to debian/patches in your shell environment. Or you can set it in your ~/.quiltrc file instead (see this post for other interesting settings).
- Delete the build dependency on dpatch from your debian/control file, replacing it with a dependency on quilt (>= 0.40).
- In your debian/rules file, include /usr/share/quilt/quilt.make instead of /usr/share/dpatch/dpatch.make.
- Convert all your dpatch files by using the following command from your package's top level:
for p in $(dpatch list-all); do \
    quilt import -P "$p.diff" "debian/patches/$p.dpatch"; \
    quilt push; \
done
- Delete dpatch files:
rm -f debian/patches/00list debian/patches/*.dpatch
For more information on how to use quilt, read the tutorial in /usr/share/doc/quilt. You won't regret switching!
19:20 in planet, tech -- 0 comments