2010-03-04

A space-efficient alternative to db and cdb

Many applications require a key,value database, but the data itself is (mostly) constant. General purpose hash databases are inefficient for this purpose as they waste a lot of space and often also require more complex queries. The most well-known constant database is from DJB, but it is a number of limitations.The biggest issues of DJB's implementation concern the large overhead (24 Bytes / record) and the inefficient hash function. Bob Jenkins has a comparison.

To work around this, I started to design a new constant database format. As a building block I used the CHM algorithm for a minimal perfect hash function. An implementation for that already existed in nbperf(1). I decided to use a separate offset table as it provides a natural encoding of the entry size and makes it easier to allow multiple keys for the same value.

To minimize the storage I decided against handling key matching in the library. The pay load for many of the intended uses already contains the key, so storing it separately would waste space. The perfect hashing ensures that only a single entry has to be matched, so the overhead is generally small, too.

The result is quite impressive. For the /etc/service database, the cdb version is 304KB, the original db(3) database 2.1MB (4.2MB file size, but sparse). The run time of services_mkdb dropped from 2.2s before to 0.24s afterwards. For terminfo, the cdb version is 1.3MB compared to 2.1MB for db(3) (2.8MB file size, but sparse). The run time of tic -x dropped from 1.5s to 0.4s.

Code size is quite small as well, the reader itself is 5.2KB code and the writer 15KB. Compiled for AMD64, the reader is 1.2KB and the writer 3.6KB.

2010-03-01

New Developers NetBSD.org News 2010-03-01 00:00 UTC

2010-02-27

The NetBSD runtime linker now has a negative symbol cache. In a nutshell, this has reduced the startup time of the Evolution mail client from around 5 minutes to 3 seconds on my QuadCore amd64 machine. Not many applications have a lot of plugins with a large amount of links to external libraries and I doubt many other applications will gain such a drastic speed bump, but the GNOME desktop as a whole now loads small bit quicker. I would imagine that KDE will now load faster as well.

2010-02-26

Hackathon 13 Results

Update! I'm editing the text of this post because it makes me sound a lot more important than I was. :)

I managed to participate a little bit in the hackathon, replied to a few PR's trying to get some feedback/closure, so I'm pleased.

Thanks to HUGE efforts by hardcore developers, GNATs shows a really great result!

GNATS Bug Database Summary
State Count (before) Count (after)
open 4418 4322
analyzed 155 155
feedback 200 225
suspended 61 60
dead 6 11
closed 35081 35180
TOTAL 39921

2010-02-25

What is left before USE_DESTDIR=yes can be the default for everyone?

DESTDIR support is one of the biggest internal changes since the introduction of package barrier and it has some user visible consequences.
What is left to do before it can be made the default to keep the impact as small as possible?

  • Do another pass to ensure that the unprivileged user doesn't leak into the binary package in a functional role. This is hard to enforce automatically as a large number of packages leak it e.g. to annotate who build the package (compiled Emacs modules for example).
  • Make it possible for USERGROUP_PHASE to be set to install, so that it is invoked as su-target from pre-install. I'm not sure how to best hunt down the packages that need this. Building a list of packages that use PKG_USERS or PKG_GROUPS is not difficult, but how to check which of those fail? The best idea so far is a modified client-clean script for pbulk, that sets /etc/passwd and /etc/group to a prestine state and see what fails.
  • Add an option for pbulk to allow unprivileged bulk builds without calling su. At the same time, make su actually configurable. An idea is to add a check in pbulk.conf for `whoami` = "root" and set su to "shift; exec $SHELL $*" .
  • Add a warning about DEPENDS_TARGET=package / UPDATE_TARGET=package. This is something where I am definitely willing to break the interface.

A new name for a build phase, a new behavior for DESTDIR users and some black magic.

When the DESTDIR support was originally introduced 3.5 years ago, I tried to keep it as non-intrusive in the infrastructure as possible.
The reason was simply, that it is very easy to introduce bugs in the complex make rules and DESTDIR support would need quite a bit of work before it is ready for main stream consumption. My initial goal for 2006Q4 was 50% or so I wrote in my EuroBSDCon slides. I was waaaay too optimistic.

Fast forward 3 years to the current time. DESTDIR support is now supported by almost all packages. 384 package locations without DESTDIR support remains (compared to 8302 with). To avoid unnecessary regressions from developers not testing it, I made USE_DESTDIR=yes the default for PKG_DEVELOPER a while ago. That's when the complains and insults started.

Due to the change, "make install" now suddenly doesn't modify /usr/pkg any longer. The other direct consequence is that "DEPENDS_TARGET=package" breaks as well.
The former is reasonable to fix by finally pushing the originally planed intrusive changes. The latter is more difficult to do without second guessing and more a case of Update your config, please. So, what was needed to change the install target to behave the same from a user perspective?

First of all, a bit of search and replace. Second, dealing with the surprises. Just renaming install to stage-install -- fails. The install-vars target has be renamed as well, as the necessary files are created. More fun was the addition of the new target. Just adding

.PHONY: install
.if ${_USE_DESTDIR} == "no"
install: stage-install
.else
install: package-install
.endif

...doesn't work. It surprisingly does nothing. What happened?

pkgsrc computes a number of variables based on the dependencies. For example, BUILDLINK_PREFIX depends on whether the package is builtin and if not, where it is found. This got quite expensive when pkgviews was introduced. A long time ago, pkgsrc therefore run a new make instance for every major build phase (depends, extract, build, install). This was changed later with the introduction of the pkgsrc barrier. Essentially, the build phases are separated into pre-barrier and post-barrier phases, with a separate make invocation in between.

How does this affect the install target above? install has to be a post-barrier operation otherwise it ends up effectively invocing build, but not the desired install rule.
Black magic. Why do I know? The same problem occured with package-install a long time ago...

pbulk includes logic to avoid rebuilding packages that haven't changed based on the RCS ID ($NetBSD$).
A look at some problem cases.

pbulk contains a script to determine whether a package has to be rebuilt or not. There are three checks performed by default:

  • The RCS IDs recorded in +BUILD_INFO match the RCS IDs in the pkgsrc tree.
  • The names of the packages required for the build match the recorded set in the package.
  • The package is newer than all the packages required for the build.

The third condition is automatically valid after a bulk build, otherwise the system has time keeping or file system issues. The other conditions are more interesting.

The first condition can trigger a permanent rebuild if files are formed in a way that the +BUILD_INFO processor extracts incomplete or additional RCS IDs. The rdigest package for example had a patch starting with

@@ -41,8 +41,17 @@ __RCSID("$NetBSD: digest.c,v 1.15 2007/0

As the RCS IDs are extracted with a simple grep expression, this ended up overwriting the original RCS ID at the top.

Another example was url2pg, which contained the following statement in a local file:

      print PLIST ("\@comment \$NetBSD\$\n");

This is picked up too and should be escape, in this case by splitting the string into two parts in the middle of NetBSD.

Interestingly, the second condition is violated in one case as well. p5-DBIx-Class-EncodedColumn depends on p5-Digest-SHA. The version of p5-Digest-SHA is higher than the version of the Perl package, so the dependency resolver picks up that. During the build, pkgsrc decides that perl is already present and good enough, so skips the dependency. I'm not sure how to best address this yet.

2010-02-01

New Developers NetBSD.org News 2010-02-01 00:00 UTC

Feeds

Feed RSS Last fetched Next fetched after
#NetBSD Community Blog XML 2010-03-11 11:45 2010-03-11 13:45
bsdtalk XML 2010-03-11 11:45 2010-03-11 13:45
freshmeat.net Releases XML 2010-03-11 11:45 2010-03-11 13:45
hubertf's NetBSD blog XML 2010-03-11 11:45 2010-03-11 13:45
i summon one kim XML 2010-03-11 11:45 2010-03-11 13:45
Implementality XML 2010-03-11 11:45 2010-03-11 13:45
Jeremy C. Reed's blog XML 2010-03-11 11:45 2010-03-11 13:45
Matthew Sporleder's website XML 2010-03-11 11:45 2010-03-11 13:45
NetBSD Blog XML 2010-03-11 11:45 2010-03-11 13:45
NetBSD PXE Bulk Install Project XML 2010-03-11 11:45 2010-03-11 13:45
NetBSD.org News XML 2010-03-11 11:45 2010-03-11 13:45
Nifelheim Tech-Blog XML 2010-03-11 11:45 2010-03-11 13:45
OSNews XML 2010-03-11 11:45 2010-03-11 13:45
Ours & Hippy — le blog XML 2010-03-11 11:45 2010-03-11 13:45
Seebach Exhibit 7 XML 2010-03-11 11:45 2010-03-11 13:45
The Julipedia: Blog XML 2010-03-11 11:45 2010-03-11 13:45
unsigned long geek = random(); XML 2010-03-11 11:45 2010-03-11 13:45
What Do You Want? XML 2010-03-11 11:45 2010-03-11 13:45
www.sonnenberger.org Blog: All XML 2010-03-11 11:45 2010-03-11 13:45