Interferences in CVS tagging

Once again, CVS shows its weaknesses. Last night I committed a fix to pkgsrc and, soon after, I noticed an earlier e-mail from Alistair, a member of the PMC and the person responsible for preparing pkgsrc releases, asking developers to stop committing to the tree because he was going to tag it for pkgsrc-2007Q4. It turns out that my fix did not get into the branch because the directory it went into (devel/monotone) had already been tagged. Had I committed the fix to, say, x11/zenity, it would have gone into the branch. Or worse, had I committed a fix that spanned multiple files, some of them would have made it into the branch and others would not.

So what, am I supposed to read e-mail before I can commit? What if the mail does not arrive on time? What if the commit had affected many more directories, some of them already tagged and some not? This is just another example of CVS showing its limitations and stupidities.

Given that each file's history is stored independently (i.e. there are no global changesets), the only way to tag the repository is to go file by file and set the tag on each one. And then you need to check which revision of each file is the one to be tagged. I do not know why this is so slow even when you do an rtag on HEAD (so the server alone does all the work), but in the case of pkgsrc this process took more than 2 hours!

OK, OK, I'm hiding part of the truth. The thing is, there are some ways around this: for example, the tag command will tag the exact revisions you have in your working copy, and passing a date to rtag will tag the repository based on the provided timestamp. This way you ensure that the tagging process is consistent even if people keep committing changes to the tree. However, the former requires a lot of network communication and the latter puts a lot of stress on the server, making the command even slower (or so I've been told).

In virtually all other version control systems that support changesets, a tag is just a name for a given revision identifier, and defining such a tag is a trivial and quick operation. Well, Subversion is rather different because tags are just copies of the tree, but I think it deals with those efficiently.
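For reference, this is roughly what the two CVS workarounds and the Subversion approach look like; the tag name is the real one, but the module name, repository URL, and date are only placeholders:

    # Tag exactly the revisions present in the current working copy
    # (client-driven, hence lots of network traffic):
    cvs tag pkgsrc-2007Q4

    # Or tag server-side, pinning the tag to a timestamp instead of HEAD:
    cvs rtag -D "2008-01-01 00:00 UTC" pkgsrc-2007Q4 pkgsrc

    # For comparison, a Subversion tag is just a copy of the tree:
    svn copy http://example.org/repo/trunk \
             http://example.org/repo/tags/pkgsrc-2007Q4 -m "Tag pkgsrc-2007Q4"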

January 3, 2008 · Tags: <a href="/tags/cvs">cvs</a>
Continue reading (about 2 minutes)

CVS and fragmentation

First of all, happy new year to everybody! I recently got a MacBook Pro and, while this little machine is great overall, the 5400 RPM hard disk is a noticeable performance bottleneck. Many people I've talked to say that the difference between 5400 and 7200 RPM should not be noticeable because:

- These 2.5-inch drives use perpendicular recording, hence storing data with a higher bit density. This means that, theoretically, they can read and write data quickly enough to achieve speeds similar to those of 7200 RPM drives.
- Modern file systems prevent fragmentation, as described here for HFS+.

To me, these two reasons are valid as long as you manage large files: the file system will try to keep them physically close and the disk will be able to transfer sequential data fairly quickly. But unfortunately, these ideas break down when you have to deal with thousands of tiny files (or when you flood the drive with requests from different applications, but that is not what I want to talk about today). The easiest way to demonstrate this is to use CVS to manage a copy of pkgsrc on such a drive.

Let's start by checking out a fresh copy of pkgsrc from the CVS repository. As long as the file system has a lot of free space (and has not been "polluted" by erased files), this will run quite fast because all the new files will be stored physically close together (theoretically in consecutive cylinders). Hence, we take advantage of the higher bit density and the file system's allocation policy. Just after the checkout (or after unpacking a tarball of the tree), run an update (cvs -z3 -q update -dP) and write down the amount of time it takes. In my tests, the update took around 5 minutes, which is a good number; in fact, it is almost the same as what I get on my desktop machine with a 7200 RPM disk.

Now start using pkgsrc by building a "big" package; I've been doing tests with mencoder, which has a bunch of dependencies, and boost, which installs a ton of files. The object files generated during the builds, as well as the resulting files, will be physically stored "after" pkgsrc. It is likely that there will be "holes" on the disk because you'll be removing the work directories but not the installed files, which will result in a lot of files stored non-contiguously. To make things worse, keep using your machine for a couple of days. Then do another update of the whole tree. In my tests, the process now takes around 10 minutes. Yes, double the original measurement. This problem was also present with faster disks, but it was not as noticeable.

But should we blame the drive for such a slowdown, or maybe, just maybe, is it CVS's fault? The pkgsrc repository contains lots of empty directories that were once populated, and CVS does not handle such entries very well. During an update, CVS recreates these empty directories locally and, at the end of the process, erases them provided that you passed the -P (prune) option. Furthermore, every such directory ends up consuming at least 5 inodes on the local disk because it contains a CVS control directory (which typically stores 3 tiny files). This continuous creation and deletion of directories and files fragments the tree by spreading the updated files all around. Honestly, I don't know why CVS works like this (anyone?), but I bet that switching to a superior VCS would mitigate this problem.

A temporary solution can be to use disk images, holding each source tree in its own image and keeping the image's total size as tight as possible. This way one can expect the image to be stored permanently in a contiguous area of the disk.
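A minimal sketch of the disk image idea on Mac OS X could look like this; the image size and volume name are arbitrary, and I'm assuming CVSROOT is already set in the environment:

    # Create a journaled HFS+ image for the tree and mount it:
    hdiutil create -size 4g -fs HFS+J -volname pkgsrc pkgsrc.dmg
    hdiutil attach pkgsrc.dmg

    # Keep the whole tree inside the image so its files stay close together:
    cd /Volumes/pkgsrc
    cvs -q checkout -P pkgsrc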
Oh, and by the way: Boot Camp really suffers from the slow drive because it creates the Windows partition at the end of the disk; that is, on its inner part, which typically has slower access times. (Well, I'm not sure it would make any difference if the partition was created at the beginning.) Launching a game such as Half-Life 2 takes forever; fortunately, once it is up it is fast enough.

Update (January 9th): As "r." kindly points out, the slower part of the disk is the inner one, not the outer one as I had previously written (I had a lapse because CDs are written the other way around). And the reason is this: current disks use Zone Bit Recording (ZBR), a technique that fits a different number of sectors on each track depending on the track's length. Hence, outer (longer) tracks have more sectors allocated to them and can transfer more data in a single disk rotation.
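As a back-of-the-envelope illustration of why that matters (the sector counts below are made up, not measured from any real drive):

    # At 5400 RPM the platter spins 90 times per second; with 512-byte sectors:
    echo "outer track: $((90 * 1200 * 512)) bytes/s"  # ~55 MB/s if it holds 1200 sectors
    echo "inner track: $((90 * 600 * 512)) bytes/s"   # ~28 MB/s if it holds only 600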

January 7, 2007 · Tags: <a href="/tags/cvs">cvs</a>, <a href="/tags/fragmentation">fragmentation</a>, <a href="/tags/performance">performance</a>
Continue reading (about 4 minutes)