Last night I packaged Vino, a VNC server that integrates seamlessly with GNOME. After creating the package, I ran vino-preferences and watched it crash with the following assertion:
assertion "next != 0" failed: file
"/home/jmmv/NetBSD/src/lib/libpthread/pthread_run.c", line 130, function "pthread__next"
Hmm, threading problems... so I started looking at Vino's code to see where the problem could be. I saw it was using threads, but it took me half an hour or so to notice that they were disabled by default. Oh, well. Then I wondered: "What? Threading problems, and the program is not even using threads?"
So I ran the program under GDB ("why didn't he do this earlier?" you say... well, I don't know) and found:
#9 0x48315fe5 in IA__gtk_image_set_from_file (image=0x80eca80, filename=0x8103b00 "/usr/pkg/share/icons/Nuvola/scalable/apps/gnome-lockscreen.svg") at gtkimage.c:842
#10 0x0804c1e3 in vino_preferences_dialog_setup_icons (dialog=0xbfbfea34) at vino-preferences.c:675
Hm... I opened the "Theme selector", switched to another theme (Wasp), and voilà! The problem went away. So the next logical step was to move the offending icon from Nuvola to Wasp and try again. Oops, it crashed. "Is the icon corrupt?" I thought. To verify, I tried replacing it with several other icons (all of them in SVG format), and it kept crashing. Wow! This is where it started to get interesting, because the problem was located somewhere deep in the dependency tree. (Am I a bug addict? ;-)
After almost another hour of debugging, I isolated the problem: a deadlock in gdk-pixbuf. This library is modular, in the sense that image loaders are "external" to it. Some of them are thread-safe (according to some property in their header), but others aren't. In this case, the SVG loader is not, while the PNG one is (this is why an SVG icon made the program crash and a PNG one did not). When the loader is not thread-safe, gdk-pixbuf ensures mutually exclusive access to its functions... and this is where the problem lay.
Vino uses the gtk_image_set_from_file function (as seen in frame #9 above), which in turn calls gdk_pixbuf_animation_new_from_file. This function acquires a global lock using the _gdk_pixbuf_lock function and then calls _gdk_pixbuf_generic_image_load. This other function also calls _gdk_pixbuf_lock while the lock is still held... and there you have it, the deadlock (POSIX mutexes are not recursive by default, unlike Java's monitors, so a thread cannot take again a lock it already owns).
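To make the nested-lock pattern concrete, here is a minimal sketch in plain pthreads. It is not gdk-pixbuf's code: loader_lock, loader_lock_init, inner_load and outer_load are hypothetical names I made up for illustration, and I am assuming an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK) so that the second lock attempt returns EDEADLK instead of hanging or tripping an assertion.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a library-wide global lock. */
static pthread_mutex_t loader_lock;

/* Create the lock as an error-checking mutex, so that relocking by the
 * owning thread is reported as EDEADLK rather than blocking forever. */
static void loader_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&loader_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* Inner routine: takes the global lock on its own. */
static int inner_load(void)
{
    int rc = pthread_mutex_lock(&loader_lock);
    if (rc != 0)
        return rc;  /* EDEADLK if this thread already owns the lock */
    /* ... the non-thread-safe loader would run here ... */
    pthread_mutex_unlock(&loader_lock);
    return 0;
}

/* Outer routine: takes the lock, then calls the inner routine while
 * still holding it. This is the problematic nesting. */
static int outer_load(void)
{
    int rc = pthread_mutex_lock(&loader_lock);
    if (rc != 0)
        return rc;
    rc = inner_load();  /* second lock attempt by the same thread */
    pthread_mutex_unlock(&loader_lock);
    return rc;
}
```

After calling loader_lock_init(), outer_load() fails with EDEADLK, while calling inner_load() directly succeeds. With a default (non-error-checking) mutex the behaviour of the second lock is not something you can rely on; it may simply block forever.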
I fixed this issue by narrowing the critical region in the gdk_pixbuf_animation_new_from_file function. And this morning I decided to submit this back to the authors (while GNOME's anonymous CVS was not working for me), in bug #162999.
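For illustration only, narrowing a critical region generally looks like the sketch below. This is not the actual patch: sketch_lock, probes_done, images_loaded, inner_load_locked and outer_load_narrowed are hypothetical names, and the "work" is reduced to counters. The idea is just that the outer function drops the lock before calling anything that takes it again.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical global lock and counters standing in for real work. */
static pthread_mutex_t sketch_lock;
static int probes_done;
static int images_loaded;

/* Error-checking mutex, so an accidental relock shows up as EDEADLK. */
static void sketch_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&sketch_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* Inner routine: does its own locking, as before. */
static int inner_load_locked(void)
{
    int rc = pthread_mutex_lock(&sketch_lock);
    if (rc != 0)
        return rc;
    images_loaded++;
    pthread_mutex_unlock(&sketch_lock);
    return 0;
}

/* Outer routine with a narrowed critical region: the lock covers only
 * the part that needs it and is released before the nested call. */
static int outer_load_narrowed(void)
{
    int rc = pthread_mutex_lock(&sketch_lock);
    if (rc != 0)
        return rc;
    probes_done++;                      /* work that needs the lock */
    pthread_mutex_unlock(&sketch_lock); /* release before nesting */

    return inner_load_locked();         /* takes the lock by itself */
}
```

The alternative would be a recursive mutex (PTHREAD_MUTEX_RECURSIVE), but keeping each critical region small and non-nested avoids changing the lock type for every caller.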
When the CVS server came back online... I saw that the problem had already been fixed in HEAD! Heh, well. I spent quite a bit of time filing a bug that I could have avoided... but, anyway, I'd still have spent a lot of time tracking down the problem.