Blaisorblade programming thoughts: July 2010

Wednesday, July 28, 2010

Automatic race detection

I wanted to recommend a cool Google tool, based on Valgrind, which performs automatic race detection:

http://code.google.com/p/data-race-test/wiki/ThreadSanitizer

DRD and Helgrind (both part of Valgrind) also exist, but more often than not I've seen Helgrind described as "not very stable" (and it may still be a bit experimental). ThreadSanitizer, by contrast, seems to be in active development and use, and it is supported by Google. It even has Vim integration!
I was reminded of it during a discussion about something similar to race detection, in a context halfway between Python and share-nothing concurrency: http://codespeak.net/pipermail/pypy-dev/2010q3/006037.html
I hope to get your comments on this tool.

Tuesday, July 27, 2010

Pretending you know what you're talking about - a book review

Some books in Computer Science seem to do just that: they pretend to know what they are talking about when in fact they do not.

A quote on this, not restricted to Java books:

Most Java books are written by people who couldn't get a job as a Java programmer (since programming almost always pays more than book writing; I know because I've done both). These books are full of errors, bad advice, and bad programs.

Peter Norvig, Director of Research at Google, Inc.

I have seen such books myself. But they do not get great reviews on Amazon and they do not get slashdotted.

Instead, Virtual Machine Design and Implementation in C/C++ does. Its author is great at pretending he knows what he is talking about. I started looking at it for my seminar on Virtual Machines, hoping to make some use of it, and I was deeply disappointed. Judging from its table of contents and index, Virtual Machines: Versatile Platforms for Systems and Processes seems to be a better book on the topic, and it also covers system virtualization (a different topic with the same name); however, I cannot really judge.
Virtual Machines, by Iain D. Craig, seems instead devoted to semantic issues, and I am not qualified to judge that topic, only to say it is different.
Back to the first book: after reading the sample pages from the Amazon preview (mostly the table of contents and the index), together with all the Amazon reviews, I realized what is happening here.

So, the rest of this post is a (meta-?)review against this book - which is much less interesting, unless you were actually considering buying it.

The author does know what he is talking about, and spent lots of time polishing it; he is just totally unaware that it is completely unrelated to the title of the book, since he has no experience in the field. Reading the introduction after reading the above quote is enlightening: the author mentions being poor (page xvii), and his experience (described on page iv), like writing CASE tools in Java, is totally unrelated. If he couldn't get a better job, I'd say Norvig's quote is exactly about him. And after reading this, his mention of when "he used to disable paging and run on pure SDRAM" smells of a lamer wanting to show off (in other contexts it could be just a joke, I know).

The author is just trying to teach himself how to implement an emulator, and writing a diary about it. If you care about real Virtual Machines (Parrot, the JVM, .NET, and so on), you need entirely different material. Say, JIT compilation. Other reviews mention some more points which are missing, but none of the reviewers had a real introduction to the field, so they are not aware of how much else is not in the book. Actually, maybe he knows how a VM looks from the outside, but his attempts to move in that direction (like support for multithreading) look quite clumsy - he talks about that and then does not manage to implement it.

Finally, the author seems to be an assembler programmer who is writing assembler in C++. As the famous saying goes, "the determined Real Programmer can write Fortran programs in any language", and it is still true of assembler. Things like manual loop unrolling on debug statements (mentioned in reviews) are quite funny.

In the end, I'd recommend this book to nobody - it might contain some interesting stuff about its actual topic, as acknowledged by some reviews, but I would not buy a book because of this hope. Least of all would I recommend it to anyone who cares about Virtual Machines.

Monday, July 26, 2010

JVM vs .NET CLR - a comparison between the VMs

I was reading a Java vs. F# comparison in this post, and it ended up comparing the JVM vs. the CLR. It also tries to counter some points of a post by Cliff Click, but it does so in a bad way. That's why I try here to improve on that comparison.

The three real limitations of the JVM, compared to the CLR, are the lack of:

tail calls (being addressed), which are important for functional languages running on top of the JVM, like Clojure and Scala;
value types: I was happy to read in that post that there is interest in them, and I had already read about this in Guy Steele's Growing a Language;
proper generics, without erasure: the problems were well known when generics were introduced, but backward compatibility with binary libraries forced the current solution, so this one is not going to be solved.
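To make the erasure point concrete, here is a minimal Java sketch of what it means at run time. The class and method names (ErasureDemo, sameRuntimeClass, heapPollutionDetectedLate) are made up for this illustration. It only shows two visible consequences of erasure; it says nothing about how the CLR implements reified generics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: illustrates erasure, not any particular library API.
public class ErasureDemo {

    // After erasure both lists have the very same runtime class:
    // the element type exists only at compile time.
    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        return strings.getClass() == ints.getClass();
    }

    // Through a raw reference an Integer can be stored in a List<String>.
    // Nothing fails at insertion; the error appears only when the element
    // is read back and the compiler-inserted cast to String is executed.
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static boolean heapPollutionDetectedLate() {
        List<String> strings = new ArrayList<>();
        List raw = strings;   // raw view of the same list object
        raw.add(42);          // accepted: there is no element type to check against
        try {
            String s = strings.get(0); // the cast to String fails here
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("same runtime class: " + sameRuntimeClass());
        System.out.println("error detected only on read: " + heapPollutionDetectedLate());
    }
}
```

With reified generics the two lists would have distinct runtime types, and the bad insertion could be rejected where it happens.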

As an additional point, it seems to me that since Java and the JVM are not managed by a single company, but through the Java Community Process, the addition of new features like these is much slower (but hopefully they are better designed).

Combining generics with value types would allow great memory (and thus performance) savings: one could define Pair as a value type and then use an ArrayList<Pair<K, V>> as the backing storage for a hash table and pay no space overhead.
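Since the JVM has neither feature, the closest one can get today is to flatten the pairs by hand. The sketch below is only an illustration under assumptions of mine: the class FlatIntMap is hypothetical, it is specialized to int keys and values, it uses linear probing, and it never grows. Two parallel primitive arrays play the role that a flat array of Pair values would play, so there is no per-entry object header and no pointer per entry.

```java
// Hypothetical, fixed-capacity int -> int hash table with linear probing.
// keys[i] and values[i] together emulate one unboxed Pair stored in slot i.
public class FlatIntMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;   // marks which slots hold an entry

    public FlatIntMap(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Inserts or overwrites; throws if every slot is taken by other keys.
    public void put(int key, int value) {
        int i = Math.floorMod(key, keys.length);
        for (int probes = 0; probes < keys.length; probes++) {
            if (!used[i] || keys[i] == key) {
                used[i] = true;
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length;   // linear probing: try the next slot
        }
        throw new IllegalStateException("table full");
    }

    // Returns the stored value, or fallback if the key is absent.
    public int getOrDefault(int key, int fallback) {
        int i = Math.floorMod(key, keys.length);
        for (int probes = 0; probes < keys.length && used[i]; probes++) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return fallback;
    }
}
```

The price is that this has to be rewritten for every combination of key and value types, which is exactly the boilerplate that generics over value types would remove.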

An additional point, for dynamic languages: the lack of an invokedynamic primitive creates significant performance problems - for instance, Jython (a JITted language) is 2x slower than CPython (an interpreted language with a slow interpreter). Anecdotal evidence suggests that the lack of support for inline caching is an important reason: namely, a reimplementation of Python on top of the JVM for a course project, which allowed inline caching, was much faster.
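For readers who have not met inline caching, here is a rough sketch of the idea in plain Java. It is a toy model, not how Jython, the course project, or invokedynamic actually work; DynClass, DynObject and CallSite are names invented for the example. Each call site remembers the class of the last receiver and the method it resolved to, so repeated calls on objects of the same class skip the name lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical toy model of a dynamic language runtime.
public class InlineCacheDemo {

    // A dynamic "class": a table from method name to implementation.
    public static final class DynClass {
        public final Map<String, Function<DynObject, Object>> methods = new HashMap<>();
    }

    public static final class DynObject {
        public final DynClass cls;
        public DynObject(DynClass cls) { this.cls = cls; }
    }

    // One CallSite object per call site in the interpreted program.
    // This is a monomorphic cache: it remembers a single receiver class.
    // Simplification: it is never invalidated if a method table is changed later.
    public static final class CallSite {
        private final String name;
        private DynClass cachedClass;
        private Function<DynObject, Object> cachedTarget;
        public int slowLookups = 0;   // counts cache misses, for illustration

        public CallSite(String name) { this.name = name; }

        public Object invoke(DynObject receiver) {
            if (receiver.cls != cachedClass) {
                // Slow path: look the method up by name, then fill the cache.
                slowLookups++;
                Function<DynObject, Object> target = receiver.cls.methods.get(name);
                if (target == null) {
                    throw new IllegalArgumentException("no method named " + name);
                }
                cachedClass = receiver.cls;
                cachedTarget = target;
            }
            // Fast path: one pointer comparison, then a direct call.
            return cachedTarget.apply(receiver);
        }
    }
}
```

A JIT that understands this pattern can go further and inline the cached target behind the class check, which is where most of the speedup comes from.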

As for the JVM and CLR JITs, and the quality of the existing implementations: Cliff Click mentioned old anecdotal evidence of the CLR JIT being slower, because .NET is geared towards static compilation and not as much effort has been put into the JIT. I guess that Click refers to some optimizations which are missing from the CLR. For instance, at some point in the past, replacing List (a class type) with IList (an interface type) in C# caused a 10x slowdown that a good JIT compiler is able to remove, as discussed in Click's blog post. Dunno if this still holds.

Anyway, this comparison is about the JVM vs. the Microsoft CLR implementation, which runs only on Windows. Mono is available for Linux, but it uses a partially conservative garbage collector based on Boehm's GC, and this gives really inferior results in some cases. The authors of Boehm's GC claim here that such cases can be easily fixed by changing the source program, but this is valid only when the application is written for this GC, not when you want to support programs written for an accurate GC.
Some evidence about this, where the described worst case is implemented, is available here.
In the end, if you want to run on Linux and you aim for the best and most robust performance, you currently have no real choice (as also suggested by this entry of the Language Shootout, but read their notes about the flawedness of such comparisons). We have to hope for improvements in Mono.