Why git is so fast (or why Java is not as fast as C)

  • I find it interesting that every single one of his criticisms of JGit would not apply to a C# implementation. With unsigned types, unmanaged memory allocation (Marshal.AllocHGlobal), unsafe code with pointers, native P/Invoke access, value types (e.g. fixed arrays for his SHA-1 point), and an IDisposable convention for managing all the unmanaged trickery, C# has pretty much all the needed features.

    When optimizing C# for space use in particular (which ends up being a time optimization for I/O heavy loads), I've leaned heavily on arrays of structs, and even bit-packed arrays (e.g. an Int32 array storing 14-bit integers, packed so that they don't align on index boundaries).

    Programming at the level of C in C# loses most of the benefits of the language, of course, but at least you can hide the optimizations behind pretty APIs, and write the non-critical parts on top of an easier-to-work-with lower layer.
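
    As a rough illustration of the bit-packing idea above, here is a minimal sketch in Java rather than C#, since Java is the language under discussion. The class name PackedIntArray and its methods are made up for this example and are not taken from JGit or any library. Values of a fixed width are stored back to back in an int array, so a 14-bit value may straddle two array elements.

```java
/** Fixed-width unsigned integers packed back to back into an int[] (hypothetical example class). */
final class PackedIntArray {
    private final int[] words;  // backing storage, 32 bits per element
    private final int bits;     // width of each packed value, 1..31
    private final int mask;     // the low `bits` bits set
    private final int length;   // number of packed values

    PackedIntArray(int length, int bits) {
        if (bits < 1 || bits > 31) throw new IllegalArgumentException("bits must be 1..31");
        if (length < 0) throw new IllegalArgumentException("length must be >= 0");
        this.length = length;
        this.bits = bits;
        this.mask = (1 << bits) - 1;
        long totalBits = (long) length * bits;
        this.words = new int[(int) ((totalBits + 31) >>> 5)];
    }

    int get(int index) {
        checkIndex(index);
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 5);   // which int holds the first bit
        int offset = (int) (bitPos & 31);  // bit position inside that int
        int value = words[word] >>> offset;
        int spill = offset + bits - 32;    // bits that live in the next int
        if (spill > 0) {
            value |= words[word + 1] << (bits - spill);
        }
        return value & mask;
    }

    void set(int index, int value) {
        checkIndex(index);
        if ((value & ~mask) != 0) throw new IllegalArgumentException("value does not fit in " + bits + " bits");
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 5);
        int offset = (int) (bitPos & 31);
        // Clear the slot in the first int, then write the low part of the value.
        words[word] = (words[word] & ~(mask << offset)) | (value << offset);
        int spill = offset + bits - 32;
        if (spill > 0) {
            int shift = bits - spill;      // bits already stored in the first int
            words[word + 1] = (words[word + 1] & ~(mask >>> shift)) | (value >>> shift);
        }
    }

    /** Number of ints actually allocated. */
    int storageWords() {
        return words.length;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
    }
}
```

    With 14-bit values, new PackedIntArray(100, 14) allocates 44 ints rather than 100, at the price of a few shifts and masks on every access.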

  • Good article, but it seems to ignore the elephant in the room: the JVM's cold start times.

    For me, git's use case is being called from the command line, often interactively, between code editing sessions. It needs to start fast and finish fast, to not interrupt my workflow.

    The JVM's cold start times are huge. On the order of a second. Noticeably slow. So slow, that in the common case of committing a few files, the JVM startup time would totally dwarf the time spent actually doing work.

  • This has always been true and will continue to be true.

    Is it possible to build a high level language that's as fast as C? Sure--but only if you restrict the programmer to the same amount of effort in both languages. If your application is one where it's worthwhile applying a great deal of extra programmer time in order to improve performance, a low level language will always win because it exposes more of the native machine.

    The goal of a high-performance high-level language should be to provide C-like performance for a reasonably unoptimized application. Once one starts to optimize a program down to the last instruction, staying on par with a low level language becomes simply impossible for a high level language.

    In my experience, every level of abstraction you create away from the machine limits your performance by some amount. This can be demonstrated without even leaving assembly language!

    Fastest possible: raw assembly code in a NASM-like assembler. With this, you can write basically any code possible with no limitations, at the cost of extremely high programmer time.

    Shortcut: Use inline assembler instead of NASM to simplify calling convention and other niceties.

    Cost: There's now a whole bunch of stuff, like calling convention optimization and computed jumps, which you can no longer do.

    Shortcut: Use compiler intrinsics instead of raw assembly.

    Cost: You can no longer tweak your algorithm to minimize register spills because you aren't directly controlling spills anymore.

    Shortcut: Use a set of macros (like my project does) for handling calling convention, MMX/SSE abstraction, and other such simplifications.

    Cost: You tend to overlook optimizations that apply to one possible output of the abstraction and not others, resulting in either messes of ifdefs or suboptimal code--the former of which is of course violating the abstraction.

    Shortcut: Use a framework like liboil to write SIMD assembly instead of native code.

    Cost: By using generic SIMD operators, you lose access to specialized architecture-specific operations, along with the aforementioned issue of register spills.

    Here we haven't even gotten beyond assembler and we're already losing performance. Now scale this up to C and beyond: abstraction inherently comes at a performance cost. It isn't even merely a function of language: abstractions within a language reduce performance as well.

  • From the last line:

    "But, JGit performs reasonably well; well enough that we use internally at Google as a git server."

    Rings the 'good enough' bell to me.

  • In strictly language terms, C will always be faster than Java. However, I would dispute that one could say on a general basis that C programs will always be faster than Java programs.

    In fact, I would argue that all things being equal, the Java program is more likely to be faster (and yes, I realize there are a metric ton of potential caveats to what I just said). Why? Because Java allows you to focus on the "big picture" optimizations that really make all the difference.

    On the other hand, given an infinite amount of development time and experienced developers, the C program will likely be much faster. In some cases this is necessary. But for most cases, I personally would rather just ship something than try to squeeze every ounce of efficiency out of it.

  • Of the three issues mentioned in the post, Java's lack of value types is the important one in my view. That's what causes Java programs to use hugely more memory than C, C++, C# or Go programs. Using more memory can translate into an order-of-magnitude drop in performance for memory-intensive applications.
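
    One common workaround for the missing value types, sketched here under assumptions, is to flatten fixed-size records into a single primitive array rather than allocating one object per record. FlatIdTable and its methods are hypothetical names for illustration. The 20-byte ID size matches SHA-1, but this is not a claim about how JGit lays out its data.

```java
import java.util.Arrays;

/** Stores 20-byte IDs back to back in one int[] instead of one heap object per ID (hypothetical example class). */
final class FlatIdTable {
    private static final int ID_BYTES = 20;
    private static final int WORDS_PER_ID = 5;  // 160 bits = 5 x 32-bit ints

    private int[] words = new int[WORDS_PER_ID * 16];  // room for 16 IDs to start
    private int size;

    /** Appends a 20-byte ID and returns its index. */
    int add(byte[] id) {
        if (id.length != ID_BYTES) throw new IllegalArgumentException("expected 20 bytes");
        if ((size + 1) * WORDS_PER_ID > words.length) {
            words = Arrays.copyOf(words, words.length * 2);  // grow by doubling
        }
        int base = size * WORDS_PER_ID;
        for (int w = 0; w < WORDS_PER_ID; w++) {
            int off = w * 4;
            // Pack four bytes, big-endian, into one int.
            words[base + w] = ((id[off] & 0xFF) << 24)
                    | ((id[off + 1] & 0xFF) << 16)
                    | ((id[off + 2] & 0xFF) << 8)
                    | (id[off + 3] & 0xFF);
        }
        return size++;
    }

    /** Copies the ID at the given index back out as 20 bytes. */
    byte[] get(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
        byte[] id = new byte[ID_BYTES];
        int base = index * WORDS_PER_ID;
        for (int w = 0; w < WORDS_PER_ID; w++) {
            int word = words[base + w];
            int off = w * 4;
            id[off] = (byte) (word >>> 24);
            id[off + 1] = (byte) (word >>> 16);
            id[off + 2] = (byte) (word >>> 8);
            id[off + 3] = (byte) word;
        }
        return id;
    }

    int size() {
        return size;
    }
}
```

    Each ID costs exactly 20 bytes of array space here. Holding the same data as one small object per ID would add an object header and a reference for every entry, which is the kind of overhead a language with value types avoids without this manual flattening.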

  • They use the term "high-level languages" and then use Java as the example. This is going to lead to some very wrong conclusions; the problems listed in the article are problems with Java, not problems with high-level languages.

    I imagine a Git-in-Haskell would be very close in performance to the C git. (Then why is Darcs so slow? Because it uses an icky imperative, mutable model, whereas git uses an immutable functional model.)

  • While I'd never claim Java is comparable to C, I'd like to note that the C codebase has, by their admission, 4 years of work in it. I'd be surprised if the JGit codebase isn't much faster in 3 or 4 years.

  • One thing I have been wondering as a non-git user: is git really CPU-bound in most operations? Maybe someone who is a regular git user can answer me this question. If it's not, I wouldn't expect Java to matter too much to its performance.

  • I'm not wholly convinced that version control is an area where microsecond speed advantages matter. Versatility, portability and compatibility matter more :)

  • >> "why Java is not as fast as C"

    Not actually true. Modern JVMs can make on-the-fly optimizations based on the runtime profile, which would need to be done by hand in C.

    As said elsewhere though, Java excels when used for long-running tasks (servers, backends, etc.) where it can optimize for the long term. It doesn't excel when you try to start up the JVM loads of times for quick individual jobs.

    I don't understand why people are optimizing source control :/ Fast vs. fast, meh. I don't think it's really an issue.