More evidence for problems in VM warmup
Note that certain things can cause your code to be un-jitted. If you run into an NPE, for example, what actually happens is that the compiled code segfaults. Adding explicit null checks everywhere would be slow, so instead the JVM lets it blow up and handles the resulting trap. When the jitted code segfaults, that bytecode goes back into interpreted mode and has to be compiled again with updated heuristics.
Still, covering cases like this is easier than writing C. :)
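A hedged Java sketch of the scenario described above (the class and method names are mine, not JVM internals): a hot method gets compiled with an implicit null check, then a null trips the trap. Running under HotSpot with `-XX:+PrintCompilation` should show the compiled version being marked "made not entrant" around the time the NPE is thrown.

```java
// Sketch only; names are illustrative, not JVM internals.
public class NullDeopt {
    static int len(String s) {
        // The JIT typically omits an explicit null check here and relies on the
        // hardware trap (SIGSEGV) if s ever turns out to be null.
        return s.length();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) {
            len("warm");   // get len() hot enough to be JIT-compiled
        }
        try {
            len(null);     // trap fires; the compiled code is deoptimized and the
                           // method runs interpreted until it is recompiled with
                           // updated assumptions
        } catch (NullPointerException e) {
            System.out.println("caught after deopt: " + e);
        }
    }
}
```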
It's interesting how often the same stories play out for different interpreter implementations. Everyone tries:
- a multi-pass JIT
- interpreting the input directly to reduce time to first op
- making the fastest transpiler they can and skipping the interpreter
All of these address different constraints and affect each other. For instance, a cheap transpiler is only slightly slower than an interpreter loop, which lets you move the JIT threshold farther to the right. If you avoid trying to optimize things you will only be slightly successful at, you can invest more of your CPU budget in deeper optimization of the hottest paths. You also run on-stack replacement less often, and in fewer scenarios, which may mean you make different tradeoffs there as well.
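A minimal sketch of that trade-off, with entirely hypothetical names (MethodProfile, CompiledCode) and numbers: because the baseline tier (the cheap transpiler's output) is already reasonably fast, the counter threshold for invoking the optimizing compiler can sit much higher, so the expensive compiler only ever sees genuinely hot methods.

```java
// Illustrative only; names and thresholds are hypothetical, not from any real VM.
final class MethodProfile {
    interface CompiledCode { void run(); }

    // With a cheap-but-decent baseline tier you can afford a high threshold,
    // spending the optimizing compiler's budget only on the hottest methods.
    static final int OPT_THRESHOLD = 10_000;

    private int invocations;
    private final CompiledCode baseline;   // output of the fast transpiler
    private CompiledCode optimized;        // produced lazily once the method is hot

    MethodProfile(CompiledCode baseline) { this.baseline = baseline; }

    CompiledCode select() {
        if (optimized != null) return optimized;
        if (++invocations >= OPT_THRESHOLD) {
            // Stand-in for handing the method to a background optimizing compiler.
            optimized = () -> System.out.println("running optimized body");
            return optimized;
        }
        return baseline;
    }
}
```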
If I were the King of the Forest, our routing fabric would use something like nginx's ramping-weight feature to throttle traffic to new servers for a few minutes before they reach full membership in the cluster.
As things are, we end up running something more like blue-green deployments and hit the dormant side with a stress-testing tool. We haven't really come up with a better solution, though we have steadily reduced both the time needed to warm up the servers and the worst-case behavior if you accidentally skip the warming step. Today you will just have a very bad time; originally, circuits would blow like crazy.
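A hedged sketch of that warming step, assuming the dormant side is reachable at a placeholder URL: a small driver that replays representative requests until the hot paths have been compiled, before the routing layer flips real traffic over.

```java
// Sketch only; the URL, endpoint, and request count are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WarmDormantSide {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://dormant.internal.example/api/representative-path"))
                .GET()
                .build();

        // Replay enough representative traffic that the hot paths get JIT-compiled
        // before real users are routed to this side.
        for (int i = 0; i < 50_000; i++) {
            client.send(req, HttpResponse.BodyHandlers.discarding());
        }
    }
}
```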
I appreciate that these issues definitely complicate research and benchmarking of VM performance, but I'm confused about this claim: "If you're a normal user, the results suggest that you're often not getting the performance you expect". Is that true? Most VMs are not simple, predictable systems that would ever reach a "steady state"; they're dynamic systems that are constantly executing different parts of the codebase and taking wildly different codepaths at different times. V8 is a great example of this: the type of code executed on a page changes wildly depending on what you're doing and what actions you take. Why would we even want to optimize for "reaching a steady state" when different parts of the codebase may be more or less useful at different times? It seems more important to me to work on optimizations that allow us to deoptimize parts of the codebase that we don't think will be useful again, to save memory, even if that means sacrificing this theoretical notion of a "steady state".
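A minimal sketch of the kind of "forget what is no longer useful" policy this comment is arguing for, with entirely hypothetical names (CodeCache, CompiledUnit): track when each compiled unit last ran and discard the ones that have gone cold, trading possible recompilation later for memory now.

```java
// Illustrative only; the cache, its entries, and the eviction policy are hypothetical.
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class CodeCache {
    static final class CompiledUnit {
        long lastUsedMillis;
        final byte[] machineCode;
        CompiledUnit(byte[] machineCode) { this.machineCode = machineCode; }
    }

    private final Map<String, CompiledUnit> units = new LinkedHashMap<>();
    private final long maxIdleMillis;

    CodeCache(long maxIdleMillis) { this.maxIdleMillis = maxIdleMillis; }

    // Record that a compiled method just executed.
    void touch(String method) {
        CompiledUnit u = units.get(method);
        if (u != null) u.lastUsedMillis = System.currentTimeMillis();
    }

    // Drop compiled code that hasn't run recently; those methods fall back to the
    // interpreter and may be recompiled later if they ever become hot again.
    void evictCold() {
        long now = System.currentTimeMillis();
        for (Iterator<CompiledUnit> it = units.values().iterator(); it.hasNext(); ) {
            if (now - it.next().lastUsedMillis > maxIdleMillis) {
                it.remove();
            }
        }
    }
}
```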
For this research to be useful, the authors should switch to large benchmarks.
VMs win on average. That means they have great cases, average cases, and crappy cases. The crappy cases will always exist; VM architects know this, and it's not a bug.
Essentially this work is like criticizing a professional gambler for his biggest losses when he's ahead on average (or, conversely, praising him for his biggest wins when he's behind on average).
Source: I build VMs for a living.
> Here's an example of a "good" benchmark from our dataset which starts slow and hits a steady state of peak performance
Am I misinterpreting their graph? The difference between "slow" and "peak performance" seems to be a factor of about 1.005, so a whopping 0.5% improvement after warmup?
A good blog post has to get a lot of hard things right (layout, scope, visuals, audience). Laurence Tratt, you nailed it for me! I loved the details in "Benchmarking methodology" and the clean layout with instructive visuals.
The best warmup is a long-running VM without memory leaks but with long-term stats and a profile-guided JIT, similar to HotSpot.