This Is Why They Call It a Weakly-Ordered CPU

  • Nice blog post, though I personally prefer the ridiculousfish post [0] he links to at the end; that one's an instant classic.

    He mentions Windows/x86 a couple of times. I only wish it were as simple as "this platform does not reorder." Having done low-level, heavily-multithreaded work on Windows for years: it'll behave like a strongly-ordered architecture 999 times out of 1,000 (or more). Then it'll bite you in the ass and do something unexpected. Basically, if you're doing your own synchronization primitives on x86, you pretty much have to rely on visual/theoretical verification, because tests won't error out w/ enough consistency. I've run a test (trying to get away w/ not using certain acquire/release semantics) for an entire week, only to have it error out at the last second (x86_64). Other times, I've shipped code that's been tested and vetted inside out for months, only to get the weirdest bug reports 3 or 4 months down the line in the most sporadic cases.

    0: http://ridiculousfish.com/blog/posts/barrier.html
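
    For anyone who wants to see the one reordering x86 does allow (StoreLoad), here's a minimal C++11 litmus test -- my own sketch, not code from either post:

        // StoreLoad litmus test: with relaxed atomics, even x86 may let each
        // thread's load complete before its own store drains from the store
        // buffer, so r1 == 0 && r2 == 0 can show up. It's rare, which is
        // exactly why week-long test runs can still lie to you.
        #include <atomic>
        #include <cstdio>
        #include <thread>

        std::atomic<int> X{0}, Y{0};
        int r1, r2;

        int main() {
            for (long i = 0; ; ++i) {
                X.store(0, std::memory_order_relaxed);
                Y.store(0, std::memory_order_relaxed);

                std::thread t1([] {
                    X.store(1, std::memory_order_relaxed);
                    r1 = Y.load(std::memory_order_relaxed);  // load may pass the store
                });
                std::thread t2([] {
                    Y.store(1, std::memory_order_relaxed);
                    r2 = X.load(std::memory_order_relaxed);
                });
                t1.join();
                t2.join();

                if (r1 == 0 && r2 == 0) {
                    std::printf("StoreLoad reordering observed on iteration %ld\n", i);
                    return 0;
                }
            }
        }

    (Spawning fresh threads each iteration makes the window small; a version that lines the two threads up with semaphores and random delays will trigger it far more often.)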

  • I'm oddly uncomfortable with this article. It reinforces the idea of memory ordering as voodoo, rather than as something that can be (and needs to be!) understood to properly write low-level multicore code. Neither it nor the linked articles go into any detail about how memory and cores actually interact, and without those details it would be very hard to get from "this seems to work" to "this is bug-free".

    "You can try running the sample application on any Windows, MacOS or Linux machine with a multicore x86/64 CPU, but unless the compiler performs reordering on specific instructions, you'll never witness memory reordering at runtime."

    It may just be poor wording, but I don't think this sentence makes sense -- it conflates compiler optimizations with hardware memory reordering, and implies that the behavior depends on the choice of operating system. While the author probably didn't mean this, it's clear from some of the comments in this thread that it's causing confusion for readers. Worse, it's just not true -- while this particular example might not cause problems, memory reordering is still an issue that needs to be dealt with on x86.

    Analogies can be helpful for intuition, but I think this is a case where one really needs to understand what's happening under the hood. Treating the CPU as a black box is not a good idea here, and test-driven development is probably not a good approach to writing mutexes. Calling attention to the issue is great, but this is an area where you really want to know exactly what guarantees your processor provides, rather than trying things until you find something that seems to work.
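
    To make that distinction concrete, here's a small C++11 sketch (names are mine): with plain stores, the compiler may reorder the two writes on any platform, and a weakly ordered CPU may reorder them even when the compiler doesn't; a release/acquire pair constrains both.

        // Publishing data through a flag. The release store keeps both the
        // compiler and the CPU from letting the flag become visible before
        // the payload; a plain (or relaxed) store of the flag would not.
        #include <atomic>
        #include <cstdio>
        #include <thread>

        int payload;                        // plain data being published
        std::atomic<bool> ready{false};

        void publisher() {
            payload = 42;                                   // must be visible before the flag
            ready.store(true, std::memory_order_release);
        }

        int consumer() {
            while (!ready.load(std::memory_order_acquire)) { }  // pairs with the release
            return payload;                                      // guaranteed to read 42
        }

        int main() {
            std::thread t(publisher);
            std::printf("%d\n", consumer());
            t.join();
        }

    (On x86 the release store typically compiles to a plain mov -- the hardware ordering is already strong enough -- but the compiler is still barred from sinking the payload write below it; on ARM it additionally emits a barrier or a store-release instruction.)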

  • This is why I don't like to use shared memory. It's hard to get right for a variety of reasons.

    At a low level, to try to make this work, you need to do more than worry about a mutex. You need the CPU's cache to be out of the way, the memory area protected, AND the memory bus transactions to be completed!

    So...if C++11 works, this is what it must really do (some of this is handled by the hardware, but these all have to happen...and if there's a hardware bug, you need a software workaround):

    1) Lock the memory area to the writing CPU (this could be a mutex with a memory range, but the safest, and slowest, option is to disable interrupts while you dick with memory. That's unlikely to be available at a high level).

    2) Write the memory through the cache to the actual memory address OR track the dirty bit to make sure CPU2 re-fetches the memory CPU1 just wrote. AND go over to CPU2 and flip the dirty bit if it has this piece of memory in its cache...

    3) Wait for all the memory to be written by the bus. Depending on the implementer of the bus, it's entirely possible for CPU1's memory writes to be heading into memory, but not yet committed, when CPU2's request arrives, giving CPU2 a copy of old data! One way to try to fix this is to have CPU1 read-through-cache to the actual memory location, which the bus will flush correctly since the request is coming from the same device that did the previous write. (I used to do embedded programming and had to use this trick at times; it's possible this was the only bus that worked like this, YMMV.)

    4) Release the locking mechanism and hope it's all correct.

    Realizing that a '1 in a million' chance of failure probably equates to months between failures at most, you can see why bugs with this stuff appear all the time. If you MUST use shared memory as your interface for some reason, you'd better be really careful. And maybe look to move to a different method ASAP. (A sketch of what the C++11 version boils down to is below.)

    Edit: changed memory controller to bus, oops
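
    For comparison, here's a minimal sketch of what the C++11 path looks like (a spinlock over a plain buffer). The hardware notes in the comments are typical codegen, not a guarantee -- the point is that the acquire/release pair replaces the manual lock/flush/wait choreography and lets the coherency hardware do the rest:

        // A bare-bones spinlock protecting a plain buffer, C++11 atomics only.
        #include <atomic>
        #include <cstdio>
        #include <thread>

        std::atomic_flag lk = ATOMIC_FLAG_INIT;
        int shared_buffer[64];

        void writer(int v) {
            while (lk.test_and_set(std::memory_order_acquire)) { }  // step 1: take the lock
            shared_buffer[0] = v;                 // step 2: plain write; cache coherency handles the copies
            lk.clear(std::memory_order_release);  // steps 3-4: the release makes the write visible
                                                  // to whoever acquires the lock next
        }

        int reader() {
            while (lk.test_and_set(std::memory_order_acquire)) { }
            int v = shared_buffer[0];
            lk.clear(std::memory_order_release);
            return v;
        }

        int main() {
            std::thread t([] { writer(42); });
            t.join();
            std::printf("%d\n", reader());
        }

    (On x86 the test_and_set typically becomes a lock-prefixed exchange and the release clear a plain store; on ARM it's an exclusive load/store loop plus the appropriate barrier -- which is exactly where the weak ordering shows up.)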

  • For those seeking more detail, Linux has a great reference on using memory barriers: http://www.kernel.org/doc/Documentation/memory-barriers.txt

  • This is a really interesting article. Multi-core ARM seems to be the first really mainstream processor architecture that behaves this way. There have been others, like Alpha, but none have achieved the ubiquity that multi-core ARM has. I suspect a side effect of this is that many of the "threads are hard" effects that are hidden by x86 will come back to bite a lot of programmers. I think we are going to see a lot more "threads are hard" and "threads are weird" posts in the near future, and hopefully better learning material about threading issues in the longer term. Even more hopefully, this might drive more research and development into abstractions that provide parallelism and concurrency in ways that hide the complexity of threads.

  • Valgrind has tools that supposedly can find certain classes of load/store race conditions. I've never used them in anger, so I can't vouch for them, but it would be interesting to do a test on the example in the article.

    Memcheck is certainly a must-have tool for finding heisenbugs in low-level code - it would be wonderful to have an equally effective solution for race conditions.

    http://valgrind.org/docs/manual/hg-manual.html

    http://valgrind.org/docs/manual/drd-manual.html
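
    For anyone curious, here's a trivially racy program (my own, not the article's example) that both tools flag when run under valgrind --tool=helgrind or valgrind --tool=drd:

        // Deliberate data race: two threads bump a plain int with no
        // synchronization. Helgrind and DRD both report the conflicting
        // accesses, even on runs where the final count happens to be right.
        #include <cstdio>
        #include <thread>

        int counter = 0;   // not atomic, not mutex-protected

        void bump() {
            for (int i = 0; i < 100000; ++i)
                ++counter;                 // racy read-modify-write
        }

        int main() {
            std::thread a(bump), b(bump);
            a.join();
            b.join();
            std::printf("counter = %d (expected 200000)\n", counter);
        }

    Worth noting: these tools are built to find unsynchronized accesses; ordering bugs in code that already uses atomics are a different class of problem, so I wouldn't necessarily expect the same kind of report for the article's example.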

  • Do the memory barriers in the ARM architecture also flush the caches? In Intel x86 architectures the hardware handles the coherency between all the caches, so a CPU core can directly read from the cache line of another core if it finds its own cache line to be dirty. Does this happen on ARM too?

  • Yay memory semantics!

    A classic case where this sort of problem bit Java in the ass: the "double-checked locking pattern" for initializing Singletons. http://www.ibm.com/developerworks/java/library/j-dcl/index.h...

    I'm not sure if this was ever fixed / improved enough to allow the programmer to make this work.
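
    For reference, here's what the pattern looks like in C++11 with explicit atomics -- my own sketch, with a made-up Singleton class. The acquire/release pair is what the broken Java version lacked (Java 5's revised memory model later made volatile strong enough to fill the same role):

        // Double-checked locking with C++11 atomics. Without the acquire
        // load / release store, another thread could observe a non-null
        // pointer before the object's fields are visibly constructed.
        #include <atomic>
        #include <mutex>

        class Singleton {
        public:
            static Singleton* instance() {
                Singleton* p = s_instance.load(std::memory_order_acquire);  // first check
                if (p == nullptr) {
                    std::lock_guard<std::mutex> guard(s_mutex);
                    p = s_instance.load(std::memory_order_relaxed);         // second check, under the lock
                    if (p == nullptr) {
                        p = new Singleton;
                        s_instance.store(p, std::memory_order_release);     // publish the fully-built object
                    }
                }
                return p;
            }

        private:
            Singleton() = default;
            static std::atomic<Singleton*> s_instance;
            static std::mutex s_mutex;
        };

        std::atomic<Singleton*> Singleton::s_instance{nullptr};
        std::mutex Singleton::s_mutex;

        int main() { return Singleton::instance() == Singleton::instance() ? 0 : 1; }

    (In C++11 a function-local static gets you the same guarantee with none of this code, since the standard requires thread-safe initialization of local statics.)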

  • And this is why you don't implement your own mutexes, and instead use the ones provided by the OS.

  • A question: Why is it the CPU architecture that is weakly ordered, if it's the compiler that is reordering the statements? Couldn't you have a compiler on a weakly ordered arch that preserved order, and a compiler on x86 for example that could reorder your statements?

    Isn't it the language spec / compiler that is in charge of this, rather than the CPU? I'd like to know more about this.
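
    For what it's worth, both layers reorder, and they do so independently: the compiler may shuffle memory accesses on any architecture, and a weakly ordered CPU may reorder whatever the compiler left in place. One way to see that they're separate concerns is that C++11 gives you distinct fences for each -- a sketch, not a recommendation to sprinkle raw fences around:

        #include <atomic>

        void compiler_only_barrier() {
            // Forbids the *compiler* from moving memory accesses across this
            // point, but emits no instruction -- the CPU may still reorder.
            std::atomic_signal_fence(std::memory_order_seq_cst);
        }

        void full_barrier() {
            // Forbids the compiler from reordering AND emits whatever the CPU
            // needs (an mfence or equivalent on x86, a dmb on ARM).
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }

        int main() {
            compiler_only_barrier();
            full_barrier();
        }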

  • Great article. CPU reordering is an effect which makes it notoriously difficult to implement lock-free code correctly.