Hacker News Clone

Linux 6.1 will make it a bit easier to help spot faulty CPUs

by HieronymusBosch on 8/25/2022, 3:18 PM with 7 comments

by CJefferson on 8/25/2022, 5:03 PM
This reminds me of Raymond Chen's blog post about overclocked CPUs (https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...), where Microsoft were getting crash reports pointing at CPU instructions that should be unable to cause segfaults, such as XORing a register with itself.
I imagine if you are getting such crashes, it would be useful to see exactly where they are coming from!
by tester756 on 8/25/2022, 11:29 PM
Is it related to this?
>Cores that don't count
>https://research.google/pubs/pub50337/
>We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent” – the only symptom is an erroneous computation.
>We refer to a core that develops such behavior as “mercurial.” Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem – one that will require collaboration between hardware designers, processor vendors, and systems software architects.
>Violations of lock semantics leading to application data-corruption and crashes.