Breaking "provably correct" Leftpad

  • Every time I hear people talking about ensuring software correctness, it reminds me of this story: https://horningtales.blogspot.com/2006/09/exhaustive-testing...

    A university group in the 1960s finds a vendor-supplied binary-to-BCD conversion function sometimes produces an off-by-one error.

    They devise a simple fix - but find it adds an extra 'drum revolution' and so write an even more refined fix, that produces the right answer without taking any extra time.

    Then they test it, over the course of several weeks, counting from 0 to 9,999,999, both in binary and in BCD, converting the binary to BCD, and comparing the results.

    They proudly send the perfected implementation to the vendor - who sends it on to other users of the machine. Soon after they receive a phone call: "Were you aware that your new routine drops the sign of negative numbers?"

  • So it's not really leftpad that's broken, it's that each language has a different definition of "String.length", generally corresponding to the definition of the underlying encoding, when dealing with accents, emojis, foreign characters, etc., and which notably does not always correspond to the number of spaces occupied by a string a monospace font.

    IOW the proofs are correct: leftpad will result in spaces on the left, input string on the right, and String.length as specified. It's the spec itself that was incorrect: the last requirement should be based on "number of spaces occupied by the string in a monospace font", not "string.length", if that's what the desired requirement is.

    That said, I think that's largely the author's point. You can prove that your code meets the spec, but you can't prove that your spec meets what you actually intended.

  • The “random” choice of swift was quite fortunate since what this really seems to be testing is the ergonomics of the Unicode “character” definition in the standard libraries used, and swift has the best defaults of the languages mentioned haha

  • Hillel Wayne posted a followup https://buttondown.com/hillelwayne/archive/three-ways-formal... which may be interesting. Essentially the issue is "what does the Length of a string mean?"

  • > The leftpad function provided[1] didn’t take a string, but a char[]. Thankfully, it’s easy to convert from String objects, using the .toCharArray() function. So I did that.

    Java's unicode handling is a monumental footgun of the most devastating variety where it works for most common cases, and almost all code that is not written with care to how it is handled will not deal well with code points that require more than 2 bytes to represent.

    If you insist on using a char array (which is a bit unidiomatic), you should be using Character$charPointCount[2] to figure out how many code points are in the array, and even then you're probably SOL with regards to non-trivial emojis. String$charPointCount[3] is also an option if you want to use String and StringBuilder to do the padding, which arguably be more idiomatic.

    [1] https://github.com/hwayne/lets-prove-leftpad/blob/ea9c0f09a2...

    [2] https://docs.oracle.com/en/java/javase/25/docs/api//java.bas...

    [3] https://docs.oracle.com/en/java/javase/25/docs/api//java.bas...

  • I think to get it right, leftpad will have to link to libghostty :)

  • For anyone else who was curious C#/.NET:

    By default string.Length measures characters, but System.Globalization.StringInfo is provided to work with "text elements".

    Unlike System.String, StringInfo doesn't have a built-in PadLeft function in it's base library. But it gets the length "right" by the author's standard.

    Code to show lengths:

      using System;
      using System.Globalization;
         
      public class Program
      {
       public static void Main()
       {
        var weirdStrings = new string[] {"𝄞","Å","𓀾","אֳֽ֑","résumé","résumé"};
        foreach(var weirdString in weirdStrings){
         var asStringInfo = new StringInfo(weirdString);
       Console.WriteLine($"{weirdString.PadLeft(10,'-')} {weirdString.Length} {asStringInfo.LengthInTextElements}");
        }
       }
      }

  • If you skimmed the article and missed the subtle notes, the author deliberately chose the wrong way to use the Rust version to make a point. Even ChatGPT told him it was wrong for his use case:

    > As mentioned, the other way to use the Rust version has the same behaviour as the Haskell/Lean/etc versions. ChatGPT did actually point out to me that this way was better, and the other way (the one I used) was adequate only if the input was limited to ASCII.

  • Further reading on the Knuth correctness quote https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf