JavaScript’s internal character encoding: UCS-2 or UTF-16?
> Both UCS-2 and UTF-16 are character encodings for Unicode.
UCS-2 is not really a Unicode encoding in the general sense: it can only represent the Basic Multilingual Plane. Saying that both UCS-2 and UTF-16 are character encodings for Unicode is a bit like saying that both 7-bit ASCII and UTF-8 are character encodings for Unicode.
"It produces a variable-length result of either one or two 16-bit code units per code point"
"It produces a variable-length result of either one or two 16-bit code units per code point"from the article.
I feel this should have been quoted or referenced in some way in the article. Or it might just be a very rare case of coincidence.from wikipedia.orgThe encoding is UTF-16, but what it calls "characters" are code units http://unicode.org/glossary/#code_unit, not code points http://unicode.org/glossary/#code_point.
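To make the code unit vs. code point distinction concrete, here's a quick console sketch (any ES3+ engine should behave the same way): U+1D306 is a single code point, but JavaScript reports it as two "characters" because it stores the two UTF-16 code units of its surrogate pair.

    // U+1D306 (TETRAGRAM FOR CENTRE) lies outside the BMP, so UTF-16
    // represents it as a surrogate pair of two 16-bit code units.
    var s = '\uD834\uDF06';
    console.log(s.length);                      // 2 (code units, not code points)
    console.log(s.charCodeAt(0).toString(16));  // "d834" (high surrogate)
    console.log(s.charCodeAt(1).toString(16));  // "df06" (low surrogate)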
For all the gory details about TC-39 work to possibly get rid of this restriction in ECMAScript and support full Unicode, venture to the TC-39 wiki:
http://wiki.ecmascript.org/doku.php?id=strawman:support_full...
Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a non-ASCII language (say, Chinese) in UTF-8 compared to UTF-16?
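For rough numbers: most CJK characters take 3 bytes in UTF-8 versus 2 bytes in UTF-16, so pure Chinese text grows by about 50% in UTF-8, while ASCII-only text (and any markup around it) shrinks by half. A quick way to check, assuming a Node.js environment for the Buffer API:

    // Compare the encoded size of the same text in UTF-8 and UTF-16.
    var zh = '中文维基百科';                          // 6 CJK characters
    console.log(Buffer.byteLength(zh, 'utf8'));     // 18 (3 bytes each)
    console.log(Buffer.byteLength(zh, 'utf16le'));  // 12 (2 bytes each)

    var en = 'Chinese Wikipedia';                   // 17 ASCII characters
    console.log(Buffer.byteLength(en, 'utf8'));     // 17
    console.log(Buffer.byteLength(en, 'utf16le'));  // 34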
Very nice write-up. I was actually looking for something like this about a week ago and was referred to the ECMAScript spec (section 8.4), which talks about "UTF-16 code units", which I believe is effectively just UCS-2. If this is the case, I kind of wonder if the spec should be updated to make things a little clearer, since the issue isn't straightforward for those who don't know a lot about Unicode.
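The practical upshot of the spec counting UTF-16 code units is that length, charCodeAt, indexOf and so on all operate per code unit, not per character. If you need an actual code point count you have to pair up surrogates yourself; here's one possible sketch (countCodePoints is just an illustrative name, not anything from the spec):

    // Count Unicode code points by collapsing each surrogate pair into one.
    function countCodePoints(str) {
      var count = 0;
      for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate is one astral code point.
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length &&
            str.charCodeAt(i + 1) >= 0xDC00 && str.charCodeAt(i + 1) <= 0xDFFF) {
          i++;  // skip the low surrogate
        }
        count++;
      }
      return count;
    }

    console.log('\uD834\uDF06'.length);            // 2 code units
    console.log(countCodePoints('\uD834\uDF06'));  // 1 code point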
This means that applications that want to store binary data as efficiently as possible in localStorage (e.g. Offline Wikipedia, https://news.ycombinator.com/item?id=3409512) can pack two bytes into each string character: ECMAScript strings are just arrays of 16-bit unsigned integers (e.g. '\ud800' is a valid JS string but is not valid UTF-16).
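A rough sketch of that packing trick (packBytes/unpackBytes are made-up names, and padding an odd-length input adds a trailing zero byte). One caveat: byte pairs like 0xD8 0x00 produce lone surrogates such as '\ud800', which engines accept in strings but which may not survive every serialization path, so it's worth testing round-trips in the browsers you target.

    // Pack an array of bytes (0-255) two per 16-bit code unit, and unpack again.
    function packBytes(bytes) {
      var units = [];
      for (var i = 0; i < bytes.length; i += 2) {
        // High byte in the upper 8 bits, next byte (or 0 padding) in the lower 8.
        units.push(String.fromCharCode((bytes[i] << 8) | (bytes[i + 1] || 0)));
      }
      return units.join('');
    }

    function unpackBytes(str) {
      var bytes = [];
      for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        bytes.push(unit >> 8, unit & 0xFF);
      }
      return bytes;
    }

    localStorage.setItem('blob', packBytes([0x48, 0x69, 0xFF]));
    console.log(unpackBytes(localStorage.getItem('blob')));  // [72, 105, 255, 0]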