JavaScript’s internal character encoding: UCS-2 or UTF-16?
> Both UCS-2 and UTF-16 are character encodings for Unicode.
UCS-2 is not really a Unicode encoding in the general sense: it can only represent the Basic Multilingual Plane. Saying that both UCS-2 and UTF-16 are character encodings for Unicode is a bit like saying that both 7-bit ASCII and UTF-8 are character encodings for Unicode.
"It produces a variable-length result of either one or two 16-bit code units per code point"
"It produces a variable-length result of either one or two 16-bit code units per code point"from the article.
I feel this should have been quoted or referenced in some way in the article. Or it might just be a very rare case of coincidence.from wikipedia.orgThe encoding is UTF-16, but what it calls "characters" are code units http://unicode.org/glossary/#code_unit, not code points http://unicode.org/glossary/#code_point.
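To make the code unit vs. code point distinction concrete, here's a quick console sketch (any ES3+ engine should behave the same way): U+1D306 is a single code point, but JavaScript reports it as two "characters" because it stores the two UTF-16 code units of its surrogate pair.

    // U+1D306 (TETRAGRAM FOR CENTRE) lies outside the BMP, so UTF-16
    // represents it as a surrogate pair of two 16-bit code units.
    var s = '\uD834\uDF06';
    console.log(s.length);                      // 2 (code units, not code points)
    console.log(s.charCodeAt(0).toString(16));  // "d834" (high surrogate)
    console.log(s.charCodeAt(1).toString(16));  // "df06" (low surrogate)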
For all the gory details about TC-39 work to possibly get rid of this restriction in ECMAScript and support full Unicode, venture to the TC-39 wiki:
http://wiki.ecmascript.org/doku.php?id=strawman:support_full...
Why doesn't everybody use UTF-8? How much overhead is incurred in encoding a non-ASCII language (say, Chinese) in UTF-8 compared to UTF-16?
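For rough numbers: most CJK characters take 3 bytes in UTF-8 versus 2 bytes in UTF-16, so pure Chinese text grows by about 50% in UTF-8, while ASCII-only text (and any markup around it) shrinks by half. A quick way to check, assuming a Node.js environment for the Buffer API:

    // Compare the encoded size of the same text in UTF-8 and UTF-16.
    var zh = '中文维基百科';                          // 6 CJK characters
    console.log(Buffer.byteLength(zh, 'utf8'));     // 18 (3 bytes each)
    console.log(Buffer.byteLength(zh, 'utf16le'));  // 12 (2 bytes each)

    var en = 'Chinese Wikipedia';                   // 17 ASCII characters
    console.log(Buffer.byteLength(en, 'utf8'));     // 17
    console.log(Buffer.byteLength(en, 'utf16le'));  // 34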
Very nice write-up. I was actually looking for something like this about a week ago and was referred to the ECMAScript spec (section 8.4), which talks about "UTF-16 code units", which I believe is effectively just UCS-2. If this is the case, I kind of wonder if the spec should be updated to make things a little clearer, since the issue isn't straightforward for those who don't know a lot about Unicode.
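The practical upshot of the spec counting UTF-16 code units is that length, charCodeAt, indexOf and so on all operate per code unit, not per character. If you need an actual code point count you have to pair up surrogates yourself; here's one possible sketch (countCodePoints is just an illustrative name, not anything from the spec):

    // Count Unicode code points by collapsing each surrogate pair into one.
    function countCodePoints(str) {
      var count = 0;
      for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate is one astral code point.
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length &&
            str.charCodeAt(i + 1) >= 0xDC00 && str.charCodeAt(i + 1) <= 0xDFFF) {
          i++;  // skip the low surrogate
        }
        count++;
      }
      return count;
    }

    console.log('\uD834\uDF06'.length);            // 2 code units
    console.log(countCodePoints('\uD834\uDF06'));  // 1 code point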
This means that applications that want to store binary data as efficiently as possible in localStorage (e.g. Offline Wikipedia, https://news.ycombinator.com/item?id=3409512) can pack two bytes into each string character: ECMAScript strings are just arrays of 16-bit unsigned integers (e.g. '\ud800' is a valid JS string but is not valid UTF-16).
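A rough sketch of that packing trick (packBytes/unpackBytes are made-up names, and padding an odd-length input adds a trailing zero byte). One caveat: byte pairs like 0xD8 0x00 produce lone surrogates such as '\ud800', which engines accept in strings but which may not survive every serialization path, so it's worth testing round-trips in the browsers you target.

    // Pack an array of bytes (0-255) two per 16-bit code unit, and unpack again.
    function packBytes(bytes) {
      var units = [];
      for (var i = 0; i < bytes.length; i += 2) {
        // High byte in the upper 8 bits, next byte (or 0 padding) in the lower 8.
        units.push(String.fromCharCode((bytes[i] << 8) | (bytes[i + 1] || 0)));
      }
      return units.join('');
    }

    function unpackBytes(str) {
      var bytes = [];
      for (var i = 0; i < str.length; i++) {
        var unit = str.charCodeAt(i);
        bytes.push(unit >> 8, unit & 0xFF);
      }
      return bytes;
    }

    localStorage.setItem('blob', packBytes([0x48, 0x69, 0xFF]));
    console.log(unpackBytes(localStorage.getItem('blob')));  // [72, 105, 255, 0]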