I'd like to have a canonical place to pool information about Unicode support in various languages. Is it a part of the core language? Is it provided in libraries? Is it not available at all? Is there a popular resource for Unicode information in a language? One language per answer, please. Also, if you could make the language a heading, that would make it easier to find.
Perl has built-in Unicode support, mostly. Sort of. See perldoc for the details.
Python 3k (or 3.0, or 3000) has a new approach to handling text (Unicode) and data:
Text Vs. Data Instead Of Unicode Vs. 8-bit. See also Unicode HOWTO.
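A minimal sketch of that distinction, using only the standard library: text is `str`, data is `bytes`, and moving between them requires an explicit encode/decode step.

```python
text = "café"                   # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: the UTF-8 encoding of that text

print(len(text))                     # 4 code points
print(len(data))                     # 5 bytes ("é" takes two bytes in UTF-8)
print(data.decode("utf-8") == text)  # True

# Mixing the two types is an error rather than a silent source of mojibake:
# text + data  -> TypeError: can only concatenate str (not "bytes") to str
```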
Same as with .NET, Java uses UTF-16 internally. From the `java.lang.String` documentation:

A `String` represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the `Character` class for more information). Index values refer to `char` code units, so a supplementary character uses two positions in a `String`.
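To make "two positions" concrete: a supplementary character encodes to two 16-bit code units (a surrogate pair). The snippet below uses Python only as a convenient way to show the byte-level behaviour; the note about Java's `String.length()` follows from the javadoc quoted above.

```python
ch = "\U0001F600"               # 😀 U+1F600, a supplementary character (outside the BMP)
utf16 = ch.encode("utf-16-le")  # encode as UTF-16 code units, no BOM
print(len(utf16) // 2)          # 2 -> two 16-bit code units, i.e. a surrogate pair

# In Java, "😀".length() likewise returns 2 (char code units),
# while "😀".codePointCount(0, 2) returns 1 (one Unicode code point).
```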
The Q command has complete Unicode support in most implementations.
Delphi 2009 fully supports Unicode. They've changed the implementation of `string` to default to a 16-bit Unicode encoding, and most libraries, including third-party ones, support Unicode. See Marco Cantù's Delphi and Unicode.

Prior to Delphi 2009, support for Unicode was limited, but there were `WideChar` and `WideString` to store 16-bit encoded strings. See Unicode in Delphi for more info.

Note that you can still develop bilingual CJKV applications without using Unicode. For example, a Shift JIS encoded string for Japanese can be stored in a plain `AnsiString`.
It looks like there was no Unicode support before JavaScript 1.3. As of 1.5, UTF-8, UTF-16 and UCS-2 are all supported. You can use Unicode escape sequences in strings, regexes and identifiers. Source
.NET stores strings internally as a sequence of `System.Char` objects. One `System.Char` represents a UTF-16 code unit.

From the MSDN documentation on `System.Char`:
The .NET Framework uses the Char structure to represent a Unicode character. The Unicode Standard identifies each Unicode character with a unique 21-bit scalar number called a code point, and defines the UTF-16 encoding form that specifies how a code point is encoded into a sequence of one or more 16-bit values. Each 16-bit value ranges from hexadecimal 0x0000 through 0xFFFF and is stored in a Char structure.
Additional resources:
Tcl strings have been sequences of Unicode characters since Tcl 8.1 (1999). Internally, they are morphed dynamically between UTF-8 (strictly, the same Modified UTF-8 as Java, due to the handling of U+0000 characters) and UCS-2 (in host endianness and BOM, of course). All external strings (with one exception), including those used to communicate with the OS, are internally Unicode before being transformed into whatever encoding is required for the host (or is manually configured on a communications channel). The exception is where data is copied between two communications channels with a common encoding (and a few other restrictions not germane here); there a direct, copy-free binary transfer is used.
Characters outside the BMP are not currently handled either internally or externally. This is a known issue.
R6RS Scheme
Requires implementations to support Unicode 5.1. All strings are in 'Unicode format'.
C before C99 has no built-in Unicode support. It uses zero-terminated character arrays (`char*` or `char[]`) as strings. A `char` is specified to be a byte (at least 8 bits).

C99 specifies `wcs` functions in addition to the old `str` functions (e.g. `strlen` -> `wcslen`). These functions take `wchar_t*` instead of `char*`. `wchar_t` stands for wide character type. The size of `wchar_t` is compiler-specific and can be as small as 8 bits, although different compilers usually use either 16 bits (UTF-16) or 32 bits (UTF-32).

Most C library functions are transparent to UTF-8. For example, if your operating system supports UTF-8 (and UTF-8 is configured as your system's charset), then creating a file with `fopen` and passing a UTF-8 encoded filename will create a properly named file.
The situation in C++ is very similar (`std::string` -> `std::wstring`), but there are at least efforts to get some sort of Unicode support in the standard library.
Another option is `char*` strings encoded in UTF-8. - dan04
Python 2 has the classes `str` and `unicode`. `str` objects store bytes; `unicode` objects store Unicode characters (internally UCS-2 or UCS-4, depending on how the interpreter was built). Most library functions support both (e.g. `os.listdir('.')` returns a list of `str`, while `os.listdir(u'.')` returns a list of `unicode` objects). Both have `encode` and `decode` methods.
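A quick sketch of the behaviour described above; this is Python 2 code (note the `print` statement), kept deliberately minimal:

```python
# Python 2
b = 'caf\xc3\xa9'          # str: raw bytes (here, the UTF-8 encoding of "café")
u = b.decode('utf-8')      # unicode: a sequence of Unicode characters

print len(b)               # 5 -- len() counts bytes for str
print len(u)               # 4 -- len() counts characters for unicode
print u.encode('utf-8') == b   # True: encode() turns unicode back into bytes
```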
Python 3 basically renamed `unicode` to `str`. The Python 3 equivalent of Python 2's `str` is the type `bytes`. `bytes` has a `decode` method and `str` an `encode` method. Since Python 3.3, `str` objects internally use one of several representations in order to save memory (PEP 393); to the Python programmer they still look like abstract sequences of Unicode code points.
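A short illustration of both points: the `str`/`bytes` split and the flexible internal storage added in 3.3. The exact `getsizeof` numbers below are CPython-specific and vary slightly between versions; only the 1-byte-vs-4-bytes-per-character pattern matters.

```python
import sys

s = "héllo"                      # str: abstract sequence of code points
b = s.encode("utf-8")            # bytes: explicit encoding step
print(type(s), type(b))          # <class 'str'> <class 'bytes'>
print(b.decode("utf-8") == s)    # True

# Flexible internal representation: storage adapts to the widest code point present.
print(sys.getsizeof("a" * 1000))           # roughly 1000 bytes + overhead (1 byte/char)
print(sys.getsizeof("é" * 1000))           # still about 1 byte per character
print(sys.getsizeof("\U0001F600" * 1000))  # about 4 bytes per character
```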
Python supports:
Python does not support/has limited support for:
See also: The Truth about Unicode in Python
None built-in, aside from whatever happens to be available as part of the C string library.
However, once you add frameworks…
NSString and CFString each implement a fully Unicode-based string class (actually several classes, as an implementation detail). The two are “toll-free-bridged” so that the API for one can be used with instances of the other, and vice versa.
For data that doesn't necessarily represent text, there's NSData and CFData. NSString provides methods and CFString provides functions to encode text into data and decode text from data. Core Foundation supports more than a hundred different encodings, including all forms of the UTFs. The encodings are divided into two groups: built-in encodings, which are supported everywhere, and external encodings, which are at least supported on Mac OS X.
NSString provides methods for normalizing to forms D, KD, C, or KC. Each returns a new string.
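Those are the four standard Unicode normalization forms (NFD, NFKD, NFC, NFKC). For a quick feel of what they do, here is the same idea shown with Python's `unicodedata` module (used here only because it is easy to run; the Cocoa methods produce the same standard forms):

```python
import unicodedata

composed   = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"     # "e" followed by a combining acute accent

print(composed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFD", composed) == decomposed)  # True: form D decomposes
print(unicodedata.normalize("NFC", decomposed) == composed)  # True: form C composes

# The "K" (compatibility) forms additionally fold compatibility characters:
print(unicodedata.normalize("NFKC", "\ufb01"))               # "fi" ligature -> "fi"
```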
Both NSString and CFString provide a wide variety of comparison/collation options. Here are Foundation's comparison-option flags and Core Foundation's comparison-option flags. They are not all synonymous; for example, Core Foundation makes literal (strict code-point-based) comparison the default, whereas Foundation makes non-literal comparison (allowing characters with accents to compare equal) the default.
Note that Core Foundation does not require Objective-C; indeed, it was created pretty much to provide most of the features of Foundation to Carbon programmers, who used straight C or C++. However, I suspect most modern usage of it is in Cocoa or Cocoa Touch programs, which are all written in Objective-C or Objective-C++.
D supports UTF-8, UTF-16, and UTF-32 (char, wchar, and dchar, respectively). The table with all the types can be found here.
Rust's strings (`String` and `&str`) are always valid UTF-8 and do not use null terminators. Because UTF-8 is a variable-width encoding, they cannot be indexed as arrays the way strings can be in C/C++, etc. They can be sliced somewhat like in Go using `.get` (since Rust 1.20), with the caveat that it fails (returns `None`) if you try to slice through the middle of a code point.
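The "middle of a code point" caveat is the same issue you hit in any language that exposes UTF-8 as bytes. A small illustration (shown in Python here, simply because it runs without a toolchain): slicing the UTF-8 bytes at a non-boundary yields data that no longer decodes.

```python
s = "héllo"
b = s.encode("utf-8")         # b'h\xc3\xa9llo' -- "é" occupies two bytes
print(b[:3].decode("utf-8"))  # "hé" -- this slice ends on a character boundary
try:
    b[:2].decode("utf-8")     # cuts the two-byte sequence for "é" in half
except UnicodeDecodeError as e:
    print("invalid slice:", e)
# Rust's s.get(..2) would similarly report failure (it returns None)
# instead of handing back an invalid &str.
```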
Rust also has `OsStr`/`OsString` for interacting with the host OS. On Unix it is a byte array (which may contain any sequence of bytes); on Windows it is WTF-8 (a superset of UTF-8 that handles the ill-formed Unicode strings that are allowed in Windows and JavaScript). `&str` and `String` can be freely converted to `OsStr` or `OsString`, but converting the other way requires a check, either failing on invalid Unicode or replacing it with the Unicode replacement character. (There are also `Path`/`PathBuf`, which are just wrappers around `OsStr`/`OsString`.)
There are also the `CStr` and `CString` types, which represent null-terminated C strings; like `OsStr` on Unix, they can contain arbitrary bytes.

Rust doesn't directly support UTF-16, but on Windows it can convert an `OsStr` to and from sequences of 16-bit code units.
The only stuff I can find for Ruby is pretty old and, not being much of a Rubyist, I'm not sure how accurate it is.

For the record, Ruby does support UTF-8, but not multibyte. Internally, it usually assumes strings are byte vectors, though there are libraries and tricks you can usually use to make things work.
Found that here.
Ruby 1.9 attaches encodings to strings. Binary strings use the encoding "ASCII-8BIT". While the default encoding is usually UTF-8 on any modern system, you cannot assume that every third-party library function always returns strings in this encoding; it might return any other encoding (e.g. some YAML parsers do that in some situations). If you concatenate two strings with different encodings, you might get an `Encoding::CompatibilityError`.
Arc doesn't have any unicode support. Yet.
Lua 5.3 has a built-in `utf8` library, which handles the UTF-8 encoding. It allows you to convert a series of code points to the corresponding byte sequence and the other way around, get the length (the number of code points in a string), iterate over the code points in a string, and get the byte position of the nth code point. It also provides a pattern, to be used by the pattern-matching functions in the `string` library, that matches one UTF-8 byte sequence.

Lua 5.3 has Unicode code point escape sequences that can be used in string literals (for instance, `"\u{61}"` for `"a"`). They translate to UTF-8 byte sequences.
Lua source code can be encoded in UTF-8 or any encoding in which ASCII characters take up one byte. UTF-16 and UTF-32 are not understood by the vanilla Lua interpreter. But strings can contain any encoding, or arbitrary binary data.