5 The universal charset
Standard ISO 10646
defines a universal character set,
intended to encompass in the long run all languages
written on this planet. It is based on wide
characters, and offers room for about two billion
characters (2^31).
This charset
was to become available in Recode under the name
UCS, with many external surfaces for
it. But in the current version, only surfaces of
UCS are offered, each presented as a
genuine charset rather than a surface. Such
surfaces are only meaningful for the
UCS charset, so it is not that useful
to draw a line between the surfaces and the only
charset to which they may apply.
UCS stands
for Universal Character Set. UCS-2 and
UCS-4 are fixed length encodings,
using two or four bytes per character respectively.
UTF stands for UCS
Transformation Format; the UTF encodings are variable length
encodings dedicated to UCS.
UTF-1 was based on ISO 2022,
but it did not succeed [1]. UTF-2
replaced it; it has been called
UTF-FSS (File System Safe) in Unicode
or Plan9 context, but is better known today as
UTF-8. To complete the picture, there
is UTF-16, based on 16-bit units, and
UTF-7, which is meant for transmissions
limited to 7-bit bytes. Most often, one might see
UTF-8 used for external storage, and
UCS-2 used for internal storage.
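
The difference between the fixed length and variable length
encodings can be seen by counting bytes. The following is a
minimal sketch in Python, not part of Recode; it assumes that
Python's "utf-16-be" and "utf-32-be" codecs may stand in for
UCS-2 and UCS-4, which holds for the sample characters because
they all fit in 16 bits. The helper name encoded_lengths is
hypothetical.

```python
# Byte lengths of a few characters under fixed and variable length encodings.
# Assumption: "utf-16-be" matches UCS-2 for characters up to U+FFFF, and
# "utf-32-be" matches UCS-4; neither codec writes a byte order mark.
samples = ["A", "\u00e9", "\u20ac"]  # U+0041, U+00E9, U+20AC


def encoded_lengths(ch):
    """Return the number of bytes one character occupies in each encoding."""
    return {
        "UCS-2": len(ch.encode("utf-16-be")),
        "UCS-4": len(ch.encode("utf-32-be")),
        "UTF-8": len(ch.encode("utf-8")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The UCS-2 and UCS-4 columns stay constant at two and four
bytes, while the UTF-8 column grows from one to three bytes as
the code point value increases.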
When Recode
is producing any representation of
UCS, it uses the replacement character
U+FFFD for any valid
character which is not representable in the goal
charset [2]. This happens,
for example, when UCS-2 cannot
represent a wide UCS-4 character or, for
a similar reason, a character read from a UTF-8 sequence
using more than three bytes. The replacement
character is meant to represent an existing
character. So, it is never produced to represent an
invalid sequence or ill-formed character in the
input text. In such cases, Recode just gets rid of
the noise, while taking note of the error in its
usual ways.
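
The two cases described above, a valid but unrepresentable
character versus ill-formed input, can be illustrated with a
short Python sketch. This is not Recode's implementation; the
function names to_ucs2 and decode_utf8_dropping_noise are
hypothetical, and the sketch does not record the error the way
Recode does.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character


def to_ucs2(text):
    """Encode text as big-endian UCS-2.

    Any valid character above U+FFFF cannot fit in two bytes, so it is
    replaced by U+FFFD before encoding.
    """
    narrowed = "".join(ch if ord(ch) <= 0xFFFF else REPLACEMENT for ch in text)
    return narrowed.encode("utf-16-be")


def decode_utf8_dropping_noise(data):
    """Decode UTF-8, discarding ill-formed bytes rather than replacing them."""
    return data.decode("utf-8", errors="ignore")


# U+1F600 needs four bytes in UTF-8 and does not fit in UCS-2.
print(to_ucs2("a\U0001F600"))

# The stray 0xFF byte is not valid UTF-8 and is simply removed.
print(decode_utf8_dropping_noise(b"ab\xffcd"))
```

In the first call the wide character becomes the two bytes FF
FD; in the second the invalid byte disappears and no
replacement character is produced.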
Even if
UTF-8 is an encoding, really, it is
the encoding of a single character set, and nothing
else. It is useful to distinguish between an
encoding (a surface within Recode) and a
charset, but only when the surface may be applied
to several charsets. Specifying a charset is a bit
simpler than specifying a surface in a Recode
request. There would be no practical advantage
in imposing a more complex syntax on Recode users,
when it is simple to treat UTF-8
as a charset. Similar considerations apply to
UCS-2, UCS-4,
UTF-16 and UTF-7. These
are all considered to be charsets.