5.4 Universal Transformation Format, 8 bits
Although UTF-8 did not originally come from the
IETF, RFC 2279 now describes it. In letters sent
on 1995-01-21 and 1995-04-20, Markus Kuhn writes:
UTF-8 is an ASCII
compatible multi-byte encoding of the ISO 10646
universal character set
(UCS). UCS is a 31-bit
superset of all other character set standards.
The first 256 characters of UCS are
identical to those of ISO 8859-1
(Latin-1). The UCS-2
encoding of UCS is a sequence of bigendian 16-bit
words, the UCS-4 encoding is a
sequence of bigendian 32-bit words. The
UCS-2 subset of ISO 10646
is also known as “Unicode”. As both
UCS-2 and UCS-4 require
heavy modifications to traditional
ASCII oriented system designs (e.g.
Unix), the UTF-8 encoding has been
designed for these applications.
In UTF-8, only ASCII
characters are encoded using bytes below 128. All
other non-ASCII characters are encoded as
multi-byte sequences consisting only of bytes in
the range 128-253. This avoids critical bytes
like NUL and / in
UTF-8 strings, which makes the
UTF-8 encoding suitable for being
handled by the standard C string library and
being used in Unix file names. Other properties
include the preserved lexical sorting order and
that UTF-8 allows easy
self-synchronisation of software receiving
UTF-8 strings.
UTF-8 is the most common external surface of
UCS. Each character uses from one to six bytes,
and the encoding is able to represent all 2^31
characters of the UCS. It is implemented as a
charset, with the following properties:
- Strict 7-bit ASCII is completely invariant
under UTF-8, and ASCII characters are the only
one-byte characters. UCS values and ASCII values
coincide. No multi-byte character ever contains a
byte less than 128. NUL is NUL. A multi-byte
character always starts with a byte of 192 or
more, and is always followed by one or more bytes
in the range 128 to 191. That means that you may
read at random on disk or in memory, and easily
discover the start of the current, next or
previous character. You can count, skip or
extract characters with this knowledge alone.
- If you read the first byte of a multi-byte
character in binary, it begins with a run of at
least two ‘1’ bits, starting at the most
significant bit (from the left). The length of
this run of ‘1’ bits equals the number of bytes
in the character. All succeeding bytes start with
‘10’. This is a lot of redundancy, making it
fairly easy to guess that a file is valid UTF-8,
or to safely state that it is not.
- In a multi-byte character, if you remove all
leading ‘1’ bits of the first byte, and the
initial ‘10’ bits of all remaining bytes (so
keeping 6 bits per byte for those), the remaining
bits concatenated are the UCS value.
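The three properties above can be sketched as a small decoder. This is only an illustration in Python, not Recode's implementation: the function name decode_char is hypothetical, and the sketch follows the original one-to-six-byte scheme described here, without rejecting overlong forms.

```python
def decode_char(data: bytes, pos: int = 0):
    """Decode one character starting at data[pos].

    Returns a (UCS value, length in bytes) pair. Hypothetical helper,
    written only to illustrate the bit layout described in the text.
    """
    first = data[pos]
    if first < 0x80:
        # Strict 7-bit ASCII: one byte, value unchanged.
        return first, 1

    # Count the leading '1' bits of the first byte.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        # One leading '1' is a continuation byte; 7 or 8 means 254 or 255.
        raise ValueError("byte %#x cannot start a character" % first)
    if pos + length > len(data):
        raise ValueError("truncated multi-byte character")

    # mask now sits on the '0' bit that ends the run; keep the bits below it.
    value = first & (mask - 1)
    for byte in data[pos + 1:pos + length]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("continuation byte must start with '10'")
        # Drop the initial '10' and append the remaining 6 bits.
        value = (value << 6) | (byte & 0x3F)
    return value, length
```

For example, decode_char(bytes([0xC3, 0xA9])) yields the pair (0xE9, 2): the first byte contributes its low five bits, the second its low six.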
These properties also have a few nice
consequences:
- Conversion to/from values is algorithmically
simple, and reasonably speedy.
- A single byte can hold characters needing up to
7 bits in their UCS representation. A sequence of
N bytes, with N between 2 and 6, can hold
characters needing up to 1 + 5N bits (11, 16, 21,
26 and 31 bits respectively). So, UTF-8 is most
economical when mapping ASCII (1 byte), followed
by UCS-2 (1 to 3 bytes) and UCS-4 (1 to 6 bytes).
- The lexicographic sorting order of
UCS strings is preserved.
- Bytes with value 254 or 255 never appear, and
because of that, these are sometimes used when
escape mechanisms are needed.
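As a rough illustration of these consequences, the following Python sketch encodes a single UCS value using the original one-to-six-byte scheme. The names LIMITS and encode_char are hypothetical and chosen for this example only; this is not how Recode itself is written.

```python
# Exclusive upper bounds of the UCS values that fit in 1, 2, ... 6 bytes.
LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def encode_char(value: int) -> bytes:
    """Encode one UCS value (0 to 2**31 - 1) as one to six bytes."""
    if not 0 <= value < LIMITS[-1]:
        raise ValueError("UCS value out of range")

    # Pick the shortest sequence length that can hold the value.
    n = next(i for i, limit in enumerate(LIMITS, start=1) if value < limit)
    if n == 1:
        return bytes([value])

    # Peel off 6 bits at a time for the trailing bytes, lowest bits first.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6

    # First byte: n leading '1' bits, a '0' bit, then the remaining bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | value
    return bytes([lead] + tail[::-1])
```

Sorting a list of UCS values with sorted(values, key=encode_char) gives the same order as sorting the values themselves, which is the order-preservation property; and since the largest possible first byte is 253, the bytes 254 and 255 never occur in the output.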
In some cases, when little processing is done on
a lot of strings, one may choose for efficiency
reasons to handle UTF-8 strings directly despite
their variable length, as it is easy to find the
start of characters. Character insertion or
replacement might require moving the remainder of
the string in either direction. In most cases, it
is faster and easier to convert from UTF-8 to
UCS-2 or UCS-4 prior to processing.
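Handling UTF-8 strings directly mostly relies on recognizing continuation bytes, that is, bytes whose two top bits are ‘10’. A minimal Python sketch, assuming the input is already valid UTF-8 (the helper names char_start and count_chars are hypothetical, not part of Recode):

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the character containing data[pos]."""
    # Step backwards over continuation bytes (range 128 to 191).
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def count_chars(data: bytes) -> int:
    """Count characters without decoding: every non-continuation byte starts one."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)
```

Neither helper computes any UCS value, which is why simple operations such as counting or skipping characters stay cheap on the variable-length form.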
This charset is available in
Recode under the name UTF-8. Accepted
aliases are UTF-2,
UTF-FSS, FSS_UTF,
TF-8 and u8.