5.1 Universal Character Set, 2 bytes
One surface of UCS is usable for the subset defined by its first
sixty thousand characters (in fact, 31 * 2^11 codes), and uses
exactly two bytes per character. It is a mere dump of the internal
memory representation that is natural for this subset and, as such,
it carries endianness problems with it.
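
The arithmetic and the endianness issue above can be illustrated with a
short Python sketch. It is not part of Recode; the helper name
ucs2_bytes is hypothetical, and the comment identifying the excluded
2^11 codes with the surrogate range U+D800..U+DFFF is an assumption
that the text above does not state.

```python
import struct

# 31 blocks of 2**11 codes each, as stated in the text.
usable = 31 * 2**11

# Assumption: the 2**11 codes left out of the 2**16 possible values
# are the surrogate range U+D800..U+DFFF.
assert usable == 0x10000 - (0xDFFF - 0xD800 + 1)


def ucs2_bytes(code, big_endian=True):
    """Return the two-byte memory dump of one 16-bit code."""
    return struct.pack(">H" if big_endian else "<H", code)


print(usable)
# The same code yields two different byte sequences, depending on
# which byte order the producing machine uses.
print(ucs2_bytes(0x00E9).hex())
print(ucs2_bytes(0x00E9, big_endian=False).hex())
```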

A non-empty UCS-2 file normally begins with a so-called byte order
mark, having value 0xFEFF. The value 0xFFFE is not a UCS character, so
if this value is seen at the beginning of a file, Recode reacts by
swapping all pairs of bytes. The library also properly reacts to other
occurrences of 0xFEFF or 0xFFFE elsewhere than at the beginning,
because concatenation of UCS-2 files should stay a simple matter, but
this might trigger a diagnostic about non-canonical input.
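
The byte order mark handling described above might be sketched as
follows in Python. This is an illustration only, not Recode's actual
code: the function read_ucs2 is hypothetical, and it assumes that input
without a mark is big-endian, that marks are dropped from the decoded
data, and that a mark found after the beginning merely clears a
"canonical" flag.

```python
BOM = 0xFEFF          # byte order mark, read in the right order
SWAPPED_BOM = 0xFFFE  # the same mark read with its bytes exchanged


def read_ucs2(data):
    """Decode UCS-2 bytes into a list of codes, honoring byte order marks.

    Returns (codes, canonical). canonical is False when a mark was seen
    elsewhere than at the very beginning of the data.
    """
    if len(data) % 2:
        raise ValueError("UCS-2 data must hold an even number of bytes")
    swap = False       # assumption: no mark means high-order byte first
    canonical = True
    codes = []
    for i in range(0, len(data), 2):
        hi, lo = data[i], data[i + 1]
        if swap:
            hi, lo = lo, hi
        code = (hi << 8) | lo
        if code == SWAPPED_BOM:
            # The mark appears backwards: flip the byte order from here on.
            swap = not swap
            code = BOM
        if code == BOM:
            if i != 0:
                canonical = False
            continue   # the mark itself is not copied to the result
        codes.append(code)
    return codes, canonical
```

Under these assumptions, concatenating a big-endian file and a
little-endian file still decodes correctly, and the second mark is what
flags the input as non-canonical.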

By default, when producing a UCS-2 file, Recode always outputs the
high-order byte before the low-order byte. This can easily be
overridden through the 21-Permutation surface (see Permutations). For
example, the command:

    recode u8..u2/21 < input > output

asks for a UTF-8 to UCS-2 conversion, with swapped byte output.
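
To make the effect of the swapped output concrete, here is a rough
Python model of the conversion requested above. It is a sketch, not a
description of Recode internals: the function utf8_to_ucs2 is
hypothetical, and it assumes that a byte order mark is written first
for non-empty input and that the mark is swapped along with the data.

```python
def utf8_to_ucs2(data, swapped=False):
    """Convert UTF-8 bytes to UCS-2 bytes, optionally swapping each pair."""
    text = data.decode("utf-8")
    if not text:
        return b""                 # only non-empty files carry a mark
    out = bytearray()
    # Assumption: the byte order mark 0xFEFF leads the output.
    for code in [0xFEFF] + [ord(ch) for ch in text]:
        if code > 0xFFFF:
            raise ValueError("character does not fit in two bytes")
        hi, lo = code >> 8, code & 0xFF
        # Default order is high-order byte first; swapping exchanges
        # the two bytes of every pair.
        out += bytes([lo, hi]) if swapped else bytes([hi, lo])
    return bytes(out)


default_out = utf8_to_ucs2("é".encode("utf-8"))
swapped_out = utf8_to_ucs2("é".encode("utf-8"), swapped=True)
print(default_out.hex(), swapped_out.hex())
```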

Use UCS-2 as a genuine charset. This charset is available in Recode
under the name ISO-10646-UCS-2. Accepted aliases are UCS-2, BMP, rune
and u2.

The Recode library is able to combine some sequences of UCS-2 codes
into single code characters, to represent a few diacriticized
characters, ligatures or diphthongs which have been included to ease
mapping with other existing charsets. It is also able to explode such
single code characters into the corresponding sequence of codes. The
request syntax for triggering such operations is rudimentary and
temporary. The combined-UCS-2 pseudo character set is a special form of
UCS-2 in which known combinings have been replaced by the simpler code.
Using combined-UCS-2 instead of UCS-2 in an "after" position of a
request forces a combining step, while using combined-UCS-2 instead of
UCS-2 in a "before" position of a request forces an exploding step. For
the time being, one has to resort to advanced request syntax to achieve
other effects. For example:

    recode u8..co,u2..u8 < input > output

copies a UTF-8 input over output, still to be in UTF-8, yet merging
combining characters into single codes whenever possible.
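
The idea of combining and exploding can be approximated in Python with
the standard unicodedata module. This is only an analogy: Recode relies
on its own table of known combinings, whereas the sketch below
substitutes Unicode normalization (NFC to combine, NFD to explode), so
the two will not agree on every character. The function names are
hypothetical.

```python
import unicodedata


def combine_utf8(data):
    """Merge base characters and combining marks into single codes.

    Stand-in technique: Unicode NFC normalization, not Recode's table.
    """
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


def explode_utf8(data):
    """Split single codes into the corresponding sequences of codes.

    Stand-in technique: Unicode NFD normalization, not Recode's table.
    """
    text = data.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")


# "e" followed by U+0301 (combining acute accent) becomes U+00E9.
sequence = "e\u0301".encode("utf-8")
single = combine_utf8(sequence)
print(len(sequence.decode("utf-8")), len(single.decode("utf-8")))
```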