3.7 Using mixed charset input
In practice, text files are often made up of several charsets at once.
Some parts of the file are encoded in one charset, other parts in
another charset, and so forth. Usually, a file does not switch between
more than two or three charsets. The means to tell which charset is
used at a given place is not always available, and Recode is able to
handle only a few simple cases of mixed input.
The default Recode behaviour is to expect pure charset files, to be
recoded as other pure charset files. However, the following options
handle a few specific kinds of mixed charset files.
- ‘-d’
- ‘--diacritics’
- While converting to or from the HTML or LaTeX charset, limit
conversion to a subset of all characters. For HTML, limit conversion
to the subset of all non-ASCII characters. For LaTeX, limit conversion
to the subset of all non-English letters. This is particularly useful
when people write files that would be valid HTML, TeX or LaTeX, if
only they had used the provided sequences for applying diacritics
instead of typing the diacriticised characters directly from the
underlying character set.
While converting to the HTML or LaTeX charset, this option assumes
that characters not in the said subset are already properly coded or
protected, so Recode transmits them literally. While converting the
other way, this option prevents coded or protected versions of
characters not in the said subset from being translated back.
See HTML. See LaTeX.
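
The HTML half of this behaviour can be illustrated with a minimal Python sketch. It is not Recode's implementation; the function name `to_html_diacritics_only` is hypothetical, and the entity names come from Python's standard `html.entities` table (HTML 4.01), which is assumed here as a stand-in for Recode's own HTML tables.

```python
from html.entities import codepoint2name


def to_html_diacritics_only(text: str) -> str:
    """Replace only non-ASCII characters with HTML character references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII, including '<', '>' and '&', is assumed to be markup
            # that is already properly coded, so it is copied literally.
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists for this character.
            out.append(f"&{codepoint2name[cp]};")
        else:
            # Fall back to a numeric character reference.
            out.append(f"&#{cp};")
    return "".join(out)


sample = "<p>Café &amp; crème</p>"
print(to_html_diacritics_only(sample))
# prints: <p>Caf&eacute; &amp; cr&egrave;me</p>
```

The tags and the existing `&amp;` pass through unchanged, which is the point of limiting the conversion: a full conversion would have escaped the markup itself.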
- ‘-S[language]’
- ‘--source[=language]’
- The bulk of the input file is expected to be written in ASCII,
except for parts, like comments and string constants, which are
written in a charset other than ASCII. When language is ‘c’, recoding
proceeds only within the contents of comments and strings, while
everything else is copied without recoding. When language is ‘po’,
recoding proceeds only within translator comments (those having
whitespace immediately following the initial ‘#’) and within the
contents of msgstr strings.
For this to work, the non-ASCII encoding of the comment or string
must be such that an ASCII scan will successfully find where the
comment or string ends.
Even if ASCII is the usual charset for writing programs, some
compilers are able to read other charsets directly, UTF-8 for example.
Recode currently has no provision for reading mixed charset sources
which are not based on ASCII. The need for mixed recoding is probably
less pressing in such cases.
For example, after one does:
recode -Spo pc/..u8 < input.po > output.po
file output.po holds a copy of input.po in which only translator
comments and the contents of msgstr strings have been recoded from
the IBM-PC charset to pure UTF-8, without attempting conversion of
end-of-lines. Machine-generated comments and original msgid strings
are left untouched by this recoding.
If language is not specified, ‘c’ is assumed.
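
The selection rule for ‘po’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Recode is implemented: the function name `recode_po` is hypothetical, Python's `cp437` codec is assumed to stand in for the IBM-PC charset, and the scan relies on the ASCII-compatibility condition described above (line ends, ‘#’, quotes and keywords are found by looking at ASCII bytes only).

```python
def recode_po(data: bytes, src: str = "cp437", dst: str = "utf-8") -> bytes:
    """Recode only translator comments and msgstr strings of a PO file."""
    out = []
    in_msgstr = False
    # keepends=True keeps each line terminator as found, so end-of-lines
    # are not converted.
    for line in data.splitlines(keepends=True):
        if line.startswith(b"msgstr"):
            # Start of a translated string (also covers msgstr[0], ...).
            in_msgstr = True
            selected = True
        elif line.startswith(b'"'):
            # Continuation line: belongs to whatever string came before.
            selected = in_msgstr
        else:
            in_msgstr = False
            # Translator comment: '#' immediately followed by whitespace.
            selected = line[:1] == b"#" and line[1:2].isspace()
        out.append(line.decode(src).encode(dst) if selected else line)
    return b"".join(out)


sample = (
    b"# \x82t\x82 (translator note)\n"   # translator comment, recoded
    b"#. auto \x82\n"                    # machine-generated, untouched
    b'msgid "summer"\n'                  # original string, untouched
    b'msgstr "\x82t\x82"\n'              # translation, recoded
)
result = recode_po(sample)
```

In the sample, byte 0x82 is ‘é’ in the assumed IBM-PC codec, so the translator comment and the msgstr line come out with the UTF-8 sequence 0xC3 0xA9, while the ‘#.’ comment keeps its original byte.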