14.1 Overall organisation
The Recode mechanics slowly evolved for many years,
and it would be tedious to explain all problems I
met and mistakes I did all along, yielding the
current behaviour. Surely, one of the key choices
was to stop trying to do all conversions in memory,
one line or one buffer at a time. It has been
fruitful to use the character stream paradigm, and
the elementary recoding steps now convert a whole
stream to another. Most of the control complexity
in Recode exists so that each elementary recoding
step stays simple, making easier to add new ones.
The whole point of Recode, as I see it, is
providing a comfortable nest for growing new
charset conversions.
The main Recode driver
constructs, while initialising all conversion
modules, a table giving all the conversion routines
available (single steps) and for each,
the starting charset and the ending charset. If we
consider these charsets as being the nodes of a
directed graph, each single step may be considered
as oriented arc from one node to the other. A cost
is attributed to each arc: for example, a high
penalty is given to single steps which are prone to
losing characters, a lower penalty is given to
those which need studying more than one input
character for producing an output character,
etc.
Given a
starting code and a goal code, Recode computes the
most economical route through the elementary
recodings, that is, the best sequence of
conversions that will transform the input charset
into the final charset. To speed up execution,
Recode looks for subsequences of conversions which
are simple enough to be merged, and then
dynamically creates new single steps to represent
these mergings.
A double
step in Recode is a special concept
representing a sequence of two single steps, the
output of the first single step being the special
charset UCS-2, the input of the second
single step being also UCS-2. Special
Recode machinery dynamically produces efficient,
reversible, merge-able single steps out of these
double steps.
I
made some statistics about how many internal
recoding steps are required between any two
charsets chosen at random. The initial recoding
layout, before optimisation, always uses between 1
and 5 steps. Optimisation could sometimes produce
mere copies, which are counted as no steps at all.
In other cases, optimisation is unable to save any
step. The number of steps after optimisation is
currently between 0 and 5 steps. Of course, the
expected number of steps is affected by
optimisation: it drops from 2.8 to 1.8. This means
that Recode uses a theoretical average of a bit
less than one step per recoding job. This looks
good. This was computed using reversible recodings.
In strict mode, optimisation might be defeated
somewhat. Number of steps run between 1 and 6, both
before and after optimisation, and the expected
number of steps decreases by a lesser amount, going
from 2.2 to 1.3. This is still manageable.
|