14.4 Comments on the library design
- Why a shared library?
There are many different approaches to reduce
system requirements to handle all tables needed
in the Recode library. One of them is to have the
tables in an external format and only read them
in on demand. After having pondered this for a
while, I finally decided against it, mainly
because it involves its own kind of installation
complexity, and it is not clear to me that it
would be as interesting as I first imagined.
It looks more efficient to see all tables
and algorithms already mapped into virtual
memory from the start of the execution, yet not
loaded in actual memory, than to go through
many disk accesses for opening various data
files once the program is already started, as
this would be needed with other solutions.
Using a shared library also has the indirect
effect of making various algorithms handily
available, right in the same modules providing
the tables. This greatly eases the burden of
maintenance.
Of course, I would like to later make an
exception for only a few tables, built locally
by users for their own particular needs once
Recode is installed. Recode should just go and
fetch them. I do not perceive this as very
urgent, but it is useful enough to be worth
implementing.
Currently, all tables needed for recoding
are precompiled into binaries, and all these
binaries are then made into a shared library.
As an initial step, I turned Recode into a main
program and a non-shared library. This allowed
me to tidy up the API, get rid of all global
variables, etc. It required a surprising amount
of program source massaging. But once this was
clean enough, it was easy to use Gordon
Matzigkeit's libtool package, and
take advantage of the Automake interface to
neatly turn the non-shared library into a
shared one.
Sites linking with the Recode library, whose
system does not support any form of shared
libraries, might end up with bulky executables.
Surely, the Recode library will have to be used
statically, and might not be very nicely usable on
such systems. It seems that progress has a
price for those being slow at it.
There is a locality problem I did not
address yet. Currently, the Recode library
takes many cycles to initialise itself, calling
each module in turn for it to set up associated
knowledge about charsets, aliases, elementary
steps, recoding weights, etc. Then,
the recoding sequence is decided out of the
command given. I would not be surprised if
initialisation was taking a perceivable
fraction of a second on slower machines. One
thing to do, most probably not right in version
3.5, but in the version after, would be to have
Recode pre-load all tables and dump them at
installation time. The result would then be
compiled and added to the library. This would
spare many initialisation cycles, but more
importantly, would avoid calling all library
modules, scattered through the virtual memory,
and so, possibly causing many spurious page
exceptions each time the initialisation is
requested, at least once per program
execution.
- Why not a central charset?
It would be simpler, and I would like it, if
something like ISO 10646 was used
as a turning template for all charsets in
Recode. Even if I think it could help to a
certain extent, I'm still not fully sure it
would be sufficient in all cases. Moreover,
some people disagree about using ISO 10646
as the central charset, to the
point I cannot totally ignore them, and surely,
Recode is not a means for me to force my own
opinions on people. I would like that Recode be
practical more than dogmatic, and reflect usage
more than religions.
Currently, if you ask Recode to go from
charset1 to charset2
chosen at random, it is highly probable that
the best path will be quickly found as:
charset1..UCS-2..charset2
That is, it will almost always use the
UCS as a trampoline between
charsets. However, UCS-2 will be
immediately optimised out, and
charset1..charset2 will
often be performed in a single step through a
permutation table generated on the fly for the
circumstance 1.
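As a rough illustration of how an intermediate charset can be optimised out, the sketch below fuses two byte-to-UCS-2 tables into a single byte-to-byte table, so that the recoding itself becomes one lookup per byte. This is not Recode's actual code: the names combine_tables, recode_in_place and NO_MAPPING are hypothetical, and the sketch assumes both charsets are plain 8-bit ones. When some source byte has no counterpart, the result is not a true permutation, and a caller-chosen fallback byte is used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFu      /* hypothetical: byte has no UCS-2 value */

/* Fuse "source -> UCS-2" and "UCS-2 -> destination" into one table.
   TO_UCS2_SRC[b] and TO_UCS2_DST[b] give the UCS-2 value of byte b in
   each charset.  Source bytes without a counterpart map to FALLBACK.
   Returns the number of such unmapped source bytes.  */
size_t
combine_tables (const uint16_t to_ucs2_src[256],
                const uint16_t to_ucs2_dst[256],
                unsigned char fallback,
                unsigned char combined[256])
{
  size_t unmapped = 0;

  assert (to_ucs2_src != NULL && to_ucs2_dst != NULL && combined != NULL);

  for (int s = 0; s < 256; s++)
    {
      uint16_t ucs = to_ucs2_src[s];
      int found = -1;

      /* Look for the destination byte carrying the same UCS-2 value.  */
      if (ucs != NO_MAPPING)
        for (int d = 0; d < 256; d++)
          if (to_ucs2_dst[d] == ucs)
            {
              found = d;
              break;
            }

      if (found < 0)
        {
          combined[s] = fallback;
          unmapped++;
        }
      else
        combined[s] = (unsigned char) found;
    }
  return unmapped;
}

/* Recode a buffer in place: a single table lookup per byte, with
   UCS-2 no longer appearing anywhere.  */
void
recode_in_place (unsigned char *buffer, size_t length,
                 const unsigned char combined[256])
{
  for (size_t i = 0; i < length; i++)
    buffer[i] = combined[buffer[i]];
}
```

The quadratic search is acceptable here because the table is built only once per request, while the lookup loop runs over every byte of the input.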
In those few cases where UCS-2
is not selected as a conceptual intermediate, I
plan to study if it could be made so. But I
guess some cases will remain where
UCS-2 is not a proper choice. Even
if UCS is often a good choice, I
do not intend to forcefully restrain Recode
around UCS-2 (nor
UCS-4) for now. We might come to
that one day, but it will come out of the
natural evolution of Recode. It will then
reflect a fact, rather than a preset dogma.
- Why not iconv?
The iconv routine and library allow
for converting characters from an input buffer
to an output buffer, synchronously advancing
both buffer cursors. If the output buffer is
not big enough to receive all of the
conversion, the routine returns with the input
cursor set at the position where the conversion
could later be resumed, and the output cursor
set to indicate until where the output buffer
has been filled. Although this scheme is simple
and nice, the Recode library does not offer it
currently. Why not?
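For readers unfamiliar with the scheme just described, here is a minimal sketch using the standard iconv_open, iconv and iconv_close calls. The output buffer is kept deliberately tiny so that iconv repeatedly stops with errno set to E2BIG, leaving both cursors where the conversion can be resumed. The helper name convert_in_chunks and the constant CHUNK_SIZE are hypothetical, and the sketch assumes the local iconv implementation knows the charset names it is given.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK_SIZE = 4 };        /* deliberately tiny, to force resumptions */

/* Convert INPUT from charset FROM to charset TO, passing through a small
   fixed-size output buffer and appending each filled chunk to RESULT.
   *RESUMPTIONS counts how often iconv stopped because the chunk was full.
   Returns the number of bytes stored in RESULT, or (size_t) -1 on error.
   Stateful encodings would also need a final flush call, omitted here.  */
size_t
convert_in_chunks (const char *from, const char *to,
                   const char *input, size_t input_length,
                   char *result, size_t result_size,
                   size_t *resumptions)
{
  iconv_t cd = iconv_open (to, from);   /* note the order: target first */
  char *in = (char *) input;            /* iconv wants char **, not const */
  size_t in_left = input_length;
  size_t written = 0;

  *resumptions = 0;
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  while (in_left > 0)
    {
      char chunk[CHUNK_SIZE];
      char *out = chunk;
      size_t out_left = sizeof chunk;
      size_t status = iconv (cd, &in, &in_left, &out, &out_left);
      size_t produced = sizeof chunk - out_left;

      assert (produced <= sizeof chunk);

      /* E2BIG only means "output full": both cursors are still valid.
         Any other error, or no progress at all, is fatal here.  */
      if ((status == (size_t) -1 && (errno != E2BIG || produced == 0))
          || produced > result_size - written)
        {
          iconv_close (cd);
          return (size_t) -1;
        }
      if (status == (size_t) -1)
        ++*resumptions;

      memcpy (result + written, chunk, produced);
      written += produced;
    }

  iconv_close (cd);
  return written;
}
```

Called with "UTF-8" as FROM, "UCS-2BE" as TO and the six bytes of "Recode" as input, the loop needs more than one pass, each pass picking up exactly where the previous one stopped.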
When long sequences of decodings, stepwise
recodings, and re-encodings are involved, as
happens in real life, synchronising the input
buffer back to where it should have stopped,
when the output buffer becomes full, is a
difficult problem. Oh, we could make it simpler
at the expense of losing space or speed: by
inserting markers between each input character
and counting them at the output end; by
processing only one character at a time through
the whole sequence; by repeatedly attempting to
recode various subsets of the input buffer,
binary searching on their length until the
output just fits. The overhead of such
solutions looks prohibitive to me, and the gain
very minimal. I do not see a real advantage,
nowadays, in imposing a fixed length on an output
buffer. It makes things so much simpler and
more efficient to just let the output buffer size
float a bit.
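To make the idea of a floating output buffer concrete, here is a minimal sketch, under the assumption that each recoding step simply appends to a buffer that grows on demand. The structure output_buffer and the functions output_append and latin1_to_utf8_step are hypothetical names chosen for illustration, not Recode's actual interface. The point is that a step which expands its input never has to stop and report how far it got.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable output buffer; not Recode's actual structure.  */
struct output_buffer
{
  char *data;
  size_t used;
  size_t allocated;
};

/* Append COUNT bytes, doubling the allocation as needed.
   Returns 0 on success, -1 if memory is exhausted.  */
int
output_append (struct output_buffer *ob, const char *bytes, size_t count)
{
  assert (ob->used <= ob->allocated);
  if (count == 0)
    return 0;

  if (count > ob->allocated - ob->used)
    {
      size_t wanted = ob->allocated ? ob->allocated : 64;
      char *grown;

      while (wanted - ob->used < count)
        wanted *= 2;            /* overflow check omitted for brevity */
      grown = realloc (ob->data, wanted);
      if (grown == NULL)
        return -1;
      ob->data = grown;
      ob->allocated = wanted;
    }
  memcpy (ob->data + ob->used, bytes, count);
  ob->used += count;
  return 0;
}

/* One recoding step, Latin-1 to UTF-8, which may expand its input.
   It never has to ask whether the output still has room.  */
int
latin1_to_utf8_step (const unsigned char *input, size_t length,
                     struct output_buffer *ob)
{
  for (size_t i = 0; i < length; i++)
    {
      char bytes[2];
      size_t count = 1;

      if (input[i] < 0x80)
        bytes[0] = (char) input[i];
      else
        {
          /* Two-byte UTF-8 form for code points 0x80 to 0xFF.  */
          bytes[0] = (char) (0xC0 | (input[i] >> 6));
          bytes[1] = (char) (0x80 | (input[i] & 0x3F));
          count = 2;
        }
      if (output_append (ob, bytes, count) < 0)
        return -1;
    }
  return 0;
}
```

A caller starts from a zeroed structure, runs the step over the whole input, reads the result from data and used, and finally releases data with free.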
Of course, if the above problem was solved,
the iconv library could be easily
emulated, given that Recode has similar
knowledge about charsets. Whether this
is solved or not, the iconv
program remains trivial (given similar
knowledge about charsets). I also presume that
the genxlt program would be easy
too, but I do not have enough detailed
specifications of it to be sure.
Many years ago, Recode was using a
similar scheme, and I found it rather hard to
manage for some cases. I rethought the overall
structure of Recode for getting away from that
scheme, and never regretted it. I perceive
iconv as an artificial solution
which surely has some elegances and virtues,
but I do not find it really useful as it
stands: one always has to wrap
iconv into something more refined,
extending it for real cases. From past
experience, I think it is unduly hard to fully
implement this scheme. It would be awkward for
us to do contortions for the sole purpose of
implementing exactly its specification, without
real, properly grounded reasons (other than the
fact some people once thought it was worth
standardising). It is much better to
immediately aim for the refinement we need,
without uselessly forcing us into the dubious
detour iconv represents.
Some may argue that if Recode was using a
comprehensive charset as a turning template, as
discussed in a previous point, this would make
iconv easier to implement. Some
may be tempted to say that the cases which are
hard to handle are not really needed, nor
interesting, anyway. I feel, and fear a bit, some
pressure for Recode to be split into the
part that well fits the iconv
model, and the part that does not fit,
considering this second part less important,
with the idea of dropping it one of these days,
maybe. My guess is that users of the Recode
library, whatever its form, would not like to
have such arbitrary limitations. In the long
run, we should not have to explain to our users
that some recodings may not be made available
just because they do not fit the simple model
we had in mind when we did it. Instead, we
should try to stay open to the difficulties of
real life. There are still a lot of complex
needs for Asian people, say, that Recode does
not currently address, while it should. Not
only should the doors stay open, but we should
force them wider!