14.2 Adding new charsets
The main part of Recode is written in C, as are most single steps.
A few single steps need to recognise sequences of multiple characters;
these are often better written in Flex. It is easy for a programmer
to add a new charset to Recode. All it requires is writing a few
functions kept in a single .c file, adjusting Makefile.am, and
remaking Recode.
One of the functions should convert from any previous charset
to the new one. Any previous charset will do, but
try to select it so you will not lose too much
information while converting. The other function
should convert from the new charset to any older
one. You do not have to select the same old charset
as the one you selected for the previous routine.
Once again, select any charset for which you will
not lose too much information while converting.
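As a minimal sketch of such a pair of functions, assume a hypothetical 7-bit charset called "demo7" that is identical to ASCII except that code 0x23 means POUND SIGN, and pick Latin-1 as the older charset in both directions. The function names and the buffer-based interface are illustrative assumptions, not part of Recode. Each function returns the number of characters it could not represent, which makes the information loss visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charset "demo7": ASCII, except 0x23 is POUND SIGN.
   Nothing here is Recode's own API; it only illustrates the idea.  */

/* Old charset (Latin-1) to new charset (demo7).
   Returns how many characters had to be replaced by '?'.  */
size_t latin1_to_demo7(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t lost = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];

        if (c == 0xA3) {
            out[i] = 0x23;          /* POUND SIGN has a slot in demo7 */
        } else if (c == 0x23 || c >= 0x80) {
            out[i] = '?';           /* NUMBER SIGN and 8-bit codes are lost */
            lost++;
        } else {
            out[i] = c;             /* plain ASCII passes through */
        }
    }
    return lost;
}

/* New charset (demo7) to old charset (Latin-1).
   Returns how many input bytes were not valid demo7.  */
size_t demo7_to_latin1(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t lost = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];

        if (c == 0x23) {
            out[i] = 0xA3;          /* demo7 0x23 is Latin-1 POUND SIGN */
        } else if (c >= 0x80) {
            out[i] = '?';           /* not a 7-bit code: invalid input */
            lost++;
        } else {
            out[i] = c;
        }
    }
    return lost;
}
```

Going from Latin-1 to demo7 loses every accented letter, while the reverse direction loses nothing for valid input; this asymmetry is why the choice of the older charset matters for each direction separately.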
If, for either of these two functions, you have to read multiple
bytes of the old charset before recognising the
character to produce, you might prefer programming
it in Flex in a separate .l file. Model your C or
Flex files on one of those which already exist,
so as to keep the sources uniform. Besides, at
make time, all .l files are automatically
merged into a single big one by the script
mergelex.awk.
There are a few hidden rules about how to write new Recode
modules, which allow the automatic creation of
decsteps.h and initsteps.h at make time, and the proper merging of
all Flex files. Imitation is a simple approach which
relieves me of explaining all these rules! Start
with a module closely resembling what you intend to
do. Here is some advice for picking a model.
First decide whether your new charset module is to be
driven by algorithms rather than by tables. For
algorithmic recodings, see iconqnx.c for C code, or
txtelat1.l for Flex code. For table driven recodings, see
ebcdic.c for one-to-one style recodings, lat1html.c for one-to-many
style recodings, or atarist.c for double-step
style recodings. Just select an example from the
style that best fits your application.
Each of your source files should have its own
initialisation function, named module_charset, which is
meant to be executed quickly once, prior
to any recoding. It should declare the names of your
charsets and the single steps (or elementary
recodings) you provide, by calling
declare_step one or more times.
Besides the charset names, declare_step expects a description of
the recoding quality (see recodext.h) and two functions
that you also provide.
The first
such function has the purpose of allocating
structures, pre-conditioning conversion tables,
etc. It is also the way of further modifying the
STEP structure. This function is
executed if and only if the single step is retained
in an actual recoding sequence. If you do not need
such delayed initialisation, merely use
NULL for the function argument.
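The registration pattern described above can be sketched as follows. This is not Recode's actual API: the real declare_step and STEP are defined in Recode's own headers and their exact signatures are not reproduced here. Everything prefixed with demo_, the integer quality value, the "demo7" charset (ASCII with 0x23 meaning POUND SIGN) and the function module_demo7 are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for Recode's STEP and declare_step.  */
struct demo_step;
typedef void (*demo_init_fn)(struct demo_step *step);
typedef int (*demo_file_fn)(const struct demo_step *step, FILE *in, FILE *out);

struct demo_step {
    const char *before;          /* charset the step reads */
    const char *after;           /* charset the step writes */
    int quality;                 /* stand-in for the quality description */
    demo_init_fn init;           /* delayed initialisation, may be NULL */
    demo_file_fn recode;         /* recodes a whole file */
    const unsigned char *table;  /* set by the delayed initialisation */
};

#define DEMO_MAX_STEPS 8
struct demo_step demo_steps[DEMO_MAX_STEPS];
size_t demo_step_count;

void demo_declare_step(const char *before, const char *after, int quality,
                       demo_init_fn init, demo_file_fn recode)
{
    struct demo_step *step;

    assert(demo_step_count < DEMO_MAX_STEPS);
    step = &demo_steps[demo_step_count++];
    step->before = before;
    step->after = after;
    step->quality = quality;
    step->init = init;
    step->recode = recode;
    step->table = NULL;
}

static unsigned char latin1_demo7_table[256];

/* Delayed initialisation: builds the table only when the step is used.  */
static void init_latin1_demo7(struct demo_step *step)
{
    for (int i = 0; i < 256; i++)
        latin1_demo7_table[i] = (unsigned char) i;
    latin1_demo7_table[0xA3] = 0x23;
    step->table = latin1_demo7_table;
}

static int file_latin1_demo7(const struct demo_step *step, FILE *in, FILE *out)
{
    int c;

    assert(step->table != NULL);
    while ((c = getc(in)) != EOF)
        if (putc(step->table[c], out) == EOF)
            return -1;
    return 0;
}

/* Algorithmic step: needs no table, hence no delayed initialisation.  */
static int file_demo7_latin1(const struct demo_step *step, FILE *in, FILE *out)
{
    int c;

    (void) step;
    while ((c = getc(in)) != EOF)
        if (putc(c == 0x23 ? 0xA3 : c, out) == EOF)
            return -1;
    return 0;
}

/* Module initialisation: cheap, only declares the steps.  */
void module_demo7(void)
{
    demo_declare_step("latin1", "demo7", 1, init_latin1_demo7, file_latin1_demo7);
    demo_declare_step("demo7", "latin1", 0, NULL, file_demo7_latin1);
}

/* Called only for a step retained in an actual recoding sequence.  */
void demo_retain(struct demo_step *step)
{
    if (step->init != NULL)
        step->init(step);
}
```

In this sketch, module_demo7 does no table work at all; the 256-entry table is built only when demo_retain is called on the first step, which mirrors the "if and only if the single step is retained" rule.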
The second
function executes the elementary recoding on a
whole file. There are a few cases when you can
spare writing this function:
- Some single steps do nothing but copy the input onto the output.
In this case, you can use the predefined function
file_one_to_one, with a delayed initialisation presetting the
STEP field one_to_one to the predefined value
one_to_same.
- Some single steps are driven by a table which
recodes one character into another; if the
recoding does nothing else, you can use the
predefined function
file_one_to_one,
while having a delayed initialisation for
presetting the STEP field
one_to_one with your table.
- Some
single steps are driven by a table which recodes
one character into a string; if the recoding does
nothing else, you can use the predefined function
file_one_to_many, while having a
delayed initialisation for presetting the
STEP field one_to_many
with your table.
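The three table-driven cases above can be illustrated with a small sketch. It works on in-memory buffers rather than whole files to stay short, and it does not reproduce Recode's file_one_to_one or file_one_to_many; the demo_ functions, their signatures, and the convention that a NULL entry means "copy the byte unchanged" are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Identity table: with it, a one-to-one step is a pure copy.  */
void demo_fill_identity(unsigned char table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char) i;
}

/* One-to-one: every input byte is replaced by exactly one output byte.  */
size_t demo_one_to_one(const unsigned char table[256],
                       const unsigned char *in, size_t n, unsigned char *out)
{
    assert(table != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = table[in[i]];
    return n;
}

/* One-to-many: every input byte is replaced by a string.
   A NULL entry copies the byte unchanged (assumption of this sketch).
   Returns the output length, or (size_t) -1 if out is too small.  */
size_t demo_one_to_many(const char *const table[256],
                        const unsigned char *in, size_t n,
                        char *out, size_t out_size)
{
    size_t used = 0;

    assert(table != NULL && in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t) -1;
    for (size_t i = 0; i < n; i++) {
        char single = (char) in[i];
        const char *src = &single;
        size_t len = 1;

        if (table[in[i]] != NULL) {
            src = table[in[i]];
            len = strlen(src);
        }
        if (used + len >= out_size)   /* keep room for the final NUL */
            return (size_t) -1;
        memcpy(out + used, src, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

A one-to-many table in the style of an HTML recoding would, for instance, map '<' to "&lt;" and '&' to "&amp;" and leave every other entry NULL.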
If you have
a recoding table handy in a suitable format but do
not use one of the predefined recoding functions,
it is still a good idea to use a delayed
initialisation to save it anyway, because
recode option ‘-h’ will take advantage of
this information when available.
Finally, edit Makefile.am to add the source
file name of your routines to the
C_STEPS or L_STEPS macro
definition, depending on whether your routines are
written in C or in Flex.