Free recode package


14.2 Adding new charsets

The main part of Recode is written in C, as are most single steps. A few single steps need to recognise sequences of multiple characters; these are often better written in Flex. It is easy for a programmer to add a new charset to Recode. All it requires is writing a few functions kept in a single .c file, adjusting Makefile.am, and remaking Recode.

One of the functions should convert from any previous charset to the new one. Any previous charset will do, but try to select it so that you will not lose too much information while converting. The other function should convert from the new charset to any older one. You do not have to select the same old charset as the one you selected for the previous routine. Once again, select any charset for which you will not lose too much information while converting.
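As a rough illustration of such a pair of functions, here is a minimal sketch that does not use the Recode API at all. It assumes a made-up "toy" charset (ASCII plus one extra code, 0x80, standing for e-acute) and picks Latin-1 as the existing charset on both sides. The names latin1_to_toy, toy_to_latin1 and convert_buffer are hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Hypothetical toy charset, invented for this sketch: ASCII plus one
   extra code, 0x80, meaning e-acute (0xE9 in Latin-1).  Nothing here
   comes from the Recode sources.  */
enum { TOY_E_ACUTE = 0x80, LATIN1_E_ACUTE = 0xE9, SUBSTITUTE = '?' };

/* Existing charset (Latin-1) to the new charset.  Characters the toy
   charset cannot represent are lost and replaced by SUBSTITUTE.  */
unsigned char
latin1_to_toy (unsigned char code)
{
  if (code < 0x80)
    return code;
  if (code == LATIN1_E_ACUTE)
    return TOY_E_ACUTE;
  return SUBSTITUTE;
}

/* New charset back to an existing one.  Latin-1 is reused here, but
   any charset able to hold e-acute would lose just as little.  */
unsigned char
toy_to_latin1 (unsigned char code)
{
  if (code < 0x80)
    return code;
  if (code == TOY_E_ACUTE)
    return LATIN1_E_ACUTE;
  return SUBSTITUTE;
}

/* Apply one of the two conversions in place to a whole buffer.  */
void
convert_buffer (unsigned char *buffer, size_t length,
                unsigned char (*convert) (unsigned char))
{
  size_t index;

  for (index = 0; index < length; index++)
    buffer[index] = convert (buffer[index]);
}
```

In this sketch, every character of the toy charset survives a trip to Latin-1 and back, while a Latin-1 character such as u-umlaut is replaced by a question mark on the way in. That asymmetry is the kind of information loss to keep in mind when choosing the existing charsets.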

If, for either of these two functions, you have to read multiple bytes of the old charset before recognising the character to produce, you might prefer programming it in Flex in a separate .l file. Prototype your C or Flex files after one of those which exist already, so as to keep the sources uniform. Note also that, at make time, all .l files are automatically merged into a single big one by the script mergelex.awk.

There are a few hidden rules about how to write new Recode modules, which allow the automatic creation of decsteps.h and initsteps.h at make time, and the proper merging of all Flex files. Imitation is a simple approach which relieves me of explaining all these rules! Start with a module closely resembling what you intend to do. Here is some advice for picking a model. First decide whether your new charset module is to be driven by algorithms rather than by tables. For algorithmic recodings, see iconqnx.c for C code, or txtelat1.l for Flex code. For table driven recodings, see ebcdic.c for one-to-one style recodings, lat1html.c for one-to-many style recodings, or atarist.c for double-step style recodings. Just select an example from the style that best fits your application.

Each of your source files should have its own initialisation function, named module_charset, which is meant to be executed quickly once, prior to any recoding. It should declare the names of your charsets and the single steps (or elementary recodings) you provide, by calling declare_step one or more times. Besides the charset names, declare_step expects a description of the recoding quality (see recodext.h) and two functions you also provide.

The first such function has the purpose of allocating structures, pre-conditioning conversion tables, etc. It is also the place to further modify the STEP structure. This function is executed if and only if the single step is retained in an actual recoding sequence. If you do not need such delayed initialisation, merely use NULL for the function argument.
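The split between quick declaration and delayed initialisation can be pictured with the following self-contained mock. It is only a sketch under stated assumptions: struct sketch_step, sketch_declare_step and sketch_retain_step are hypothetical stand-ins, not the real STEP structure or declare_step, whose actual fields and arguments (including the recoding quality, left out here) are defined in recodext.h. The toy charset is also invented.

```c
#include <stddef.h>

/* Hypothetical stand-in for a step record; NOT Recode's STEP type.  */
struct sketch_step;
typedef void (*sketch_init_fn) (struct sketch_step *);
typedef void (*sketch_recode_fn) (const struct sketch_step *,
                                  unsigned char *, size_t);

struct sketch_step
{
  const char *before;           /* charset name on input */
  const char *after;            /* charset name on output */
  sketch_init_fn init;          /* delayed initialisation, or NULL */
  sketch_recode_fn recode;      /* elementary recoding, or NULL */
  const unsigned char *table;   /* set by the delayed initialisation */
};

enum { SKETCH_MAX_STEPS = 8 };
struct sketch_step sketch_steps[SKETCH_MAX_STEPS];
size_t sketch_step_count = 0;

/* Mock registration: records the step and does no other work.  */
struct sketch_step *
sketch_declare_step (const char *before, const char *after,
                     sketch_init_fn init, sketch_recode_fn recode)
{
  struct sketch_step *step;

  if (sketch_step_count == SKETCH_MAX_STEPS)
    return NULL;
  step = &sketch_steps[sketch_step_count++];
  step->before = before;
  step->after = after;
  step->init = init;
  step->recode = recode;
  step->table = NULL;
  return step;
}

/* Mock of what happens when a step is retained in a recoding
   sequence: only then does the delayed initialisation run.  */
void
sketch_retain_step (struct sketch_step *step)
{
  if (step->init != NULL)
    step->init (step);
}

unsigned char toy_table[256];

/* Delayed initialisation: build the table and attach it to the step.  */
void
init_latin1_toy (struct sketch_step *step)
{
  int code;

  for (code = 0; code < 256; code++)
    toy_table[code] = (unsigned char) (code < 0x80 ? code : '?');
  toy_table[0xE9] = 0x80;       /* e-acute in the invented toy charset */
  step->table = toy_table;
}

/* Elementary recoding, driven by the table prepared above.  */
void
recode_with_table (const struct sketch_step *step,
                   unsigned char *buffer, size_t length)
{
  size_t index;

  for (index = 0; index < length; index++)
    buffer[index] = step->table[buffer[index]];
}

/* Quick module initialisation: declares steps, builds nothing.  */
void
module_toy (void)
{
  sketch_declare_step ("latin1", "toy", init_latin1_toy, recode_with_table);
  sketch_declare_step ("toy", "toy-copy", NULL, NULL);
}
```

The point of the design is that module_toy stays cheap no matter how many steps it declares, since the 256-entry table is only built for a step that is actually used. The second declaration shows the NULL case, where no delayed initialisation is needed.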

The second function executes the elementary recoding on a whole file. There are a few cases in which you can avoid writing this function:

  • Some single steps do nothing but a pure copy of the input onto the output; in this case, you can use the predefined function file_one_to_one, while having a delayed initialisation for presetting the STEP field one_to_one to the predefined value one_to_same.
  • Some single steps are driven by a table which recodes one character into another; if the recoding does nothing else, you can use the predefined function file_one_to_one, while having a delayed initialisation for presetting the STEP field one_to_one with your table.
  • Some single steps are driven by a table which recodes one character into a string; if the recoding does nothing else, you can use the predefined function file_one_to_many, while having a delayed initialisation for presetting the STEP field one_to_many with your table.
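To make the three table styles concrete, here is a minimal sketch of what a one-to-one driver, an identity table, and a one-to-many driver might look like. These are not the bodies of file_one_to_one or file_one_to_many, and they work on memory buffers rather than files. All names starting with sketch_ are hypothetical, and treating a NULL entry in the one-to-many table as "copy the byte unchanged" is an assumption of this example.

```c
#include <stddef.h>
#include <string.h>

/* Fill a table mapping every byte to itself, in the spirit of a pure
   copy step.  Hypothetical helper, not Recode's one_to_same.  */
void
sketch_fill_identity (unsigned char table[256])
{
  int code;

  for (code = 0; code < 256; code++)
    table[code] = (unsigned char) code;
}

/* One character into another: each input byte indexes the table.
   Returns the number of bytes written, always equal to LENGTH.  */
size_t
sketch_one_to_one (const unsigned char table[256],
                   const unsigned char *input, size_t length,
                   unsigned char *output)
{
  size_t index;

  for (index = 0; index < length; index++)
    output[index] = table[input[index]];
  return length;
}

/* One character into a string: each input byte selects a string.
   A NULL entry means "copy the byte unchanged" (assumption of this
   sketch).  Returns the number of bytes written, or (size_t) -1 if
   OUTPUT, of size CAPACITY, is too small.  No terminating NUL is
   written.  */
size_t
sketch_one_to_many (const char *const table[256],
                    const unsigned char *input, size_t length,
                    char *output, size_t capacity)
{
  size_t used = 0;
  size_t index;

  for (index = 0; index < length; index++)
    {
      const char *string = table[input[index]];
      size_t needed = string != NULL ? strlen (string) : 1;

      if (needed > capacity - used)
        return (size_t) -1;
      if (string != NULL)
        memcpy (output + used, string, needed);
      else
        output[used] = (char) input[index];
      used += needed;
    }
  return used;
}
```

With a table in which only the entries for `<` and `&` are set to `&lt;` and `&amp;`, sketch_one_to_many turns the three bytes of `a<b` into the six bytes of `a&lt;b`. In all three styles the recoding logic is generic and only the table differs, which is why a delayed initialisation that presets the table is all a new module has to supply.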

If you have a recoding table handy in a suitable format but do not use one of the predefined recoding functions, it is still a good idea to use a delayed initialisation to save it anyway, because recode option ‘-h’ will take advantage of this information when available.

Finally, edit Makefile.am to add the source file name of your routines to the C_STEPS or L_STEPS macro definition, depending on whether your routines are written in C or in Flex.