Free recode package


14.4 Comments on the library design

  • Why a shared library? There are many different approaches to reducing the system requirements for handling all the tables needed in the Recode library. One of them is to have the tables in an external format and only read them in on demand. After having pondered this for a while, I finally decided against it, mainly because it involves its own kind of installation complexity, and it is not clear to me that it would be as interesting as I first imagined.

    It looks more efficient to see all tables and algorithms already mapped into virtual memory from the start of the execution, yet not loaded in actual memory, than to go through many disk accesses for opening various data files once the program is already started, as would be needed with other solutions. Using a shared library also has the indirect effect of making various algorithms handily available, right in the same modules providing the tables. This considerably alleviates the burden of maintenance.

    Of course, I would like to later make an exception for only a few tables, built locally by users for their own particular needs once Recode is installed. Recode should just go and fetch them. I do not perceive this as very urgent, yet it is useful enough to be worth implementing.

    Currently, all tables needed for recoding are precompiled into binaries, and all these binaries are then made into a shared library. As an initial step, I turned Recode into a main program and a non-shared library. This allowed me to tidy up the API, get rid of all global variables, etc. It required a surprising amount of program source massaging. But once this was cleaned up enough, it was easy to use Gordon Matzigkeit's libtool package, and take advantage of the Automake interface to neatly turn the non-shared library into a shared one.

    Sites linking with the Recode library, whose system does not support any form of shared libraries, might end up with bulky executables. Surely, the Recode library will have to be used statically, and might not be very nicely usable on such systems. It seems that progress has a price for those being slow at it.

    There is a locality problem I did not address yet. Currently, the Recode library takes many cycles to initialise itself, calling each module in turn for it to set up associated knowledge about charsets, aliases, elementary steps, recoding weights, etc. Then, the recoding sequence is decided out of the command given. I would not be surprised if initialisation was taking a perceivable fraction of a second on slower machines. One thing to do, most probably not right in version 3.5, but in the version after, would be to have Recode pre-load all tables and dump them at installation time. The result would then be compiled and added to the library. This would spare many initialisation cycles, but more importantly, would avoid calling all library modules, scattered through the virtual memory, and so possibly causing many spurious page exceptions each time the initialisation is requested, which is at least once per program execution.

  • Why not a central charset?

    It would be simpler, and I would like it, if something like ISO 10646 was used as a turning template for all charsets in Recode. Even if I think it could help to a certain extent, I'm still not fully sure it would be sufficient in all cases. Moreover, some people disagree about using ISO 10646 as the central charset, to the point I cannot totally ignore them, and surely, Recode is not a means for me to force my own opinions on people. I would like Recode to be practical more than dogmatic, and to reflect usage more than religions.

    Currently, if you ask Recode to go from charset1 to charset2 chosen at random, it is highly probable that the best path will be quickly found as:

              charset1..UCS-2..charset2
    

    That is, it will almost always use the UCS as a trampoline between charsets. However, UCS-2 will immediately be optimised out, and charset1..charset2 will often be performed in a single step through a permutation table generated on the fly for the circumstance [1].
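    The idea of optimising the intermediate charset out can be illustrated with a minimal sketch. It does not use Recode's own tables or API: it assumes Python's standard codecs as a stand-in for the two elementary steps, and the names build_direct_table and recode are hypothetical. Unlike a true permutation, this sketch simply marks bytes that have no counterpart in the destination charset.

```python
def build_direct_table(src, dst):
    """Compose src -> Unicode -> dst into one 256-entry byte table.

    Entries are destination byte values, or None when the source byte
    has no single-byte equivalent in the destination charset.
    """
    table = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(src)   # step 1: src -> Unicode pivot
            out = ch.encode(dst)             # step 2: Unicode pivot -> dst
        except UnicodeError:
            table.append(None)
            continue
        table.append(out[0] if len(out) == 1 else None)
    return table


def recode(data, table, fallback=ord("?")):
    """Apply the merged table in a single pass, with no intermediate text."""
    return bytes(fallback if table[b] is None else table[b] for b in data)


if __name__ == "__main__":
    table = build_direct_table("latin-1", "cp437")
    print(recode(b"caf\xe9", table))
```

    Once the table is built, each input byte costs one lookup, and the pivot charset no longer appears at run time.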

    In those few cases where UCS-2 is not selected as a conceptual intermediate, I plan to study whether it could be made so. But I guess some cases will remain where UCS-2 is not a proper choice. Even if UCS is often a good choice, I do not intend to forcefully restrain Recode around UCS-2 (nor UCS-4) for now. We might come to that one day, but it will come out of the natural evolution of Recode. It will then reflect a fact, rather than a preset dogma.

  • Why not iconv?

    The iconv routine and library allow for converting characters from an input buffer to an output buffer, synchronously advancing both buffer cursors. If the output buffer is not big enough to receive all of the conversion, the routine returns with the input cursor set at the position where the conversion could later be resumed, and the output cursor set to indicate how far the output buffer has been filled. Although this scheme is simple and nice, the Recode library does not offer it currently. Why not?

    When long sequences of decodings, stepwise recodings, and re-encodings are involved, as happens in real life, synchronising the input buffer back to where it should have stopped, when the output buffer becomes full, is a difficult problem. Oh, we could make it simpler at the expense of losing space or speed: by inserting markers between each input character and counting them at the output end; by processing only one character at a time through the whole sequence; by repeatedly attempting to recode various subsets of the input buffer, binary searching on their length until the output just fits. The overhead of such solutions looks prohibitive to me, and the gain very minimal. I do not see a real advantage, nowadays, in imposing a fixed length on an output buffer. It makes things so much simpler and more efficient to just let the output buffer size float a bit.
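    To make the cost of the last workaround concrete, here is a minimal sketch of the binary search approach. It is not Recode code: pipeline is a hypothetical stand-in for a multi-step recoding (here Latin-1 to UTF-8 through Python's standard codecs), convert_bounded is an illustrative name, and the sketch assumes a single-byte source charset so that any prefix of the input is a valid place to stop.

```python
def pipeline(chunk):
    """Hypothetical multi-step recoding: Latin-1 -> Unicode -> UTF-8."""
    return chunk.decode("latin-1").encode("utf-8")


def convert_bounded(data, out_size):
    """Find the longest input prefix whose recoded form fits in out_size.

    Returns (consumed, output, calls): the number of input bytes used,
    the recoded bytes, and how many trial recodings the search needed.
    """
    lo, hi = 0, len(data)
    calls = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2          # bias upward so the loop ends
        calls += 1
        if len(pipeline(data[:mid])) <= out_size:
            lo = mid                      # this prefix fits, try longer
        else:
            hi = mid - 1                  # too long, try shorter
    return lo, pipeline(data[:lo]), calls


if __name__ == "__main__":
    consumed, output, calls = convert_bounded(b"caf\xe9 cr\xe8me", 5)
    print(consumed, output, calls)
```

    Every probe recodes a prefix from scratch through the whole sequence, so filling one output buffer costs a number of full recodings that grows with the logarithm of the input length. This is the kind of overhead the paragraph above has in mind.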

    Of course, if the above problem was solved, the iconv library could be easily emulated, given that Recode has similar knowledge about charsets. Whether this is solved or not, the iconv program remains trivial (given similar knowledge about charsets). I also presume that the genxlt program would be easy too, but I do not have enough detailed specifications of it to be sure.

    Many years ago, Recode was using a similar scheme, and I found it rather hard to manage for some cases. I rethought the overall structure of Recode to get away from that scheme, and never regretted it. I perceive iconv as an artificial solution which surely has some elegance and virtues, but I do not find it really useful as it stands: one always has to wrap iconv into something more refined, extending it for real cases. From past experience, I think it is unduly hard to fully implement this scheme. It would be awkward for us to do contortions for the sole purpose of implementing exactly its specification, without real, properly grounded reasons (other than the fact that some people once thought it was worth standardising). It is much better to immediately aim for the refinement we need, without uselessly forcing ourselves into the dubious detour iconv represents.

    Some may argue that if Recode was using a comprehensive charset as a turning template, as discussed in a previous point, this would make iconv easier to implement. Some may be tempted to say that the cases which are hard to handle are not really needed, nor interesting, anyway. I feel, and fear a bit, some pressure for Recode to be split into the part that fits the iconv model well and the part that does not, with this second part considered less important and maybe dropped one of these days. My guess is that users of the Recode library, whatever its form, would not like to have such arbitrary limitations. In the long run, we should not have to explain to our users that some recodings may not be made available just because they do not fit the simple model we had in mind when we did it. Instead, we should try to stay open to the difficulties of real life. There are still a lot of complex needs for Asian people, say, that Recode does not currently address, while it should. Not only should the doors stay open, but we should force them wider!


Footnotes

[1] If strict mapping is requested, another efficient device will be used instead of a permutation.