Free recode package



4.5 Handling errors

The recode program, which is built on the Recode library, needs to control whether recoding problems are reported, and then to reflect them in its exit status. The program should also tell the library whether recoding should be interrupted as soon as an error is met (sparing further processing when it is known in advance that a wrong result would be discarded anyway), or whether it should proceed nevertheless. The library groups errors into the following levels, listed in order of increasing severity.

RECODE_NO_ERROR
No error was met on previous library calls.

RECODE_NOT_CANONICAL
The input text was using one of the many alternative codings for some phenomenon, but not the one Recode would have canonically generated. So, if the reverse recoding is later attempted, it would produce a text having the same meaning as the original text, yet not being byte identical.

For example, a Base64 block in which end-of-lines appear elsewhere than at every 76 characters is not canonical. An e-circumflex in TeX which is coded as ‘\^{e}’ instead of ‘\^e’ is not canonical.

RECODE_AMBIGUOUS_OUTPUT
It has been discovered that if the reverse recoding were attempted on the text output by this recoding, the original text would not be obtained, only because an ambiguity was accidentally generated in the output text. This ambiguity would then cause the wrong interpretation to be taken.

Here are a few examples. If the Latin-1 sequence ‘e^’ is converted to Easy French and back, the result will be interpreted as e-circumflex, and so will not reflect the intent of the original two characters. When an IBM-PC text containing an isolated LF is recoded to Latin-1 and back, a spurious CR is inserted before the LF.

Currently, there are many cases in the library where the production of ambiguous output is not properly detected, because this detection is sometimes difficult to accomplish, or to accomplish speedily.

RECODE_UNTRANSLATABLE
One or more input characters could not be recoded, because there is no representation for them in the output charset.

Here are a few examples. Non-strict mode often allows Recode to compute on-the-fly mappings for unrepresentable characters, but strict mode prohibits such attribution of reversible translations: so strict mode might often trigger such an error. Most UCS-2 codes used to represent Asian characters cannot be expressed in various Latin charsets.

RECODE_INVALID_INPUT
The input text does not comply with the coding it is declared to hold. So, there is no way by which a reverse recoding would reproduce this text, because Recode should never produce invalid output.

Here are a few examples. In strict mode, ASCII text is not allowed to contain characters with the eighth bit set. UTF-8 encodings ought to be minimal[1].

RECODE_SYSTEM_ERROR
The underlying system reported an error while the recoding was going on, likely an input/output error. (This error symbol is currently unused in the library.)

RECODE_USER_ERROR
The programmer or user requested something the recoding library is unable to provide, or used the API wrongly. (This error symbol is currently unused in the library.)

RECODE_INTERNAL_ERROR
Something really wrong, which should normally never happen, was detected within the recoding library. This might be due to genuine bugs in the library, or maybe due to un-initialised or overwritten arguments to the API. (This error symbol is currently unused in the library.)

RECODE_MAXIMUM_ERROR
This error code should never be returned; it is only used internally as a sentinel marking the end of the list of all possible error codes.
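Because the levels are ordered by severity, a level can be compared against a threshold with an ordinary integer comparison. The following is a minimal sketch of that idea, not the library's own declaration: the enumeration is a local stand-in that merely lists the symbols in the order given above, and the helpers error_level_name and error_reaches are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Local stand-in for the error levels, in the order the text lists
   them (increasing severity).  This is an assumption made so that the
   sketch compiles on its own; the real declaration belongs to the
   Recode headers.  */
enum recode_error
{
  RECODE_NO_ERROR,
  RECODE_NOT_CANONICAL,
  RECODE_AMBIGUOUS_OUTPUT,
  RECODE_UNTRANSLATABLE,
  RECODE_INVALID_INPUT,
  RECODE_SYSTEM_ERROR,
  RECODE_USER_ERROR,
  RECODE_INTERNAL_ERROR,
  RECODE_MAXIMUM_ERROR          /* sentinel, never reported */
};

/* Hypothetical helper: a printable name for LEVEL, for diagnostics.
   The sentinel sizes the table, so it has no entry of its own.  */
static const char *
error_level_name (enum recode_error level)
{
  static const char *const names[RECODE_MAXIMUM_ERROR] = {
    "no error",
    "not canonical",
    "ambiguous output",
    "untranslatable",
    "invalid input",
    "system error",
    "user error",
    "internal error"
  };

  assert ((int) level >= 0 && (int) level < (int) RECODE_MAXIMUM_ERROR);
  return names[level];
}

/* Hypothetical helper: nonzero when LEVEL is at least as severe as
   THRESHOLD.  The ordering of the enumeration does all the work.  */
static int
error_reaches (enum recode_error level, enum recode_error threshold)
{
  return level >= threshold;
}
```

With this ordering, error_reaches (RECODE_INVALID_INPUT, RECODE_UNTRANSLATABLE) is nonzero, while error_reaches (RECODE_NOT_CANONICAL, RECODE_AMBIGUOUS_OUTPUT) is zero.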

One should be able to set the error level threshold for returning failure at end of recoding, and also the threshold for immediate interruption. If many errors occur while the recoding proceeds, none of which is severe enough to interrupt it, then the most severe error is retained, while the others are forgotten[2]. So, in case of an error, the possible actions currently are:

  • do nothing and let go, returning success at end of recoding,
  • just let go for now, but return failure at end of recoding,
  • interrupt recoding right away and return failure now.

See Task level, and particularly the description of the fields fail_level, abort_level and error_so_far, for more information about how errors are handled.
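The interplay of the two thresholds with the retained error can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: only the field names fail_level, abort_level and error_so_far come from the text; the cut-down struct task_sketch, the enumeration copies, and the helpers note_error and task_succeeded are hypothetical, and treating a level equal to a threshold as reaching it is also an assumption.

```c
#include <assert.h>

/* Local stand-in for the error levels, in increasing severity.  */
enum recode_error
{
  RECODE_NO_ERROR,
  RECODE_NOT_CANONICAL,
  RECODE_AMBIGUOUS_OUTPUT,
  RECODE_UNTRANSLATABLE,
  RECODE_INVALID_INPUT,
  RECODE_SYSTEM_ERROR,
  RECODE_USER_ERROR,
  RECODE_INTERNAL_ERROR,
  RECODE_MAXIMUM_ERROR          /* sentinel, never reported */
};

/* The three possible actions for a single error.  */
enum error_action
{
  ACTION_IGNORE,                /* let go, no effect on the result */
  ACTION_FAIL_LATER,            /* let go for now, fail at the end */
  ACTION_ABORT_NOW              /* interrupt and fail right away */
};

/* Hypothetical, cut-down task record holding only the three fields
   named in the text.  */
struct task_sketch
{
  enum recode_error fail_level;    /* threshold for failure at end */
  enum recode_error abort_level;   /* threshold for interruption */
  enum recode_error error_so_far;  /* most severe error met so far */
};

/* Record one error of severity LEVEL in TASK and tell which action
   this particular error calls for.  Only the most severe error is
   retained; milder ones are forgotten.  */
static enum error_action
note_error (struct task_sketch *task, enum recode_error level)
{
  assert (level > RECODE_NO_ERROR && level < RECODE_MAXIMUM_ERROR);

  if (level > task->error_so_far)
    task->error_so_far = level;

  if (level >= task->abort_level)
    return ACTION_ABORT_NOW;
  if (level >= task->fail_level)
    return ACTION_FAIL_LATER;
  return ACTION_IGNORE;
}

/* At end of recoding: nonzero when no retained error reached the
   failure threshold.  */
static int
task_succeeded (const struct task_sketch *task)
{
  return task->error_so_far < task->fail_level;
}
```

For instance, with fail_level set to RECODE_UNTRANSLATABLE and abort_level set to RECODE_INVALID_INPUT, a non-canonical input is let go, an untranslatable character makes the task fail at the end without stopping it, and invalid input interrupts the recoding at once.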


Footnotes

[1] The minimality of a UTF-8 encoding is guaranteed on output, but it is currently not checked on input.

[2] Another approach would have been to define the level symbols as masks instead, to give masks to the threshold setting routines, and to retain all errors. Yet I have never met such a need in practice myself, and so I fear it would be overkill. On the other hand, it might be interesting to maintain counters of how many times each kind of error occurred.