Free recode package


5.7 Fully interpreted UCS dump

Another device may be used to get fully interpreted dumps of a UCS-2 stream of characters, with one UCS-2 character displayed per output line. Each line receives the RFC 1345 mnemonic for the character if it exists, the UCS-2 value of the character, and a descriptive comment for that character. Because each input character produces its own output line, the output of this conversion may be much larger than the input.

This charset is available in Recode under the name dump-with-names.

The dump-with-names feature has been implemented as a charset rather than a surface, which is surely debatable. The current implementation allows for dumping charsets other than UCS-2. For example, the command ‘recode l2..full < input’ implies a necessary conversion from Latin-2 to UCS-2, as dump-with-names is only connected out from UCS-2. In such cases, Recode does not display the original Latin-2 codes in the dump, only the corresponding UCS-2 values. To give a simpler example, the command

     echo 'Hello, world!' | recode us..dump

produces the following output:

     UCS2   Mne   Description
     
     0048   H     latin capital letter h
     0065   e     latin small letter e
     006C   l     latin small letter l
     006C   l     latin small letter l
     006F   o     latin small letter o
     002C   ,     comma
     0020   SP    space
     0077   w     latin small letter w
     006F   o     latin small letter o
     0072   r     latin small letter r
     006C   l     latin small letter l
     0064   d     latin small letter d
     0021   !     exclamation mark
     000A   LF    line feed (lf)
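Since every character gets its own line and the UCS-2 value always comes first, the dump is easy to post-process with standard tools. The following is a minimal sketch, not part of Recode: the helper name dump_codes is hypothetical, and it assumes the two title lines (the header and the blank line after it) shown above.

```shell
# Hypothetical helper: read a dump on standard input and print only
# the UCS-2 column, one hexadecimal value per line.
dump_codes() {
    # NR > 2 skips the title line and the blank line that follows it;
    # NF > 0 ignores any other empty line.
    awk 'NR > 2 && NF > 0 { print $1 }'
}
```

Assuming Recode is installed, ‘echo 'Hello, world!' | recode us..dump | dump_codes’ would then list the fourteen values from 0048 to 000A.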

The descriptive comment is normally given in English, using ASCII. If no English description is available but a French one is, the French description is given instead, using Latin-1. However, if the LANGUAGE or LANG environment variable begins with the letters ‘fr’, then preference goes to French when both descriptions are available.
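The language rule can be sketched in shell to predict which description a given environment selects. This illustrates the rule as stated here and is not Recode's own code. The function name prefers_french is hypothetical, and consulting LANGUAGE before LANG is an assumption.

```shell
# Hypothetical check: succeed when French descriptions would be
# preferred, that is, when the locale setting begins with "fr".
prefers_french() {
    # Assumption: LANGUAGE takes precedence, and LANG is consulted
    # only when LANGUAGE is unset or empty.
    case "${LANGUAGE:-${LANG:-}}" in
        fr*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

To request French descriptions for a single run without changing the session, the variable can be set on the command line, as in ‘echo A | LANGUAGE=fr recode us..dump’.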

Here is another example. To get the long description of the decimal code 237 in the Latin-5 table, one may use the following command:

     echo -n 237 | recode l5/d..dump

If your echo does not support ‘-n’, use ‘echo 237\c’ instead. The following command shows what Unicode U+03C6 means, while dropping the title lines:

     echo -n 0x03C6 | recode u2/x2..dump | tail -n +3
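Both commands above rely on details that vary between systems: echo handles ‘-n’ and ‘\c’ inconsistently, and many current tail implementations accept only the POSIX spelling ‘tail -n +3’ rather than the older ‘tail +3’. The sketch below wraps the portable forms in two small helpers. The names emit and strip_titles are hypothetical and not part of Recode.

```shell
# Hypothetical helper: write the argument with no trailing newline,
# a portable replacement for `echo -n` and `echo ...\c`.
emit() {
    printf '%s' "$1"
}

# Hypothetical helper: drop the two title lines of a dump (the header
# and the blank line after it), passing the rest through unchanged.
strip_titles() {
    tail -n +3
}

# Usage, assuming Recode is installed:
#   emit 237    | recode l5/d..dump
#   emit 0x03C6 | recode u2/x2..dump | strip_titles
```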