

Various other charsets

Even if these charsets were originally added to recode for handling texts written in French, they find other uses. We used them a lot for writing French diacriticised texts in the past, so recode knows how to handle them particularly well for French texts.

World Wide Web representations

This charset is available in recode under the name HTML, with h4 as an acceptable alias.

HTML texts used by the World Wide Web often use special sequences, beginning with an ampersand & and ending with a semicolon ;, for representing characters. The sequence may itself start with a number sign # and be followed by digits, so forming a numeric character reference, or else be an alphabetic identifier, so forming a character entity reference.

Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over channels that are not 8-bit clean while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).

When you recode from another charset to HTML, beware that all occurrences of double quotes, ampersands, and left or right angle brackets are translated into special sequences. However, in practice, people often use ampersands and angle brackets in the other charset for introducing HTML commands, compromising it: it is not pure HTML, nor is it pure other charset. Since these particular translations can be rather inconvenient, they may be specifically inhibited through the command option -d (see section Using mixed charset input).

Codes not having a mnemonic entity are output by recode using the `&#nnn;' notation, where nnn is a decimal representation of the UCS code value. When there is an entity name for a character, it is always preferred over a numeric character reference. ASCII printable characters are always generated directly. So is the newline. While reading HTML, recode supports numeric character references as alternate writings, even when written as hexadecimal numbers, as in `&#xfffd;'. This is documented in:

http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
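The output rules just described can be illustrated with a short Python sketch. It is not how recode itself is implemented; the function names to_html and from_html are invented for this illustration, and the entity table is the HTML 4 one shipped with Python's standard library. The text does not say what happens to control characters other than the newline, so emitting them as numeric references here is an assumption.

```python
from html import unescape
from html.entities import codepoint2name  # HTML 4 entity names only


def to_html(text: str) -> str:
    """Illustrative encoder following the rules stated above (not recode's code)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # An entity name exists (this covers ", &, < and > as well):
            # it is preferred over a numeric character reference.
            out.append(f"&{codepoint2name[cp]};")
        elif ch == "\n" or 0x20 <= cp <= 0x7E:
            # Printable ASCII and the newline are generated directly.
            out.append(ch)
        else:
            # No mnemonic entity: decimal numeric character reference.
            # (Assumption: other control characters are handled this way too.)
            out.append(f"&#{cp};")
    return "".join(out)


def from_html(text: str) -> str:
    """Illustrative decoder: accepts names, decimal and hexadecimal references."""
    # html.unescape follows HTML 5 rules, so it is more permissive than HTML 4.
    return unescape(text)
```

With this sketch, to_html('déjà <vu>') yields 'd&eacute;j&agrave; &lt;vu&gt;', and from_html('&#xfffd;') yields the single character U+FFFD.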

About levels of HTML, François Yergeau yergeau@alis.com writes:

When recode translates to HTML, the translation occurs according to http://www.w3.org/TR/REC-html40/sgml/entities.html. It is also assumed that RFC 1866 has equivalent contents. When translating from HTML, recode accepts some alternative special sequences, to be forgiving when files use older HTML tables.

The recode program can be used to normalise an HTML file using oldish conventions. For example, it accepts `&AE;', as this once was a valid writing, somewhere. However, it should always produce `&AElig;' instead of `&AE;'. Yet, this is not completely true. If one does:

recode h3..h3 < input

the operation will be optimised into a mere copy, and you can get `&AE;' this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:

recode h3..u2,u2..h3 < input

then `&AE;' should be normalised into `&AElig;' by the operation.
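The reason a genuine round trip normalises the entity can be sketched in Python: decoding maps both spellings to the same code point, and encoding then emits only the canonical name. This is only a model of the idea, not recode's code. The LEGACY table is hypothetical and holds just the one alias mentioned above, and the functions decode, encode and normalise are names chosen for this sketch.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Hypothetical table of outdated entity names; only the example from the text.
LEGACY = {"AE": 0xC6}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode(text: str) -> str:
    """Replace known entity names, current or legacy, by their character."""
    def repl(match):
        name = match.group(1)
        cp = name2codepoint.get(name, LEGACY.get(name))
        # Unknown names are left untouched.
        return chr(cp) if cp is not None else match.group(0)
    return ENTITY.sub(repl, text)


def encode(text: str) -> str:
    """Emit the canonical HTML 4 entity name wherever one exists."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


def normalise(text: str) -> str:
    # Going through real characters is what defeats the "mere copy" shortcut.
    return encode(decode(text))
```

Here normalise('C&AE;sar') gives 'C&AElig;sar', while text that already uses `&AElig;' comes back unchanged.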

LaTeX macro calls

This charset is available in recode under the name LaTeX and has ltex as an alias. It is used for ASCII files coded to be read by LaTeX or, in certain cases, by TeX.

Whenever you recode from another charset to LaTeX, beware that all occurrences of backslashes \ are translated into the string `\backslash{}'. However, in practice, people often use backslashes in the other charset for introducing TeX commands, compromising it: it is not pure TeX, nor is it pure other charset. Since this translation of backslashes into `\backslash{}' can be rather inconvenient, it may be inhibited through the command option -d (see section Using mixed charset input).
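As a small illustration of why this matters, the following Python sketch mimics the backslash translation described above. The function escape_backslashes and its inhibit parameter are hypothetical; inhibit merely stands in for the effect the text attributes to -d and says nothing about what else that option does.

```python
def escape_backslashes(text: str, inhibit: bool = False) -> str:
    """Model of the backslash translation towards LaTeX (illustration only)."""
    if inhibit:
        # Backslashes are assumed to already introduce TeX commands.
        return text
    return text.replace("\\", "\\backslash{}")
```

An input such as `\emph{x}' becomes `\backslash{}emph{x}' by default, which destroys the command, and is left alone when the translation is inhibited.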

GNU project documentation files

This charset is available in recode under the name Texinfo and has texi and ti for aliases. It is used by the GNU project for its documentation. Texinfo files may be converted into Info files by the makeinfo program and into nice printed manuals by the TeX system.

Even if recode may transform other charsets to Texinfo, it may not read Texinfo files yet. In these times, usages are also changing between versions of Texinfo, and recode only partially succeeds in correctly following these changes. So, for now, Texinfo support in recode should be considered as work still in progress (!).

African charsets

Some African character sets are available for a few languages, when these are heavily used in countries where French is also currently spoken.

One African charset is usable for Bambara, Ewondo and Fulfude, as well as for French. This charset is available in recode under the name AFRFUL-102-BPI_OCIL. Accepted aliases are bambara, bra, ewondo and fulfude. Transliterated forms of the same are available under the name AFRFUL-103-BPI_OCIL. Accepted aliases are t-bambara, t-bra, t-ewondo and t-fulfude.

Another African charset is usable for Lingala, Sango and Wolof, as well as for French. This charset is available in recode under the name AFRLIN-104-BPI_OCIL. Accepted aliases are lingala, lin, sango and wolof. Transliterated forms of the same are available under the name AFRLIN-105-BPI_OCIL. Accepted aliases are t-lingala, t-lin, t-sango and t-wolof.

To ease exchange with ISO-8859-1, there is a charset conveying transliterated forms for Latin-1 in a way which is compatible with the other African charsets in this series. This charset is available in recode under the name AFRL1-101-BPI_OCIL. Accepted aliases are t-fra and t-francais.

Cyrillic charsets

The following Cyrillic charsets are already available in recode through RFC 1345 tables: CP1251 with aliases 1251, ms-cyrl and windows-1251; CSN_369103 with aliases ISO-IR-139 and KOI8_L2; ECMA-cyrillic with aliases ECMA-113, ECMA-113:1986 and iso-ir-111; IBM880 with aliases 880, CP880 and EBCDIC-Cyrillic; INIS-cyrillic with alias iso-ir-51; ISO-8859-5 with aliases cyrillic, ISO-8859-5:1988 and iso-ir-144; KOI-7; KOI-8 with alias GOST_19768-74; KOI8-R; KOI8-RU; and finally KOI8-U.

Some confusion seems to remain about Cyrillic charsets, and because a few users requested it repeatedly, recode now offers special services in that area. Consider these charsets as experimental and debatable, as the extraneous tables describing them are still a bit fuzzy or non-standard. Hopefully, in the long run, Cyrillic will be covered in Keld Simonsen's works to the satisfaction of everybody, and this section will simply disappear.

KEYBCS2
This charset is available under the name KEYBCS2, with Kamenicky as an accepted alias.
CORK
This charset is available under the name CORK, with T1 as an accepted alias.
KOI-8_CS2
This charset is available under the name KOI-8_CS2.

Easy French conventions

This charset is available in recode under the name Texte and has txte for an alias. It is a seven-bit code, identical to ASCII-BS, save for French diacritics which are noted using a slightly different convention.

At text entry time, these conventions provide a little speed up. At read time, they slightly improve the readability over a few alternate ways of coding diacritics. Of course, it would be better to have a specialised keyboard to make direct eight-bit entries and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In a few mailing environments, and sadly enough, it still happens that the eighth bit is often wilfully destroyed.

Easy French has been in use in France for a while. I only slightly adapted it (the diaeresis option) to make it more comfortable for several usages in Québec originating from Université de Montréal. In fact, the main problem for me was not necessarily to invent Easy French, but to recognise the "best" convention to use (best not being defined here), and to try to solve the main pitfalls associated with the selected convention. In short, we have:

e'
for e (and some other vowels) with an acute accent,
e`
for e (and some other vowels) with a grave accent,
e^
for e (and some other vowels) with a circumflex accent,
e"
for e (and some other vowels) with a diaeresis,
c,
for c with a cedilla.

There is no attempt at expressing the ae and oe diphthongs. French also uses tildes over n and a, but seldom, and this is not represented either. In some countries, : is used instead of " to mark diaeresis. recode supports only one convention per call, depending on the -c option of the recode command. French quotes (sometimes called "angle quotes") are noted the same way English quotes are noted in TeX, that is, by `` and ''. No effort has been made to preserve Latin ligatures (ae, oe), which are representable in several other charsets. So, these ligatures may be lost through Easy French conventions.

The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognisers.

  1. A comma which follows a c is interpreted as a cedilla only if it is followed by one of the vowels a, o or u.
  2. A single quote which follows an e does not necessarily mean an acute accent if it is followed by a single other one. For example:
    e'
    will give an e with an acute accent.
    e''
    will give a simple e, with a closing quotation mark.
    e'''
    will give an e with an acute accent, followed by a closing quotation mark.
    There is a problem induced by this convention if there are English quotations within a French text. In sentences like:
    There's a meeting at Archie's restaurant.
    
    the single quotes will be mistaken twice for acute accents. So English contractions and suffix possessives could be mangled.
  3. A double quote or colon, depending on the -c option, which follows a vowel is interpreted as a diaeresis only if it is followed by another letter. But there are several French words that end with a diaeresis, and the recode library is aware of them. There are words ending in "igue", either feminine words without a related masculine form (besaiguë and ciguë), or feminine words with a related masculine form(14) (aiguë, ambiguë, contiguë, exiguë, subaiguë and suraiguë). There are also words not ending in "igue", but instead ending in "i"(15), ending in "e" (canoë), or ending in "u"(16) (Esaü). Just to complete this topic, note that it would be wrong to make a rule treating all words ending in "igue" as needing a diaeresis, as there are counter-examples (becfigue, b`esigue, bigue, bordigue, bourdigue, brigue, contre-digue, digue, d'intrigue, fatigue, figue, garrigue, gigue, igue, intrigue, ligue, prodigue, sarigue and zigue).
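The three rules above can be made concrete with a small Python sketch of a decoder. It is an illustration written for this section, not the recogniser found in recode. Several points are assumptions: which vowels accept which mark (the text only says "e and some other vowels"), the reading of a run of exactly two single quotes after e as a TeX-style closing quotation mark, and the list of words ending with a diaeresis, which holds only the examples quoted above. Only the " form of the diaeresis is handled, not the : variant.

```python
import unicodedata

# Assumed vowel sets per mark: (combining character, accepted base letters).
MARKS = {
    "'": ("\u0301", "eE"),           # acute
    "`": ("\u0300", "aeuAEU"),       # grave
    "^": ("\u0302", "aeiouAEIOU"),   # circumflex
    '"': ("\u0308", "eiuEIU"),       # diaeresis
}

# Words allowed to end with a diaeresis (examples from the text only).
FINAL_DIAERESIS = {
    "besaigue", "cigue", "aigue", "ambigue", "contigue",
    "exigue", "subaigue", "suraigue", "canoe", "esau",
}


def compose(base: str, combining: str) -> str:
    """Join a base letter and a combining mark into one precomposed letter."""
    return unicodedata.normalize("NFC", base + combining)


def decode_easy_french(text: str) -> str:
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else ""
        after = text[i + 2] if i + 2 < n else ""
        # Rule 1: "c," is a cedilla only when followed by a, o or u.
        if ch in "cC" and nxt == "," and after and after in "aouAOU":
            out.append(compose(ch, "\u0327"))
            i += 2
            continue
        if nxt in MARKS and ch in MARKS[nxt][1]:
            accept = True
            if nxt == "'":
                # Rule 2: exactly two quotes are a closing quotation mark.
                run = 0
                while i + 1 + run < n and text[i + 1 + run] == "'":
                    run += 1
                accept = run != 2
            elif nxt == '"' and not after.isalpha():
                # Rule 3: a final diaeresis is kept only for known words.
                start = i
                while start > 0 and text[start - 1].isalpha():
                    start -= 1
                accept = text[start:i + 1].lower() in FINAL_DIAERESIS
            if accept:
                out.append(compose(ch, MARKS[nxt][0]))
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)
```

Feeding it the English sentence quoted in rule 2 shows the pitfall: both contractions come out with an acute accent on the e, while a string such as franc,ais decodes as expected to français.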

Mule as a multiplexed charset

This version of recode barely starts supporting multiplexed or super-charsets, that is, those encoding methods by which a single text stream may contain a combination of more than one constituent charset. The only multiplexed charset in recode is Mule, and even then, it is only very partially implemented: the only correspondence available is with Latin-1. The author quickly implemented this only because he needed it for himself. However, Mule support is intended to become more complete in subsequent releases of recode.

Multiplexed charsets are not to be confused with mixed charset texts (see section Using mixed charset input). For mixed charset input, the rules for distinguishing which charset is current at any given place are rather informal, and derived from the semantics of what the file contains. On the other hand, multiplexed charsets are designed to be interpreted fairly precisely, and quite independently of any informational context.

The spelling Mule originally stands for multilingual enhancement to GNU Emacs; it is the result of a collective effort orchestrated by Handa Ken'ichi since 1993. When Mule got rewritten in the main development stream of GNU Emacs 20, the FSF renamed it MULE, meaning multilingual environment in GNU Emacs. Even if the charset Mule is meant to stay internal to GNU Emacs, it sometimes breaks loose in external files, and as a consequence, a recoding tool is sometimes needed. Within Emacs, Mule comes with leim, which stands for libraries of emacs input methods. One of these libraries is named quail(17).

