Even if these charsets were originally added to recode for handling texts written in French, they find other uses. We did use them a lot for writing French diacriticised texts in the past, so recode knows how to handle these particularly well for French texts.
This charset is available in recode under the name HTML, with h4 as an acceptable alias.
HTML texts used on the World Wide Web often use special sequences, beginning with an ampersand & and ending with a semicolon ;, for representing characters. The sequence may itself start with a number sign # and be followed by digits, so forming a numeric character reference, or else be an alphabetic identifier, so forming a character entity reference.
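The two kinds of sequences can be told apart mechanically. The following is only a minimal sketch in Python, not recode's own code: the names `classify` and `decode_references` are hypothetical, and the entity table comes from Python's standard `html.entities` module (the HTML 4 names), which is assumed here to be close enough for illustration.

```python
import re
from html.entities import name2codepoint

# "&" + ("#" digits | "#x" hex digits | alphabetic identifier) + ";"
_REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def classify(reference):
    """Name the kind of a single sequence, or return None if it is not one."""
    match = _REFERENCE.fullmatch(reference)
    if match is None:
        return None
    if match.group(1).startswith("#"):
        return "numeric character reference"
    return "character entity reference"


def decode_references(text):
    """Replace every recognised sequence in text by the character it denotes."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
        else:
            code = name2codepoint.get(body)
        # Unknown names and out-of-range numbers are left untouched.
        if code is None or code > 0x10FFFF:
            return match.group(0)
        return chr(code)
    return _REFERENCE.sub(replace, text)
```

For instance, `decode_references("caf&eacute;")` and `decode_references("caf&#233;")` both yield the same five-character string.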
Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over non 8-bit clean channels while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).
When you recode from another charset to HTML, beware that all occurrences of double quotes, ampersands, and left or right angle brackets are translated into special sequences. However, in practice, people often use ampersands and angle brackets in the other charset for introducing HTML commands, compromising it: it is not pure HTML, nor is it pure other charset. These particular translations can be rather inconvenient; they may be specifically inhibited through the command option -d (see section Using mixed charset input).
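The effect of inhibiting these translations can be pictured with a small Python sketch. It is an illustration under stated assumptions, not recode's implementation: the function `to_html_mixed`, its `diacritics_only` parameter (standing in for the -d option) and the two-entry `DIACRITICS` table are all hypothetical.

```python
# The four markup-significant characters named in the text.
MARKUP = {
    ord('"'): "&quot;",
    ord("&"): "&amp;",
    ord("<"): "&lt;",
    ord(">"): "&gt;",
}

# Tiny illustrative table; a real converter would cover far more characters.
DIACRITICS = {
    ord("\u00e9"): "&eacute;",  # e with acute accent
    ord("\u00e0"): "&agrave;",  # a with grave accent
}


def to_html_mixed(text, diacritics_only=False):
    """Translate diacritics always, and markup characters unless inhibited."""
    table = dict(DIACRITICS)
    if not diacritics_only:
        table.update(MARKUP)
    return text.translate(table)
```

With the default setting, `<b>` in the input comes out as `&lt;b&gt;`; with `diacritics_only=True`, hand-written HTML commands pass through unchanged while accented letters are still converted.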
Codes not having a mnemonic entity are output by recode using the `&#nnn;' notation, where nnn is a decimal representation of the UCS code value. When there is an entity name for a character, it is always preferred over a numeric character reference. ASCII printable characters are always generated directly. So is the newline. While reading HTML, recode supports numeric character references as alternate writings, even when written as hexadecimal numbers, as in `&#xfffd;'.
This is documented in:
http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
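The output rules just described (entity name first, plain ASCII next, decimal reference last) can be sketched as follows. This is a hedged Python illustration, not recode's code: `encode_char` and `to_html` are hypothetical names, and Python's `html.entities.codepoint2name` table of HTML 4 entity names is assumed to stand in for recode's own table.

```python
from html.entities import codepoint2name


def encode_char(ch):
    """Encode one character following the preference order in the text."""
    code = ord(ch)
    # 1. A mnemonic entity, when one exists, is always preferred.
    if code in codepoint2name:
        return "&%s;" % codepoint2name[code]
    # 2. ASCII printable characters and the newline go out directly.
    if ch == "\n" or 0x20 <= code <= 0x7E:
        return ch
    # 3. Everything else becomes a decimal numeric character reference.
    return "&#%d;" % code


def to_html(text):
    return "".join(encode_char(ch) for ch in text)
```

Because the table also holds the names for the double quote, ampersand and angle brackets, step 1 takes care of those four ASCII characters before step 2 is reached.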
About levels of HTML, François Yergeau <yergeau@alis.com> writes:
http://www.alis.com:8085/ietf/html/html-latin1.sgml. In addition, four i18n-related entities were added: `&zwnj;' (`&#8204;'), `&zwj;' (`&#8205;'), `&lrm;' (`&#8206;') and `&rlm;' (`&#8207;').
HTML 3.2 (http://www.w3.org/TR/REC-html32.html) took up the full Latin-1 list but not the i18n-related entities from RFC 2070.
HTML 4.0 (http://www.w3.org/TR/REC-html40/) has the whole Latin-1 list, a set of entities for symbols, mathematical symbols, and Greek letters, and another set for markup-significant and internationalization characters comprising the 4 ASCII entities, the 4 i18n-related entities from RFC 2070, plus some more.
When recode translates to HTML, the translation occurs according to http://www.w3.org/TR/REC-html40/sgml/entities.html. It is also assumed that RFC 1866 has equivalent contents. When translating from HTML, recode accepts some alternative special sequences, to be forgiving when files use older HTML tables.
The recode program can be used to normalise an HTML file using oldish conventions. For example, it accepts `&AE;', as this once was a valid writing, somewhere. However, it should always produce `&AElig;' instead of `&AE;'. Yet, this is not completely true. If one does:
recode h3..h3 < input
the operation will be optimised into a mere copy, and you can get `&AE;' this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:
recode h3..u2,u2..h3 < input
then `&AE;' should be normalised into `&AElig;' by the operation.
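The idea behind such a normalisation can be sketched in Python: read each entity name through a table of accepted older spellings, then write it back under its current name. This is only an illustration; `normalise_entities` and `LEGACY_NAMES` are hypothetical, the table holds nothing but the one `&AE;' example from the text, and Python's `html.entities` tables are assumed to play the role of the HTML 4.0 entity list.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Older spellings accepted on input; only &AE; is taken from the text.
LEGACY_NAMES = {"AE": "AElig"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalise_entities(text):
    """Rewrite accepted entity names into their current canonical form."""
    def replace(match):
        name = LEGACY_NAMES.get(match.group(1), match.group(1))
        code = name2codepoint.get(name)
        if code is None:
            return match.group(0)  # unknown name: leave it untouched
        # Decode to a code value, then re-encode: the round trip is what
        # forces the canonical name, much like going through u2 above.
        return "&%s;" % codepoint2name[code]
    return _ENTITY.sub(replace, text)
```

A plain copy of the input, by contrast, never passes through the table, which is why the optimised `h3..h3` request leaves `&AE;' alone.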
This charset is available in recode under the name LaTeX and has ltex as an alias. It is used for ASCII files coded to be read by LaTeX or, in certain cases, by TeX.
Whenever you recode from another charset to LaTeX, beware that all occurrences of backslashes \ are translated into the string `\backslash{}'. However, in practice, people often use backslashes in the other charset for introducing TeX commands, compromising it: it is not pure TeX, nor is it pure other charset. This translation of backslashes into `\backslash{}' can be rather inconvenient; it may be inhibited through the command option -d (see section Using mixed charset input).
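A small Python sketch may make the backslash issue concrete. It is not recode's implementation: `to_latex` and its `diacritics_only` parameter (standing in for -d) are hypothetical, and the three-entry `ACCENTS` table uses ordinary LaTeX accent commands purely as an assumed example of what a converter might emit.

```python
# Illustrative table only; standard LaTeX accent commands.
ACCENTS = {
    "\u00e9": r"\'e",   # e with acute accent
    "\u00e8": r"\`e",   # e with grave accent
    "\u00e7": r"\c{c}", # c with cedilla
}


def to_latex(text, diacritics_only=False):
    """Convert accented letters, and backslashes unless inhibited."""
    out = []
    for ch in text:
        if ch == "\\" and not diacritics_only:
            out.append(r"\backslash{}")
        else:
            out.append(ACCENTS.get(ch, ch))
    return "".join(out)
```

When the input already contains TeX commands such as `\emph{...}`, the default mode turns the command itself into visible text, whereas `diacritics_only=True` keeps it working and converts only the accented letters.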
This charset is available in recode under the name Texinfo and has texi and ti for aliases. It is used by the GNU project for its documentation. Texinfo files may be converted into Info files by the makeinfo program and into nice printed manuals by the TeX system.
Even if recode may transform other charsets to Texinfo, it may not read Texinfo files yet. Moreover, usages are changing between versions of Texinfo, and recode only partially succeeds in correctly following these changes. So, for now, Texinfo support in recode should be considered as work still in progress (!).
Some African character sets are available for a few languages, when these are heavily used in countries where French is also currently spoken.
One African charset is usable for Bambara, Ewondo and Fulfude, as well as for French. This charset is available in recode under the name AFRFUL-102-BPI_OCIL. Accepted aliases are bambara, bra, ewondo and fulfude. Transliterated forms of the same are available under the name AFRFUL-103-BPI_OCIL. Accepted aliases are t-bambara, t-bra, t-ewondo and t-fulfude.
Another African charset is usable for Lingala, Sango and Wolof, as well as for French. This charset is available in recode under the name AFRLIN-104-BPI_OCIL. Accepted aliases are lingala, lin, sango and wolof. Transliterated forms of the same are available under the name AFRLIN-105-BPI_OCIL. Accepted aliases are t-lingala, t-lin, t-sango and t-wolof.
To ease exchange with ISO-8859-1, there is a charset conveying transliterated forms for Latin-1 in a way which is compatible with the other African charsets in this series. This charset is available in recode under the name AFRL1-101-BPI_OCIL. Accepted aliases are t-fra and t-francais.
The following Cyrillic charsets are already available in recode through RFC 1345 tables: CP1251 with aliases 1251, ms-cyrl and windows-1251; CSN_369103 with aliases ISO-IR-139 and KOI8_L2; ECMA-cyrillic with aliases ECMA-113, ECMA-113:1986 and iso-ir-111; IBM880 with aliases 880, CP880 and EBCDIC-Cyrillic; INIS-cyrillic with alias iso-ir-51; ISO-8859-5 with aliases cyrillic, ISO-8859-5:1988 and iso-ir-144; KOI-7; KOI-8 with alias GOST_19768-74; KOI8-R; KOI8-RU and finally KOI8-U.
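To give a feeling for what recoding between two of these charsets amounts to, here is a minimal Python sketch. It does not use recode at all: it relies on Python's own built-in codecs, whose names (`koi8_r`, `cp1251`) are Python's and are only assumed to correspond to the KOI8-R and CP1251 charsets listed above. The helper `recode_bytes` is a hypothetical name.

```python
def recode_bytes(data, source, target):
    """Reinterpret bytes from one charset into another via Unicode."""
    return data.decode(source).encode(target)


# The same Russian word, first as KOI8-R bytes, then as CP1251 bytes.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
koi8_bytes = word.encode("koi8_r")
cp1251_bytes = recode_bytes(koi8_bytes, "koi8_r", "cp1251")
```

The two byte strings differ even though they denote the same text, which is the whole reason a recoding step is needed when files move between systems.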
There seems to remain some confusion in Cyrillic charsets, and because a few users requested it repeatedly, recode now offers special services in that area. Consider these charsets as experimental and debatable, as the extraneous tables describing them are still a bit fuzzy or non-standard. Hopefully, in the long run, Cyrillic will be covered in Keld Simonsen's works to the satisfaction of everybody, and this section will simply disappear.
KEYBCS2
KEYBCS2, with Kamenicky as an accepted alias.
CORK
CORK, with T1 as an accepted alias.
KOI-8_CS2
KOI-8_CS2.
This charset is available in recode under the name Texte and has txte for an alias. It is a seven-bit code, identical to ASCII-BS, save for French diacritics which are noted using a slightly different convention.
At text entry time, these conventions provide a little speed up. At read time, they slightly improve the readability over a few alternate ways of coding diacritics. Of course, it would be better to have a specialised keyboard to make direct eight-bit entries and fonts for immediately displaying eight-bit ISO Latin-1 characters. But not everybody is so fortunate. In a few mailing environments, sadly enough, it still happens that the eighth bit is often willfully destroyed.
Easy French has been in use in France for a while. I only slightly adapted it (the diaeresis option) to make it more comfortable to several usages in Québec originating from Université de Montréal. In fact, the main problem for me was not necessarily to invent Easy French, but to recognise the "best" convention to use (best not being defined here), and to try to solve the main pitfalls associated with the selected convention. In short, we have:
There is no attempt at expressing the ae and oe diphthongs. French also uses tildes over n and a, but seldom, and this is not represented either. In some countries, : is used instead of " to mark diaeresis. recode supports only one convention per call, depending on the -c option of the recode command.
French quotes (sometimes called "angle quotes") are noted the same way English quotes are noted in TeX, that is, by `` and ''.
No effort has been made to preserve Latin ligatures (ae, oe) which are representable in several other charsets. So, these ligatures may be lost through Easy French conventions.
The convention is prone to losing information, because the diacritic meaning overloads some characters that already have other uses. To alleviate this, some knowledge of the French language is built into the recognition routines. So, the following subtleties are systematically obeyed by the various recognisers.
In a sentence such as:
There's a meeting at Archie's restaurant.
the single quotes will be mistaken twice for acute accents. So English contractions and suffix possessives could be mangled.
A double quote or colon, depending on the -c option, which follows a vowel is interpreted as a diaeresis only if it is followed by another letter. But there are in French several words that end with a diaeresis, and the recode library is aware of them. There are words ending in "igue", either feminine words without a relative masculine (besaiguë and ciguë), or feminine words with a relative masculine(14) (aiguë, ambiguë, contiguë, exiguë, subaiguë and suraiguë). There are also words not ending in "igue", but instead either ending by "i"(15), ending by "e" (canoë) or ending by "u"(16) (Esaü).
Just to complete this topic, note that it would be wrong to make a rule for all words ending in "igue" as needing a diaeresis, as there are counter-examples (becfigue, bésigue, bigue, bordigue, bourdigue, brigue, contre-digue, digue, d'intrigue, fatigue, figue, garrigue, gigue, igue, intrigue, ligue, prodigue, sarigue and zigue).
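The diaeresis rule and its word list can be sketched as a small recogniser. This Python code is only an illustration of the rule as stated above, not recode's routine: `decode_diaeresis` is a hypothetical name, the double quote is assumed as the diaeresis mark, and the exception set contains only the words quoted in the text.

```python
import re

# Vowel to vowel-with-diaeresis.
DIAERESIS = str.maketrans(
    "aeiouAEIOU",
    "\u00e4\u00eb\u00ef\u00f6\u00fc\u00c4\u00cb\u00cf\u00d6\u00dc",
)

# Words that legitimately end with a diaeresis, in unaccented lower case.
FINAL_DIAERESIS_WORDS = {
    "besaigue", "cigue", "aigue", "ambigue", "contigue",
    "exigue", "subaigue", "suraigue", "canoe", "esau",
}

# Letters of the word so far, then a vowel, then the mark.
_MARK = re.compile(r'([A-Za-z]*)([aeiouAEIOU])"')


def decode_diaeresis(text):
    """Turn vowel + double quote into a diaeresis only where the rule allows."""
    def replace(match):
        following = text[match.end():match.end() + 1]
        word = (match.group(1) + match.group(2)).lower()
        if following.isalpha() or word in FINAL_DIAERESIS_WORDS:
            return match.group(1) + match.group(2).translate(DIAERESIS)
        return match.group(0)  # an ordinary closing quote
    return _MARK.sub(replace, text)
```

A closing quotation mark after a word such as digue is thus left alone, while the same mark after aigue is read as a diaeresis because that word is in the list.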
This version of recode barely starts supporting multiplexed or super-charsets, that is, those encoding methods by which a single text stream may contain a combination of more than one constituent charset. The only multiplexed charset in recode is Mule, and even then, it is only very partially implemented: the only correspondence available is with Latin-1. The author quickly implemented this only because he needed it for himself. However, Mule support is intended to become more complete in subsequent releases of recode.
Multiplexed charsets are not to be confused with mixed charset texts (see section Using mixed charset input). For mixed charset input, the rules for distinguishing which charset is current at any given place are somewhat informal, and derived from the semantics of what the file contains. On the other hand, multiplexed charsets are designed to be interpreted fairly precisely, and quite independently of any informational context.
The spelling Mule originally stands for multilingual enhancement to GNU Emacs; it is the result of a collective effort orchestrated by Handa Ken'ichi since 1993. When Mule got rewritten in the main development stream of GNU Emacs 20, the FSF renamed it MULE, meaning multilingual environment in GNU Emacs. Even if the charset Mule is meant to stay internal to GNU Emacs, it sometimes breaks loose in external files, and as a consequence, a recoding tool is sometimes needed. Within Emacs, Mule comes with leim, which stands for libraries of emacs input methods. One of these libraries is named quail(17).