The trivial surface consists of using a fixed number of bits
(often eight) for each character; together, the bits hold the integer
value of the index for the character in its charset table. There are
many kinds of surfaces beyond the trivial one, all having the purpose
of increasing selected qualities for storage or transmission.
For example, surfaces might increase the resistance to channel limits
(Base64), the transmission speed (gzip), the information privacy (DES),
the conformance to operating system conventions (CR-LF), the blocking
into records (VB), and surely other things as well(18).
Many surfaces may be applied to a stream of characters from a charset.
The order of application of surfaces is important, and surfaces
should be removed in the reverse order of their application.
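The ordering rule can be illustrated outside recode. The following is a minimal Python sketch, not recode's implementation: it uses the standard gzip and base64 modules as stand-ins for two surfaces, and the helper names apply_surfaces and remove_surfaces are hypothetical.

```python
import base64
import gzip

# Each stand-in "surface" is an (add, remove) pair of byte transformations.
# These are Python standard library functions, not recode internals.
SURFACES = {
    "gzip": (gzip.compress, gzip.decompress),
    "base64": (base64.b64encode, base64.b64decode),
}


def apply_surfaces(data, names):
    """Add the named surfaces in the order given."""
    for name in names:
        add, _ = SURFACES[name]
        data = add(data)
    return data


def remove_surfaces(data, names):
    """Remove the named surfaces in the reverse order of their application."""
    for name in reversed(names):
        _, remove = SURFACES[name]
        data = remove(data)
    return data


text = b"surfaces " * 20
wrapped = apply_surfaces(text, ["gzip", "base64"])
restored = remove_surfaces(wrapped, ["gzip", "base64"])
```

Removing the stand-in surfaces in the wrong order fails here, since the Base64 text is not a valid gzip stream.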
Although surfaces may generally be applied to various charsets, some
surfaces were specifically designed for a particular charset, and would
not make much sense if applied to other charsets. In such cases, these
conceptual surfaces have been implemented as recode charsets,
instead of as surfaces. This choice yields cleaner syntax
and usage. See section The universal charset.
Surfaces are implemented within recode as special charsets which may
only transform to or from the data special charset. Clever users
may use this knowledge for writing surface names in requests exactly as if
they were pure charsets, when the only need is to change surfaces without
any kind of recoding between real charsets. In such contexts, data
may also be used as if it were some kind of generic, anonymous charset:
the request `data..surface' merely adds the given surface,
while the request `surface..data' removes it.
We are only beginning to experiment with surfaces in recode, but
the concept opens the doors to many avenues; it is not clear yet which
ones are worth pursuing, and which should be abandoned. This chapter
presents all surfaces currently available.
A permutation is a surface transformation which reorders groups of eight-bit bytes. A 21 permutation exchanges pairs of successive bytes. If the text contains an odd number of bytes, the last byte is merely copied. A 4321 permutation inverts the order of quadruples of bytes. If the text does not contain a multiple of four bytes, the remaining bytes are nevertheless permuted as 321 if there are three bytes, 21 if there are two bytes, or merely copied otherwise.
21
This surface is available in recode under the name 21-Permutation
and has swabytes for an alias.
4321
This surface is available in recode under the name 4321-Permutation.
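The two permutations described above can be sketched in a few lines of Python. This is an illustration of the byte reordering only, not recode's code; the function names are hypothetical.

```python
def permute(data, size):
    """Reverse each group of `size` bytes.

    A short final group is reversed as well, so three leftover bytes
    come out as 321, two as 21, and a single byte is merely copied.
    """
    return b"".join(
        data[i:i + size][::-1] for i in range(0, len(data), size)
    )


def permutation_21(data):
    """Exchange pairs of successive bytes."""
    return permute(data, 2)


def permutation_4321(data):
    """Invert the order of each quadruple of bytes."""
    return permute(data, 4)
```

Since each group is simply reversed, applying the same permutation twice gives back the original bytes.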
The same charset might slightly differ, from one system to another, for
the single fact that end of lines are not represented identically on all
systems. The representation for an end of line within recode
is the ASCII or UCS code with value 10, or LF. Other
conventions for representing end of lines are available through surfaces.
CR
The carriage return character has ASCII value 13. Unless the library is
operating in strict mode, adding or removing the surface will in fact
exchange CR and LF, for better reversibility. However, in strict mode,
the exchange does not happen: any CR will be copied verbatim while applying
the surface, and any LF will be copied verbatim while removing it.
This surface is available in recode under the name CR;
it does not have any aliases. This is the implied surface for the Apple
Macintosh related charsets.
CR-LF
Removing a CR-LF surface will discard the first encountered C-z, which has
ASCII value 26, and everything following it in the text.
Adding this surface will not, however, append a C-z to the result.
This surface is available in recode under the name CR-LF and has cl
for an alias. This is the implied surface for the IBM
or Microsoft related charsets or code pages.
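The behaviour of these two surfaces can be sketched in Python as follows. This is not recode's implementation, and the function names are hypothetical. The sketch assumes the usual conventions: the CR surface writes each end of line as a carriage return, and the CR-LF surface writes it as a carriage return followed by a line feed.

```python
CR, LF, CTRL_Z = b"\r", b"\n", b"\x1a"

# Translation table exchanging byte values 13 (CR) and 10 (LF).
_SWAP = bytes.maketrans(CR + LF, LF + CR)


def add_cr(data, strict=False):
    """Add the CR surface: each LF becomes CR."""
    if strict:
        # Strict mode: no exchange, any CR already present is copied verbatim.
        return data.replace(LF, CR)
    # Default mode: exchange CR and LF, for better reversibility.
    return data.translate(_SWAP)


def remove_cr(data, strict=False):
    """Remove the CR surface: each CR becomes LF."""
    if strict:
        # Strict mode: no exchange, any LF already present is copied verbatim.
        return data.replace(CR, LF)
    return data.translate(_SWAP)


def add_crlf(data):
    """Add the CR-LF surface (assumed: LF becomes CR LF); no C-z is appended."""
    return data.replace(LF, CR + LF)


def remove_crlf(data):
    """Drop the first C-z and everything after it, then turn CR LF into LF."""
    head, _, _ = data.partition(CTRL_Z)
    return head.replace(CR + LF, LF)
```

In this sketch, the default mode of the CR surface round-trips text that already contains stray carriage returns, while strict mode does not.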
Some other charsets might have their own representation for an end of
line, which is different from LF. For example, this is the case
of various EBCDIC charsets, or Icon-QNX. The recoding of
end of lines is intimately tied into such charsets; it is not available
separately as surfaces.
RFC 1521 defines two 7-bit surfaces, meant to prepare 8-bit messages for transmission. Base64 is especially usable for binary entities, while Quoted-Printable is especially usable for text entities, in those cases where the lower 128 characters of the underlying charset coincide with ASCII.
Base64
This surface is available in recode under the name Base64,
with b64 and 64 as acceptable aliases.
Quoted-Printable
This surface is available in recode under the name Quoted-Printable,
with quote-printable and QP as acceptable aliases.
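To see what the two MIME encodings do to 8-bit data, here is a short Python sketch using the standard base64 and quopri modules. It illustrates the encodings themselves and does not go through recode; the sample text and variable names are arbitrary.

```python
import base64
import quopri

# 8-bit Latin-1 text whose lower 128 characters coincide with ASCII.
payload = "déjà vu\n".encode("latin-1")

# Base64 re-expresses every group of three bytes as four ASCII characters.
b64 = base64.b64encode(payload)

# Quoted-Printable leaves most ASCII text readable and escapes 8-bit
# bytes as "=" followed by two hexadecimal digits.
qp = quopri.encodestring(payload)

# Both encodings are reversible.
assert base64.b64decode(b64) == payload
assert quopri.decodestring(qp) == payload
```

Both results contain only 7-bit bytes. The Quoted-Printable form remains mostly legible for text, which is why it suits text entities better than binary ones.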
Note that UTF-7, which may also be considered as a MIME surface,
is provided as a genuine charset instead, as it necessarily relates to
UCS-2 and nothing else. See section Universal Transformation Format, 7 bits.
A little historical note, which also shows the three levels of acceptance of Internet standards: MIME changed from a "Proposed Standard" to a "Draft Standard" in 1993, and only became a "Full Standard" during 1996-12.
Dumps are surfaces meant to express, in ways which are a bit more readable,
the bit patterns used to represent characters. They allow the inspection
or debugging of character streams, and they may also help in the
production of C source code which, once compiled, would hold in memory a
copy of the original coding. However, recode does not attempt, in
any way, to produce complete C source files in dumps. User hand editing
or `Makefile' trickery is still needed for adding missing lines.
Dumps may be given in decimal, hexadecimal and octal, and be based on
chunks of either one, two or four eight-bit bytes. Formatting has been
chosen to respect the C language syntax for number constants, with commas
and newlines inserted appropriately.
However, when dumping two or four byte chunks, the last chunk may be incomplete. This is observable through the usage of narrower expression for that last chunk only. Such a shorter chunk would not be compiled properly within a C initialiser, as all members of an array share a single type, and so, have identical sizes.
Octal-1
This surface is available in recode under the name Octal-1,
with o1 and o as acceptable aliases.
Octal-2
This surface is available in recode under the name Octal-2
and has o2 for an alias.
Octal-4
This surface is available in recode under the name Octal-4
and has o4 for an alias.
Decimal-1
This surface is available in recode under the name Decimal-1,
with d1 and d as acceptable aliases.
Decimal-2
This surface is available in recode under the name Decimal-2
and has d2 for an alias.
Decimal-4
This surface is available in recode under the name Decimal-4
and has d4 for an alias.
Hexadecimal-1
This surface is available in recode under the name Hexadecimal-1,
with x1 and x as acceptable aliases.
Hexadecimal-2
This surface is available in recode under the name Hexadecimal-2,
with x2 for an alias.
Hexadecimal-4
This surface is available in recode under the name Hexadecimal-4,
with x4 for an alias.
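The idea of a hexadecimal dump with a narrower last chunk can be sketched in Python. This is only an illustration: the exact layout produced by recode (digit case, numbers per line, byte order within a chunk) is not reproduced here, and the function name and its parameters are hypothetical.

```python
def hex_dump(data, chunk=1, per_line=8):
    """Render bytes as C-style hexadecimal constants, `chunk` bytes per number.

    Assumption for this sketch: the bytes of a chunk are written in the
    order they appear in the input (most significant byte first).
    """
    numbers = []
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        # Two hex digits per byte, so a short final chunk prints narrower.
        numbers.append("0x" + piece.hex().upper())
    lines = [
        ", ".join(numbers[i:i + per_line])
        for i in range(0, len(numbers), per_line)
    ]
    # Commas and newlines separate the constants, as in a C initialiser.
    return ",\n".join(lines)
```

For instance, hex_dump(b"ABC", chunk=2) yields `0x4142, 0x43`, where the last constant is visibly shorter than the others.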
When removing a dump surface, that is, when reading dump results back
into a sequence of bytes, the narrower expression for a short last chunk
is recognised, so dumping is a fully reversible operation. However, in
case you want to produce dumps by other means than through recode,
beware that for decimal dumps, the library has to rely on the number of
spaces to establish the original byte size of the chunk.
Although the library might report reversibility errors, removing a dump
surface is a rather forgiving process: one may mix bases, group more or
fewer numbers per source line, or use shorter chunks elsewhere than at the
far end. Also, source lines not beginning with a number are skipped. So,
recode should often be able to read a whole C header file, wrapping
the results of a previous dump, and regenerate the original byte string.
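A forgiving reader of this kind can be sketched in Python. The sketch is deliberately simplified and is not recode's parser: it treats every number as a single byte, ignores chunk widths, and the function names are hypothetical.

```python
def parse_c_number(token):
    """Parse one C-style integer constant: 0x.. hex, 0.. octal, else decimal."""
    if token.lower().startswith("0x"):
        return int(token, 16)
    if token.startswith("0") and len(token) > 1:
        return int(token, 8)
    return int(token, 10)


def read_dump(text):
    """Rebuild a byte string from dump text, skipping non-numeric lines."""
    out = bytearray()
    for line in text.splitlines():
        stripped = line.strip()
        # Source lines not beginning with a number are skipped.
        if not stripped or not stripped[0].isdigit():
            continue
        for token in stripped.split(","):
            token = token.strip()
            if token:
                # Simplification: each number is assumed to be one byte.
                out.append(parse_c_number(token))
    return bytes(out)


header = """unsigned char greeting[] = {
  0x48, 0145,
  108, 0x6C, 0157
};
"""
recovered = read_dump(header)
```

Here the declaration line and the closing brace are skipped, and the mixed hexadecimal, octal and decimal constants are all accepted.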
A few pseudo-surfaces exist to generate debugging data out of thin air.
These surfaces are only meant for the expert recode user, and are
only useful in a few contexts, like for generating binary permutations
from the recoding or acting on them.
Debugging surfaces, when removed, insert their generated data at the beginning of the output stream, and copy all the input stream after the generated data, unchanged. This strange removal constraint comes from the fact that debugging surfaces are usually specified in the before position instead of the after position within a request. With debugging surfaces, one often recodes file `/dev/null' in filter mode. Specifying many debugging surfaces at once has an accumulation effect on the output, and since surfaces are removed from right to left, each generating its data at the beginning of previous output, the net effect is an impression that debugging surfaces are generated from left to right, each appending to the result of the previous. In any case, any real input data gets appended after what was generated.
test7
test8
test15
UCS-2 values, like all codes from the surrogate UCS-2 area (for UTF-16),
the byte order mark, and values known as invalid UCS-2.
test16
For example, the command `recode l5/test8..dump < /dev/null' is a
convoluted way to produce an output similar to `recode -lf l5'. It says
to generate all possible 256 bytes and interpret them as ISO-8859-9
codes, while converting them to UCS-2. The resulting UCS-2
characters are dumped one per line, each accompanied by its descriptive name.
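The same idea can be imitated in Python for readers without recode at hand. This sketch is an assumption-laden analogue, not the tool's output: it relies on Python's iso8859_9 codec and on the character names of the unicodedata module, and the row layout and function name are hypothetical.

```python
import unicodedata


def describe_charset(codec="iso8859_9"):
    """List the Unicode code point and name for each of the 256 byte values."""
    rows = []
    for value in range(256):
        # Interpret the single byte as a character of the given charset.
        char = bytes([value]).decode(codec)
        # Control characters have no Unicode name, hence the default.
        name = unicodedata.name(char, "(no name)")
        rows.append(f"{ord(char):04X}  {name}")
    return rows


rows = describe_charset()
```

Each element of rows pairs a four-digit hexadecimal code point with a name, one entry per generated byte.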