This document tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context. It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, ISO 646, ISO 8859 (ISO Latin, especially ISO Latin 1), Windows character set, ISO 10646 (UCS), Unicode, UTF-8, UTF-7, MIME, and QP are used as examples. This document in itself does not contain solutions to practical problems with character codes; rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.
This document is also available in Postscript form (generated by html2ps).
Eurodicautom contains translations and definitions for several technical terms related to character sets (e.g. glyph, encoding).
In computers and in data transmission between them, i.e. in digital data processing and transfer, data is internally presented as octets, as a rule. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here.
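As a small illustration of the notations just mentioned, the following Python sketch writes one octet value in decimal, octal, and hexadecimal form. The value 228 is picked only as an example.

```python
# The same octet value written in the three notations mentioned above.
value = 228

decimal_form = str(value)          # base 10
octal_form = format(value, "o")    # base 8
hex_form = format(value, "X")      # base 16

# Parsing the notations back gives the same number.
assert int(octal_form, 8) == int(hex_form, 16) == value
print(decimal_form, octal_form, hex_form)   # 228 344 E4
```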
Different conventions can be established regarding how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit which presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.
In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers.
Previously ASCII encoding was usually assumed by default (and it is still widely used). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.
The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is often confusing.
Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:
The phrase character set is used in a variety of meanings. Often it denotes just a character repertoire but it may also refer to a character code or even to a character encoding. See, for example OII's document Character Set Standards, which mentions several standards, some of which define a character code while others also specify a fixed encoding.
Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire.
The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.
In fact, the definition of ASCII also defines a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~
There are actually several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.
The international standard ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @[\]{|} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII.
A large number of "national variants of ASCII" have been defined, assigning different letters and symbols to those positions. Thus, the characters which appear in those positions - including those in US-ASCII - are somewhat "unsafe" in international data transfer, although this problem is losing significance: the trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own, unique and universal code positions. But old software and devices may still reflect various "national variants of ASCII".
The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.
| dec | oct | hex | glyph | official Unicode name | National variants |
|---|---|---|---|---|---|
| 35 | 43 | 23 | # | number sign | # £ Ù |
| 36 | 44 | 24 | $ | dollar sign | ¤ |
| 64 | 100 | 40 | @ | commercial at | É § Ä à ³ |
| 91 | 133 | 5B | [ | left square bracket | Ä Æ ° â ¡ ÿ é |
| 92 | 134 | 5C | \ | reverse solidus | Ö Ø ç Ñ ½ ¥ |
| 93 | 135 | 5D | ] | right square bracket | Å Ü § ê é ¿ |
| 94 | 136 | 5E | ^ | circumflex accent | Ü î |
| 95 | 137 | 5F | _ | low line | è |
| 96 | 140 | 60 | ` | grave accent | é ä µ ô ù |
| 123 | 173 | 7B | { | left curly bracket | ä æ é à ° ¨ |
| 124 | 174 | 7C | \| | vertical line | ö ø ù ò ñ f |
| 125 | 175 | 7D | } | right curly bracket | å ü è ç ¼ |
| 126 | 176 | 7E | ~ | tilde | ü ¯ ß ¨ û ì ´ |
Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Many systems which support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is not always portable from one system or application to another.
More information about national variants and their impact:
The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~).
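The consecutive assignment can be checked with a few lines of Python. This is only a sketch; it relies on the fact that Python's `chr` and `ord` work with Unicode code values, which coincide with ASCII code values in the range 0 - 127.

```python
# Printable ASCII characters occupy code values 32 through 126, in order.
printable_ascii = "".join(chr(code) for code in range(32, 127))

print(printable_ascii)
print(len(printable_ascii))     # 95 characters, the blank included
print(ord(" "), ord("~"))       # 32 126
print(ord("A"), ord("a"))       # 65 97
```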
The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value. Notice that code numbers from 0 to 31 and code numbers from 127 upwards do not correspond to any printable character. (Code numbers 0 - 31 and 127 are reserved for control purposes. They have standardized names, but in fact their usage varies a lot.)
Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
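To make the parity remark concrete, here is a minimal Python sketch of even parity, where the most significant bit is set so that each octet has an even number of 1 bits. The function names are made up for this illustration, and even parity is just one possible convention.

```python
def add_even_parity(code: int) -> int:
    """Return an octet whose top bit makes the total number of 1 bits even."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") % 2   # 1 if the code has an odd number of 1 bits
    return code | (parity << 7)

def strip_parity(octet: int) -> int:
    """Recover the 7-bit ASCII code from an octet that carries a parity bit."""
    return octet & 0x7F

octet_a = add_even_parity(ord("A"))   # 65 has two 1 bits, so the top bit stays 0
octet_c = add_even_parity(ord("C"))   # 67 has three 1 bits, so the top bit is set
print(octet_a, octet_c, chr(strip_parity(octet_c)))   # 65 195 C
```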
Sometimes the phrase "8-bit ASCII" is used. It is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.
The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.
In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Notes:
See also: The ISO Latin 1 character repertoire - a description with usage notes.
The MIME specification defines, among many other things, the general purpose "Quoted-Printable" (QP) encoding which can be used to present any sequence of octets as a sequence of such octets which correspond to ASCII characters. This implies that the sequence of octets becomes longer, and if it is read as an ASCII string, it can be incomprehensible to humans. But what is gained is robustness in data transfer, since the encoding uses only "safe" ASCII characters which will most probably get through any component in the transfer unmodified.
Basically, QP encoding means that most octets smaller than 128 are used as such, whereas larger octets and some of the small ones are presented as follows: octet n is presented as a sequence of three octets, corresponding to ASCII codes for the = sign and the two digits of the hexadecimal notation of n.
If QP encoding is applied to a sequence of octets presenting character data according to ISO 8859-1 character code, then effectively this means that most ASCII characters (including all ASCII letters) are preserved as such whereas e.g. the ISO 8859-1 character ä (code position 228 in decimal, E4 in hexadecimal) is encoded as =E4. (For obvious reasons, the equals sign = itself is among the few ASCII characters which are encoded. Being in code position 61 in decimal, 3D in hexadecimal, it is encoded as =3D.)
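The effect can be tried out with Python's standard `quopri` module. The sample text is made up for this sketch; the point is only that ä turns into =E4, the equals sign into =3D, and ASCII letters pass through unchanged.

```python
import quopri

text = "Bär = bear"
octets = text.encode("iso-8859-1")       # one octet per character
encoded = quopri.encodestring(octets)    # octets corresponding to ASCII characters only

print(encoded)
# Decoding gives back exactly the original octet sequence.
assert quopri.decodestring(encoded) == octets
```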
Notice that encoding ISO 8859-1 data this way means that the character code is the one specified by the ISO 8859-1 standard, whereas the character encoding is different from the one specified (or at least suggested) in that standard. Since QP only specifies the mapping of a sequence of octets to another sequence of octets, it is a pure encoding and can be applied to any character data, or to any data for that matter.
Naturally, Quoted-Printable encoding needs to be processed by a program which knows it and can convert it to human-readable form. Roughly speaking, one can expect most E-mail programs to be able to handle QP, but the same does not apply to newsreaders (or Web browsers). Therefore, you should normally use QP in E-mail only.
In ISO 8859-1, code positions 128 - 159 are explicitly reserved for control purposes; they "correspond to bit combinations that do not represent graphic characters". This means that the so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) is not identical with ISO 8859-1. It is, however, true that the Windows character set is much more similar to ISO 8859-1 than the so-called DOS character sets are.
In Windows, some positions in the range 128 - 159 are assigned to printable characters, such as em dash, en dash, and trademark symbol. Thus, the character repertoire is larger than ISO Latin 1. The use of octets in the range 128 - 159 in any data to be processed by a program which expects ISO 8859-1 encoded data may have any effect. In the worst case, they are interpreted as control characters. See my document On the use of some MS Windows characters in HTML for a discussion of the problems of using these characters.
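The difference is easy to observe by decoding the same octets in both ways. The following Python sketch uses Python's codec names `cp1252` and `iso-8859-1`; the three octets were chosen to match the dashes and the trademark symbol mentioned above.

```python
import unicodedata

octets = bytes([0x96, 0x97, 0x99])

as_windows = octets.decode("cp1252")       # printable characters
as_latin1 = octets.decode("iso-8859-1")    # control characters

for ch in as_windows:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
for ch in as_latin1:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))   # 'Cc' means control character
```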
The Windows character set exists in different variations, or "code pages". What we have discussed here is the most usual one, resembling ISO 8859-1. Its name as officially registered at IANA is ISO-8859-1-Windows-3.1-Latin-1 but the name windows-1252 is probably more widely used.
There are several character codes which are extensions to ASCII in the same sense as ISO 8859-1 and the Windows character set.
ISO 8859-1 itself is just a member of the ISO 8859 family of character codes, which extend the ASCII repertoire in different ways with different special characters (used in different languages and cultures). Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of western (and northern) Europe, there is ISO 8859-2 alias ISO Latin 2 constructed similarly for languages of central/eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 - 127 contain the same character as in ASCII, positions 128 - 159 are unused (reserved for control characters), and positions 160 - 255 are the varying part, used differently in different members of the ISO 8859 family.
Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. And in practice, ISO 8859-15 alias ISO Latin 9 (!) will probably replace ISO 8859-1 to a great extent, since it contains the politically important symbol for euro.
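A quick way to see both the shared part and the varying part of the family is to decode the same octets with several members. This Python sketch uses Python's codec names; the octets are arbitrary examples, one from the ASCII range and two from the range 160 - 255.

```python
octets = bytes([0x41, 0xA4, 0xB1])

decoded = {}
for codec in ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_15"):
    decoded[codec] = octets.decode(codec)
    print(codec, decoded[codec])

# Position 65 is the letter A everywhere; position 164 is the currency sign
# in ISO 8859-1 but the euro sign in ISO 8859-15.
```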
The following table lists the ISO 8859 alphabets, with links to more detailed descriptions. There is a separate document Coverage of European languages by ISO Latin alphabets which you might use to determine which (if any) of the alphabets are suitable for a document in a given language or combination of languages. My other material on ISO 8859 contains a combined character table, too.
| standard | name of alphabet | characterization |
|---|---|---|
| ISO 8859-1 | Latin alphabet No. 1 | "Western", "West European" |
| ISO 8859-2 | Latin alphabet No. 2 | "Central European", "East European" |
| ISO 8859-3 | Latin alphabet No. 3 | "South European"; "Maltese & Esperanto" |
| ISO 8859-4 | Latin alphabet No. 4 | "North European" |
| ISO 8859-5 | Latin/Cyrillic alphabet | (for Slavic languages) |
| ISO 8859-6 | Latin/Arabic alphabet | (for the Arabic language) |
| ISO 8859-7 | Latin/Greek alphabet | (for modern Greek) |
| ISO 8859-8 | Latin/Hebrew alphabet | (for Hebrew and Yiddish) |
| ISO 8859-9 | Latin alphabet No. 5 | "Turkish" |
| ISO 8859-10 | Latin alphabet No. 6 | "Nordic" (Sámi, Inuit, Icelandic) |
| ISO 8859-11 | Latin/Thai alphabet | (for the Thai language) |
| (Part 12 has not been defined.) | | |
| ISO 8859-13 | Latin alphabet No. 7 | Baltic Rim |
| ISO 8859-14 | Latin alphabet No. 8 | Celtic |
| ISO 8859-15 | Latin alphabet No. 9 | "euro" |
Notes: ISO 8859-n is Latin alphabet no. n for n=1,2,3,4, but this correspondence is broken for the other Latin alphabets. ISO 8859-11 and ISO 8859-15 are at various draft stages; see Standards of JTC 1 / SC 2 for current status.
In addition, there are other extensions to ASCII which utilize the code range 0 - 255 ("8-bit ASCII codes"), such as
Notice that many of these are very different from ISO 8859-1. They may have different character repertoires, and the same character often has different code values in different codes. For example, code position 228 is occupied by ä (letter a with dieresis, or umlaut) in ISO 8859-1, by ð (Icelandic letter eth) in HP's Roman-8, by õ (letter o with tilde) in DOS code page 850, and by ‰ (per mille sign) in Macintosh character code.
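The following Python sketch decodes the single octet 228 according to several 8-bit codes. It assumes that Python's codecs `hp_roman8`, `cp850`, and `mac_roman` are adequate stand-ins for the codes named above.

```python
octet = bytes([228])

# One octet, several characters, depending on the character code assumed.
for codec in ("iso-8859-1", "hp_roman8", "cp850", "mac_roman"):
    character = octet.decode(codec)
    print(f"{codec:12} U+{ord(character):04X} {character}")
```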
For information about several code pages, see Code page &Co. by Roman Czyborra. See also his excellent description of various Cyrillic encodings, such as different variants of KOI-8; most of them are extensions to ASCII, too.
In general, full conversions between the character codes mentioned above are not possible. For example, the Macintosh character repertoire contains the Greek letter pi, which does not exist in ISO Latin 1 at all. Naturally, a text can be converted (a simple program which uses a conversion table) from Macintosh character code to ISO 8859-1 if the text contains only those characters which belong to the ISO Latin 1 character repertoire. Text presented in Windows character code can be used as such as ISO 8859-1 encoded data if it contains only those characters which belong to the ISO Latin 1 character repertoire.
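A table-driven conversion of this kind might look like the following in Python, where the built-in codecs play the role of the conversion tables. The function name is invented for this sketch, and `mac_roman` is assumed to correspond to the Macintosh character code meant here.

```python
def mac_to_latin1(octets: bytes) -> bytes:
    """Convert Macintosh-coded octets to ISO 8859-1, or raise UnicodeEncodeError."""
    return octets.decode("mac_roman").encode("iso-8859-1")

convertible = "Käse".encode("mac_roman")
print(mac_to_latin1(convertible))          # b'K\xe4se'

not_convertible = "π".encode("mac_roman")  # pi exists in the Macintosh repertoire
try:
    mac_to_latin1(not_convertible)
    conversion_failed = False
except UnicodeEncodeError as err:
    conversion_failed = True
    print("cannot convert:", err.reason)

# Windows-coded text within the ISO Latin 1 repertoire is octet-for-octet the same.
print("Käse".encode("cp1252") == "Käse".encode("iso-8859-1"))   # True
```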
ISO 10646 is an international standard (by ISO) which defines UCS, Universal Character Set, which is a very large character repertoire and a character code for it. Currently it contains 38 885 characters. It is an extension of the ISO Latin 1 character repertoire and the associated character code in the same sense as ISO Latin 1 is an extension of ASCII. For a list of the character blocks in the repertoire, with examples of some of them, see the document UCS (ISO 10646, Unicode) character blocks.
Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code identical with ISO 10646 and an encoding for it. In practice, people usually talk about Unicode rather than ISO 10646, partly because Unicode is more explicit about the meanings of characters, partly because Unicode charts of characters are available on the Web. (Unicode version 1.0 used somewhat different names for some characters than ISO 10646. In the current Unicode version, 2.0, the names have been made the same as in ISO 10646.)
Key resources on ISO 10646 and Unicode:
The Unicode encoding presents each code number as two consecutive octets m and n so that the number equals 256m+n. This means, to express it in computer jargon, that the code number is presented as a two-byte integer. This is a very obvious and simple encoding. However, it can be inefficient in terms of the number of octets needed. If we have normal English text or other text which contains ISO Latin 1 characters only, the length of the Unicode encoded octet sequence is twice the length of the string in ISO 8859-1 encoding.
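The two-octet presentation can be sketched in Python as follows. The function name is made up, and the comparison with Python's `utf-16-be` codec is an assumption that holds only for code numbers below 65536 (and outside the range that UTF-16 reserves for surrogates).

```python
def two_octets(code: int) -> bytes:
    """Present a code number below 65536 as octets m, n with code == 256*m + n."""
    m, n = divmod(code, 256)
    return bytes([m, n])

print(two_octets(ord("ä")))                       # b'\x00\xe4'
assert two_octets(ord("ä")) == "ä".encode("utf-16-be")

# Text within the ISO Latin 1 repertoire doubles in length.
text = "plain English text"
print(len(text.encode("iso-8859-1")), len(text.encode("utf-16-be")))   # 18 36
```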
It is somewhat debatable whether Unicode defines an encoding or just a character code. However, it refers to code values being presentable as 16-bit integers, and it seems to imply the corresponding two-octet representation. In principle, Unicode requires that "Unicode values can be stored in native 16-bit machine words" and "does not specify any order of bytes inside a Unicode value". Thus, it allows "little-endian" presentation where the least significant byte precedes the most significant byte, if agreed on by higher-level protocols.
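The two byte orders can be compared with Python's `utf-16-be` and `utf-16-le` codecs, which are used here as stand-ins for the big-endian and little-endian presentations of 16-bit Unicode values.

```python
text = "ä"   # code value 228, i.e. 00E4 in hexadecimal

big_endian = text.encode("utf-16-be")       # most significant byte first
little_endian = text.encode("utf-16-le")    # least significant byte first

print(big_endian.hex(), little_endian.hex())   # 00e4 e400

# Reading octets with the wrong byte order yields a different character.
misread = big_endian.decode("utf-16-le")
print(f"U+{ord(misread):04X}")                 # U+E400
```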
ISO 10646 can be, and often is, encoded in other ways, too, such as the following encodings:
In the future, ISO 10646 may be extended to a character set so large that the Unicode encoding cannot be applied as such to it. But even in its current state, the ISO 10646 character repertoire can be regarded as a superset of most character repertoires in use. That is, practically all character repertoires can be regarded as subsets of the ISO 10646 repertoire. However, the code positions of characters vary from one character code to another.
The implementation of Unicode support is a long and mostly gradual process. It should be noted that Unicode can be supported by programs on any operating system, although some systems may allow much easier implementation than others; this mainly depends on whether the system uses Unicode internally so that support for Unicode is "built-in".
Even in circumstances where Unicode is supported in principle, the support usually does not cover all Unicode characters. For example, a font available may cover just some part of Unicode which is practically important in some area. On the other hand, for data transfer it is essential to know which Unicode characters the recipient is able to handle. For such reasons, various subsets of the Unicode character repertoire have been and will be defined. For example, the Minimum European Subset specified by ENV 1973:1995 is intended to provide a first step towards the implementation of large character sets in Europe.
Unicode characters are often referred to using a notation of the form U+nnnn where nnnn is a four-digit hexadecimal notation of the code value. For example, U+0020 means the space character (with code value 20 in hexadecimal, 32 in decimal). Notice that such notations identify a character through its Unicode code value, without referring to any particular encoding.
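Converting between a character and its U+nnnn notation takes only a few lines of Python. The helper names below are invented for this sketch.

```python
def u_notation(character: str) -> str:
    """Return the U+nnnn notation for a single character."""
    return f"U+{ord(character):04X}"

def from_u_notation(notation: str) -> str:
    """Return the character identified by a U+nnnn notation."""
    if not notation.startswith("U+"):
        raise ValueError("expected a notation of the form U+nnnn")
    return chr(int(notation[2:], 16))

print(u_notation(" "))             # U+0020
print(u_notation("ä"))             # U+00E4
print(from_u_notation("U+0041"))   # A
```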
Olle Järnefors has written A short overview of ISO/IEC 10646 and Unicode, which is very readable and informative. (It also contains a more detailed technical description of the UTF encodings than those given above.)
The character concept is very fundamental for the issues discussed here but difficult to define exactly. The more fundamental concepts we use, the harder it is to give good definitions. (How would you define "life"? Or "structure"?) Here we will concentrate on clarifying the character concept by indicating what it does not imply.
The Unicode standard describes characters as "the smallest components of written language that have semantic value", which is somewhat misleading. A character such as a letter can hardly be described as having a meaning (semantic value) in itself. Moreover, a character such as ú (letter u with acute accent), which belongs to Unicode, can often be regarded as consisting of smaller components: a letter and a diacritic. And in fact the very definition of the character concept in Unicode is the following:
abstract character: a unit of information used for the organization, control, or representation of textual data.
(In Unicode terminology, "abstract character" is a character as an element of a character repertoire, whereas "character" refers to "coded character representation", which effectively means a code value. Confusing, isn't it?)
It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape which a character may have when rendered or displayed. For example, the character Z might be presented as a boldface Z or as an italic Z, and it would still be a presentation of the same character. On the other hand, lower-case z is defined to be a separate character - which in turn may have different glyph presentations.
This is ultimately a matter of definition: a definition of a character repertoire specifies the "identity" of characters, among other things. One could define a repertoire where uppercase Z and lowercase z are just two glyphs for the same character. On the other hand, one could define that italic Z is a character different from normal Z, not just a different glyph for it. In fact, in Unicode for example there are several characters which could be regarded as typographic variants of letters only, but for various reasons Unicode defines them as separate characters. For example, mathematicians use a variant of letter N to denote the set of natural numbers (0, 1, 2, ...), and this variant is defined as being a separate character ("double-struck capital N") in Unicode. There are some more notes on the identity of characters below.
When a character repertoire is defined (e.g. in a standard), some particular glyph is often used to describe the appearance of each character, but this should be taken as an example only. The Unicode standard specifically says (in section 3.2) that great variation is allowed between "representative glyph" appearing in the standard and a glyph used for the corresponding character:
Consistency with the representative glyph does not require that the images be identical or even graphically similar; rather, it means that both images are generally recognized to be representations of the same character. Representing the character U+0061 Latin small letter a by the glyph "X" would violate its character identity.
For more information on the glyph concept, consult the document Character/Glyph Model.
A repertoire of glyphs comprises a font. It is possible that a font which is used for the presentation of some character repertoire does not contain a different glyph for each character. For example, although characters such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha are regarded as distinct characters (with distinct code values) in Unicode, a particular font might contain just one A which is used to present all of them. (For information about fonts, there is a very large comp.font FAQ, but it does not seem to be actively maintained.)
You should never use a character just because it "looks right" or "almost right". Characters with quite different purposes and meanings may well look similar, or almost similar, in some fonts at least. Using a character as a surrogate for another for the sake of apparent similarity may lead to great confusion. Consider, for example, the so-called sharp s (es-zed), which is used in the German language. Some people who have noticed such a character in the ISO Latin 1 repertoire have thought "wow, here we have the beta character!". In many fonts, the sharp s (ß) really looks more or less like the Greek lowercase beta character (β). But it must not be used as a surrogate for beta. You wouldn't get very far with it, really; what's the big idea of having beta without alpha and all the other Greek letters? More seriously, the use of sharp s in place of beta would confuse text searches, spelling checkers, indexers, etc.; an automatic converter might well turn sharp s into ss; and some font might present sharp s in a manner which is very different from beta.
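How a program "sees" the two characters can be illustrated with Python's standard `unicodedata` module. This is only a sketch; Python's uppercase conversion is used here as one example of an automatic conversion that treats sharp s quite differently from beta.

```python
import unicodedata

sharp_s = "\u00DF"   # ß
beta = "\u03B2"      # β

for ch in (sharp_s, beta):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(sharp_s == beta)                  # False: a search for beta never finds sharp s
print(sharp_s.upper(), beta.upper())    # SS Β
```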
For some more explanations on this, see section Why should we be so strict about meanings of characters? in The ISO Latin 1 character repertoire - a description with usage notes.
There are some finer points regarding the identity of characters. They are less severe than the above-mentioned abuse, but perhaps still worth mentioning here.
The definition of a character repertoire defines the identity of characters. For instance, the ASCII repertoire has a character called "hyphen". It is also used as a minus sign (as well as a substitute for a dash, since ASCII contains no dashes). Thus, that ASCII character is a generic, multipurpose character, and one can say that in ASCII hyphen and minus are identical. But in Unicode, there are distinct characters named "hyphen" and "minus sign" (as well as different dash characters). For compatibility, the old ASCII character is preserved in Unicode, too (in the old code position, with the name "hyphen-minus").
Similarly, as a matter of definition, Unicode defines characters for the micro prefix, n-ary product, etc., as distinct from the Greek letters (small mu, capital pi, etc.) they originate from. This is a logical distinction and does not necessarily imply that different glyphs are used. The distinction is important e.g. when textual data in digital form is processed by a program (which "sees" the code values, through some encoding, and not the glyphs at all). Notice that Unicode does not make any distinction e.g. between the lowercase Greek letter pi and the mathematical symbol pi denoting the well-known constant 3.14159... (i.e. there is no separate symbol for the latter); for the ohm sign, there is a specific character (in the Symbols Area), but it is defined as being equivalent to Greek letter capital omega (i.e. there are two separate characters but they are equivalent). On the other hand, it makes a distinction between the uppercase Greek letter pi and the mathematical symbol for "n-ary product". (If you think this doesn't sound quite logical, you are not the only one to think so.)
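The distinctions discussed above can be inspected with Python's `unicodedata` module. The sketch below makes one assumption that goes beyond the text: it takes "equivalent" to mean canonical equivalence, which normalization to form NFC makes visible.

```python
import unicodedata

samples = "\u002D\u2010\u2212\u00B5\u03BC\u220F\u03A0\u2126\u03A9"
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

ohm, omega = "\u2126", "\u03A9"
micro, mu = "\u00B5", "\u03BC"

# The ohm sign is a separate character, but normalization replaces it with omega.
print(ohm == omega, unicodedata.normalize("NFC", ohm) == omega)   # False True
# The micro sign stays distinct from Greek mu under the same normalization.
print(unicodedata.normalize("NFC", micro) == mu)                   # False
```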
Typing characters on a computer may appear deceptively simple: you press a key labeled "A", and the character "A" appears on the screen. You also expect "A" to be included into a disk file when you save what you are typing, you expect "A" to appear on paper if you print your text, and you expect "A" to be sent if you send your product by E-mail or something like that.
Thus far, you should have learned that the presentation of a character in computer storage or disk or in data transfer may vary a lot. You have probably realized that especially if it's not the common "A" but something more special (say, an "A" with an accent), strange things might happen, especially if data is not accompanied with adequate information about its encoding.
But you might still be too confident. You probably expect that on your system at least things are simpler than that. If you use your very own very personal computer and press the key labeled "A" on its keyboard, then shouldn't it be evident that in its storage and processor, on its disk, on its screen it's invariably "A"? Can't you just ignore its internal character code and character encoding? Well, probably yes - with "A". I wouldn't be so sure about "Ä", for instance.
When you press a key on your keyboard, then what actually happens is this. The keyboard sends the code of a character to the processor. The processor then, in addition to storing the data internally somewhere, normally sends it to the display device. Now, the keyboard settings and the display settings might be different from what you expect. Even if a key is labeled "Ä", it might send something else than the code of "Ä" in the character code used in your computer. Similarly, the display device, upon receiving such a code, might be set to display something different. Such mismatches are usually undesirable, but they are definitely possible.
Moreover, there are often keyboard restrictions. If your computer uses internally, say, ISO Latin 1 character repertoire, you probably won't find keys for all 191 characters in it on your keyboard.
Therefore, you often need program or operating system specific ways of entering characters from a keyboard, either because there is no key for a character you need or there is but it does not work (properly). Two important examples of such ways:
It is often possible to use various "escape" notations for characters. This somewhat vague term means notations which are converted to characters according to some specific rules by some programs. For example, in the HTML language one can use the notation &Auml; or, equivalently, the notation &#196; for the character Ä. To take another example, in the C programming language, one can usually write \304 to denote Ä within a string constant, although this makes the program character code dependent.
In cases like these, the character itself does not occur in a file (such as an HTML document or a C source program); instead, the file contains the "escape" notation as a character sequence, which will then be interpreted in a specific way by programs like a Web browser or a C compiler.
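The interpretation step can be imitated in Python. In this sketch the standard `html` module stands in for a Web browser resolving the HTML notations `&Auml;` and `&#196;`, and a Python bytes literal with the octal escape `\304` stands in for a C string constant; neither is meant as a description of how any particular browser or compiler works.

```python
import html

# The file contains only ASCII characters; the program interprets the notation.
named = html.unescape("&Auml;")
numeric = html.unescape("&#196;")
print(named, numeric, named == numeric)   # Ä Ä True

# An octal escape denotes an octet value (304 octal is 196 decimal).
octet = b"\304"
print(octet[0])                           # 196
# Whether that octet means Ä depends on the character code assumed.
print(octet.decode("iso-8859-1"), octet.decode("cp850"))
```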
It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated.
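As a small demonstration, the following Python sketch interprets one and the same two-octet sequence according to four different encodings. The octets are an arbitrary example, and the codec names are Python's.

```python
octets = bytes([0xC3, 0xA4])

interpretations = {}
for codec in ("iso-8859-1", "cp850", "utf-8", "utf-16-be"):
    interpretations[codec] = octets.decode(codec)
    print(f"{codec:12} {len(interpretations[codec])} character(s): {interpretations[codec]}")

# Two characters in the one-octet codes, a single character in the others.
```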
Naturally, such a sequence could be intended to present other than character data, too; it could be an image in a bitmap format, or a computer program in binary form, or numeric data in the internal format used in computers.
This problem can be handled in different ways in different systems when data is stored and processed within one computer system. For data transmission, a platform-independent method of specifying the encoding and other relevant information is needed. Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm. Attaching a human-readable note, such as a few words of explanation in an E-mail message body, is better than nothing. But since data is processed by programs which cannot understand such notes, the encoding should be specified in a standardized computer-readable form.
Internet media types,
often called
MIME media types,
can be used to specify a
major media type ("top level media type", such as text),
a subtype (such as html), and an encoding
(such as
iso-8859-1).
They were originally developed to allow sending other
than plain ASCII data by E-mail.
They can be (and should be) used for specifying the encoding when
data is sent over a network, e.g. by E-mail or using the
HTTP protocol
on the World Wide Web.
The media type concept is defined in
RFC 2046.
The procedure for registering types is given in
RFC 2048;
according to
it, the registry is kept
by
IANA
at
ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/
For less authoritative but more readably presented information on media types, see e.g. the document MIME Types by Chris Herborth.
The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset".
Specifically, when data is sent in MIME format, the media type
and encoding are specified in a manner illustrated by the
following example:
Content-Type: text/html; charset=iso-8859-1
This specifies, in addition to saying that the media type is
text and subtype is html, that the
character encoding is ISO 8859-1.
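How receiving software might pick these pieces apart can be sketched as follows. This is only an illustration using Python's standard `email` package; the message text, including its one-line body, is invented for the example.

```python
from email import message_from_string

# A made-up message: one header, a blank line, then the body.
raw = "Content-Type: text/html; charset=iso-8859-1\n\n<p>body</p>\n"
msg = message_from_string(raw)

media_type = msg.get_content_type()      # major type and subtype together
maintype = msg.get_content_maintype()    # the major media type
subtype = msg.get_content_subtype()      # the subtype
charset = msg.get_content_charset()      # the "charset", i.e. the encoding

print(media_type, maintype, subtype, charset)
```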
The official registry of "charset" (i.e., character encoding)
names,
with references to documents defining their meanings,
is kept by
IANA
at
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Several character encodings have alternate names in the registry. For example, the basic (ISO 646) variant of ASCII can be called "ASCII" or "ANSI_X3.4-1968" or "cp367" (plus a few other names); the preferred name in MIME context is, according to the registry, "US-ASCII". Similarly, ISO 8859-1 has several names, the preferred MIME name being "ISO-8859-1". Unicode encoding is named "ISO-10646-UCS-2".
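Software usually has to map such alternate names to one internal name. As a hedged illustration, the sketch below uses the alias table built into Python's `codecs` module. That table is Python's own and is not the IANA registry, although it happens to recognize the names mentioned above; the internal names it reports are Python conventions, not MIME names.

```python
import codecs

# Several registry names for the same encoding.
ascii_names = ["US-ASCII", "ASCII", "ANSI_X3.4-1968", "cp367"]
resolved = {name: codecs.lookup(name).name for name in ascii_names}

# Two spellings of the ISO 8859-1 name resolve to one codec as well.
latin1_a = codecs.lookup("ISO-8859-1").name
latin1_b = codecs.lookup("latin-1").name

print(resolved, latin1_a, latin1_b)
```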
The Content-Type information
is an example of information in a
header.
A header relates to some data, describing its presentation and
other things, but is passed as logically separate from it.
Possible headers and their contents are defined in
the basic MIME specification,
RFC 2045.
Adequate headers
should normally be
generated automatically by the software which sends the data
(such as a program
for sending E-mail,
or a Web server) and interpreted automatically
by receiving software (such as a program for reading E-mail,
or a Web browser).
In E-mail messages, headers precede the message body; it depends
on the E-mail program whether and how it displays the headers.
For Web documents, a Web server is required to send headers when
it delivers a document to a browser (or other user agent) which
has sent a request for the document.
Basically, MIME should let people communicate smoothly without hindrances caused by character code and encoding differences. MIME should handle the necessary conversions automatically and invisibly.
For example, when person A sends E-mail to person B, the following should happen: The E-mail program used by A encodes A's message in some particular manner, probably according to some convention which is normal on the system where the program is used (such as ISO 8859-1 encoding on a typical modern Unix system). The program automatically includes information about this encoding in an E-mail header, which is usually invisible both when sending and when reading the message. The message, with the headers, is then delivered, through network connections, to B's system. When B uses his E-mail program (which may be very different from A's) to read the message, the program should automatically pick up the information about the encoding as specified in a header and interpret the message body according to it. For example, if B is using a Macintosh computer, the program would automatically convert the message into the Mac's internal character encoding and only then display it. Thus, if the message was ISO 8859-1 encoded and contained the Ä (upper case A with dieresis) character, encoded as octet 196, the E-mail program used on the Mac should use a conversion table to map this to octet 128, which is the encoding for Ä on Mac. (If the program fails to do such a conversion, strange things will happen. ASCII characters would be displayed correctly, since they have the same codes in both encodings, but instead of Ä, the character corresponding to octet 196 in Mac encoding would appear - a symbol which looks like f in italics.)
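The conversion step in this scenario can be sketched in a few lines. The fragment below is an illustration only; it assumes that Python's `mac_roman` codec corresponds to the Macintosh encoding discussed here, and the variable names are invented.

```python
# Octet 196 as received in an ISO 8859-1 encoded message.
received = bytes([196])

# Correct handling: interpret according to the declared encoding,
# then convert to the local (Macintosh) encoding.
text = received.decode("iso-8859-1")
mac_octets = text.encode("mac_roman")

# Faulty handling: the octet is displayed as if it were already
# in the Macintosh encoding.
unconverted = received.decode("mac_roman")

print(text, list(mac_octets), unconverted)
```

With conversion the letter arrives as octet 128; without it the reader sees the florin sign (U+0192), the "f in italics" mentioned above.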
Unfortunately, there are deficiencies and errors in software so that users often have to struggle with character code conversion problems, perhaps correcting the actions taken by programs.
A typical minor (!) problem which may occur in communication in Western European languages other than English is that most characters get interpreted and displayed correctly but some "national letters" don't. For example, the character repertoire needed in German, Swedish, and Finnish is essentially ASCII plus a few letters like "ä" from the rest of ISO Latin 1. If a text in such a language is processed so that a necessary conversion is not applied, or an incorrect conversion is applied, the result might be that e.g. the word "später" becomes "spter" or "spÌter" or "spdter" or "sp=E4ter".
Sometimes you might be able to guess what has happened,
and perhaps to determine which code conversion should be applied,
and apply it more or less "by hand".
To take an example (which may have some practical value in itself
to people using the languages mentioned):
Assume that you have some text data which is expected to be, say,
in German, Swedish or Finnish and which appears to be such text
with some characters replaced by oddities in a somewhat systematic
way. Locate some words which probably should contain "ä" but have
something strange in place of it (see examples above). Assume
further that the program you are using interprets text data
according to
ISO 8859-1
by default and that the actual data is
not accompanied with a suitable indication (like a
Content-Type header) of the encoding, or such an
indication is obviously in error. Now,
looking at what appears instead of "ä", we might guess:
| Seen instead of "ä" | Probable explanation |
|---|---|
| d | The original data was actually ISO 8859-1 encoded or something similar (e.g. Windows encoded) but during data transfer the most significant bit of each octet was lost. (Such things may happen in systems for transferring, or "gatewaying", data from one network to another. Sometimes it might be your terminal that has been configured to "mask out" the most significant bit!) This means that the octet representing "ä" in ISO 8859-1, i.e. 228, became 228 - 128 = 100, which is the ISO 8859-1 encoding of letter d. |
| { | Obviously, the data is in ASCII encoding so that the character "{" is used in place of "ä". Earlier it was common to use various national variants of ASCII, with characters #$@[\]^_`{\|}~ replaced by national characters according to the needs of a particular language. Thus they modified the character repertoire of ASCII by dropping out some special characters and introducing national characters into their ASCII code positions. It requires further study to determine the actual encoding used, since e.g. Swedish, German and Finnish ASCII variants all have "ä" as a replacement for "{", but there are differences in other replacements. |
| Ã¤ | The data is evidently in UTF-8 encoding. Notice that the characters Ã and ¤ stand here for octets 195 and 164, which might be displayed differently depending on program and device used. |
| Ì | The data is most probably in Roman-8 encoding (defined by Hewlett-Packard). |
| =E4 | The data is in Quoted-Printable encoding. The original encoding, upon which the QP encoding was applied, might be ISO 8859-1, or any other encoding which represents character "ä" in the same way as ISO 8859-1 (i.e. as octet 228 decimal, E4 hexadecimal). |
| &auml; | The data is in HTML format; the encoding may vary. The notation &auml; is a so-called character entity reference. |
| &#228; | The data is in HTML format; the encoding may vary. The notation &#228; is a so-called numeric character reference. (Notice that 228 is the code position for ä in Unicode.) |
| nothing | Perhaps the data was encoded in DOS encoding (e.g. code page 850), where the code for "ä" is 132. In ISO 8859-1, octet 132 is in the area reserved for control characters; typically such octets are not displayed at all, or perhaps displayed as blank. If you can access the data in binary form, you could find evidence for this hypothesis by noticing that octets 132 actually appear there. (For instance, the Emacs editor on Unix would display such an octet as \204, since 204 is the octal notation for 132.) If, on the other hand, it's not octet 132 but octet 138, then the data is most probably in Macintosh encoding. |
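Several of these symptoms can be reproduced deliberately, which may help in recognizing them. The following Python sketch is only an illustration. It assumes that the codecs named `hp_roman8`, `cp850` and `mac_roman` in Python's standard library correspond to the Roman-8, DOS and Macintosh encodings mentioned above, and the labels in the dictionary are made up.

```python
import quopri

word = "später"
latin1 = word.encode("iso-8859-1")

symptoms = {
    # Most significant bit of every octet lost in transfer.
    "high bit stripped": bytes(b & 0x7F for b in latin1).decode("ascii"),
    # UTF-8 data displayed by a program that assumes ISO 8859-1.
    "UTF-8 read as ISO 8859-1": word.encode("utf-8").decode("iso-8859-1"),
    # Roman-8 data displayed by a program that assumes ISO 8859-1.
    "Roman-8 read as ISO 8859-1": word.encode("hp_roman8").decode("iso-8859-1"),
    # Quoted-Printable data shown without decoding.
    "QP left undecoded": quopri.encodestring(latin1).decode("ascii"),
}

# Octet values of "ä" in the DOS and Macintosh encodings.
dos_octet = "ä".encode("cp850")[0]
mac_octet = "ä".encode("mac_roman")[0]

for label, shown in symptoms.items():
    print(label, "->", shown)
print(dos_octet, mac_octet)
```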
To illustrate what may happen
when text is sent in a grossly invalid form,
consider the following example.
I'm sending myself E-mail, using Netscape 4.0 (on Windows 95).
In the mail composition window, I set the encoding to
UTF-8.
The body of my message is simply
Tämä on testi.
(That's Finnish for 'This is a test'. The second and fourth
characters are the letter a with umlaut.)
Trying to read the mail on my Unix account, using the Pine E-mail
program (popular among Unix users), I see the following
(when in "full headers" mode; irrelevant headers omitted here):
X-Mailer: Mozilla 4.0 [en] (Win95; I)
MIME-Version: 1.0
To: Jukka.Korpela@hut.fi
Subject: Test
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7
Content-Transfer-Encoding: 7bit
[The following text is in the "x-UNICODE-2-0-UTF-7" character set]
[Your display is set for the "ISO-8859-1" character set]
[Some characters may be displayed incorrectly]
T+O6Q- on testi.
Interesting, isn't it?
I specifically requested
UTF-8
encoding, but Netscape used UTF-7.
And it did not include a correct header, since
x-UNICODE-2-0-UTF-7 is not a
registered "charset" name.
Even if the encoding had been a registered one, there would
have been no guarantee that my E-mail program would have been
able to handle the encoding.
The example, "T+O6Q-" instead of "Tämä", illustrates what may
happen when an octet sequence is interpreted according to another
encoding than the intended one.
In fact, it is difficult to say what Netscape was really doing,
since it seems to encode incorrectly;
a correct UTF-7 encoding for "Tämä" would be
"T+AOQ-m+AOQ-".
(The "+" and "-" characters correspond to bytes indicating a switch
to "shifted encoding" and back from it.
The shifted encoding is based on presenting
Unicode
values first
as 16-bit binary integers, then regrouping the bits and presenting
the resulting
six-bit
groups as octets according to a table
specified in
RFC 2045 in the section on Base64.
See also
RFC 2152.)
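The derivation of "+AOQ-" can be followed step by step in a short sketch. The Python fragment below is an illustration only: it builds the shifted sequence by hand from the 16-bit value of "ä" and compares the result with Python's own `utf-7` codec, which is assumed here to follow RFC 2152.

```python
import base64

# The 16-bit Unicode value of "ä" (228) as two octets: 0 and 228.
utf16 = "ä".encode("utf-16-be")

# Regroup the 16 bits into six-bit groups using the Base64 table;
# UTF-7 omits the "=" padding that Base64 would add.
shifted = base64.b64encode(utf16).rstrip(b"=")

# "+" switches to the shifted encoding and "-" switches back.
by_hand = b"T+" + shifted + b"-m+" + shifted + b"-"

# The same result from the library codec, and the way back.
by_codec = "Tämä".encode("utf-7")
round_trip = by_hand.decode("utf-7")

print(shifted, by_hand, by_codec, round_trip)
```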
Whenever text data is sent over a network, the sender and the recipient should have a joint agreement on the character encoding used. In the optimal case, this is handled by the software automatically, but in reality the users need to take some precautions.
Most importantly, make sure that any Internet software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: the header must be there and it must reflect the actual encoding used; and the encoding used must be one that is widely understood by the (potential) recipients' software. (One must often make compromises as regards the latter aim: you may need to use an encoding which is not yet widely supported to get your message through at all.)
Learn to use your Web browser, newsreader, and E-mail program
so that you can retrieve the encoding information for the page,
article, or message you are reading.
(For example, on Netscape use View Page Info;
on News Xpress, use
View Raw Format; on Pine, use h.)
If you use, say, Netscape to send E-mail or to post to Usenet news, make sure it sends the message in a reasonable form. In particular, make sure it does not duplicate the message by sending it both as plain text and as HTML (turn off the latter). As regards character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.
In particular, avoid sending data in a proprietary encoding (like the Macintosh encoding or a DOS encoding) to a public network. (At the very least, if you do that, make sure that the message header specifies the encoding!) There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding, by the sending program. If you cannot find a way to configure your program to do that, get another program.
As regards other forms of transfer of data in digital form, such as diskettes, information about encoding is important, too. The problem is typically handled by guesswork. Often the crucial thing is to know which program was used to generate the data, since the text data might be inside a file in, say, the MS Word format which can only be read by (a suitable version of) MS Word or by a program which knows its internal data format. That format, once recognized, might contain information which specifies the character encoding used in the text data included; or it might not, in which case one has to ask the sender, or make a guess, or use trial and error - viewing the data using different encodings until something sensible appears.
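The trial-and-error approach can be sketched as a small helper. The function name `try_decodings`, the sample data and the list of candidate encodings are all hypothetical choices made for this illustration; judging which result is "sensible" is still left to the human reader.

```python
def try_decodings(data, candidates):
    """Decode the same octets under each candidate encoding.

    Returns a dict from encoding name to decoded text, or to None
    when the octets are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Sample data of "unknown" origin: actually code page 850.
sample = "später".encode("cp850")
views = try_decodings(
    sample, ["ascii", "utf-8", "iso-8859-1", "cp850", "mac_roman"]
)

for name, text in views.items():
    print(name, "->", repr(text))
```

Encodings in which the octets are invalid drop out at once; among the rest, only one view gives a real word.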
I originally started writing this document as a tutorial for HTML authors. Later I noticed that this general information is extensive enough to be put into a document of its own. As regards HTML-specific problems, the document Using national and special characters in HTML summarizes what currently seems to be the best alternative in the general case.
Date of last update: 1998-12-04
Jukka Korpela, Jukka.Korpela@hut.fi
Acknowledgements. I have learned a lot about character set issues from the following people: Timo Kiravuo, Timo T. Tuhkanen, Alan J. Flavell, Arjun Ray, Roman Czyborra. (But any errors in this document I souped up by myself.)