On the use of some MS Windows characters in HTML

The so-called MS Windows character set contains, in addition to ISO Latin 1 (ISO 8859-1) characters, some special characters like em dash, trademark symbol, and asymmetric quote characters. A Web author who works in a Windows environment may not realize that by using such characters he creates problems for most users who don't use Windows, and possibly for some Windows users, too. Typically, if an author naively types a trademark symbol, a browser running on Unix or some other non-Windows system will probably display a blank instead of the trademark symbol, or something worse. This document explains this problem in some detail, outlines the long-term solutions, and gives some suggestions for short-term solutions.

You should not try to use any of the following characters in HTML documents:
baseline single and double quote, florin, ellipsis, dagger, double dagger, circumflex accent, permile,
S and s Hacek, left and right single guillemet, OE and oe ligature, left and right single and double quote,
bullet, endash, emdash, tilde accent, trademark ligature, Y Dieresis

Please don't get this wrong

There is nothing wrong with the characters discussed here. They have their legitimate uses, and they are, as characters, part of many other character repertoires too, such as Unicode and ISO 10646. The problem is that they cannot be presented in HTML reliably enough.

There is a very large number of useful characters in Unicode, but the great majority of them cannot be used in HTML documents with reasonably good success yet. The situation is improving, as the so-called internationalization of the Web proceeds and gets implemented. But currently we must live with rather limited character repertoires, unless we are willing to restrict accessibility radically. Compared with that, the need to use a hyphen instead of en dash, for example, is a rather small detail.

This document is not intended to make any judgment on the MS Windows systems themselves. And the characters discussed here can be used within and between Windows systems using the MS Windows character encoding.

The point is that on the World Wide Web, one should not expect that vendor-specific, system-dependent encodings work widely enough. And unfortunately, the standardized methods of presenting the characters under discussion do not work widely yet.

The main reason why the characters discussed here cause problems is that various attempts to present them create an illusion of working. When you create an HTML document and either consciously or unconsciously use, for example, the trademark symbol, you will probably see it right on your browser, and so will many others. But a large number of other people will see just a blank, or even have their display messed up by some control function. Although the trademark symbol probably looks somewhat better than the result of using a replacement (like HTML markup <SUP>(TM)</SUP>, which looks like the following on your current browser: (TM)), the gain is rather small as compared with the damage caused when the vendor-specific method of presenting the symbol does not work at all, i.e. information is lost. Of course, in some cases this might not matter so much while in others it can be quite serious (see the examples).

This is what you may get:

[This gas--argon--is inert.
See pp. 5-11.
Let's all use UNIX(TM)!
Coeur de filet.] (approx.)

This is what many others get:

[This gas argon is inert.
See pp. 5 11.
Let s all use UNIX !
C ur de filet.]

Naturally, the warnings equally apply to any cross-platform transfer of data. However, when data is transferred to a known system - instead of being made accessible from any platform - one can often use a suitable character code conversion program. For example, when transferring text data from Windows to Macintosh, one can handle some of the characters discussed here, if one correctly converts from the Windows encoding to the Mac encoding.

The characters

The following table lists the characters we are discussing, i.e. the Windows characters which are not ISO Latin 1 characters. The Windows and ISO 10646 names as well as code numbers are given, Windows code in decimal and ISO 10646 code in hexadecimal.

"Special" Windows characters and their ISO 10646 equivalents
Windows name of character  ISO 10646 name of character           Win  Unicode
baseline single quote      single low-9 quotation mark           130  201A
florin                     Latin small letter f with hook        131  0192
baseline double quote      double low-9 quotation mark           132  201E
ellipsis                   horizontal ellipsis                   133  2026
dagger                     dagger                                134  2020
double dagger              double dagger                         135  2021
circumflex accent          modifier letter circumflex accent     136  02C6
permile                    per mille sign                        137  2030
S Hacek                    Latin capital letter S with caron     138  0160
left single guillemet      single left-pointing angle quot. m.   139  2039
OE ligature                Latin capital ligature OE             140  0152
left single quote          left single quotation mark            145  2018
right single quote         right single quotation mark           146  2019
left double quote          left double quotation mark            147  201C
right double quote         right double quotation mark           148  201D
bullet                     bullet                                149  2022
endash                     en dash                               150  2013
emdash                     em dash                               151  2014
tilde accent               small tilde                           152  02DC
trademark ligature         trade mark sign                       153  2122
s Hacek                    Latin small letter s with caron       154  0161
right single guillemet     single right-pointing angle quot. m.  155  203A
oe ligature                Latin small ligature oe               156  0153
Y Dieresis                 Latin capital letter Y with diaeresis 159  0178
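
The table can also be written down as data and cross-checked mechanically. The following is a minimal sketch in Python, assuming that Python's built-in cp1252 codec is an adequate stand-in for what this document calls the Windows encoding. The names WINDOWS_TO_UCS and mismatches are made up for this illustration.

```python
# Windows code (decimal) -> ISO 10646 code position (hexadecimal),
# copied from the table above.
WINDOWS_TO_UCS = {
    130: 0x201A, 131: 0x0192, 132: 0x201E, 133: 0x2026,
    134: 0x2020, 135: 0x2021, 136: 0x02C6, 137: 0x2030,
    138: 0x0160, 139: 0x2039, 140: 0x0152, 145: 0x2018,
    146: 0x2019, 147: 0x201C, 148: 0x201D, 149: 0x2022,
    150: 0x2013, 151: 0x2014, 152: 0x02DC, 153: 0x2122,
    154: 0x0161, 155: 0x203A, 156: 0x0153, 159: 0x0178,
}


def mismatches(table=WINDOWS_TO_UCS, codec="cp1252"):
    """Return the Windows codes whose table entry disagrees with the codec."""
    return [win for win, ucs in table.items()
            if bytes([win]).decode(codec) != chr(ucs)]


if __name__ == "__main__":
    # An empty list means the table and the codec agree on every entry.
    print(mismatches())
```

The check only covers the codes listed in the table; it says nothing about the other positions in the range 128 - 159.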

Notes:

Where do these characters come from?

There are of course some reasons why the characters we are discussing were included in the "Windows character set" (as well as in some other character repertoires). People who need a character tend to use it if they can. And many people are accustomed to using programs like MS Word where a large character repertoire is available. They usually just use any way of inserting special characters they need. (On MS Windows systems, a rather universal way of inserting the characters under discussion is the so-called Alt-nnnn method.) Normally they are satisfied when they see the characters presented on paper. So far so good.

The problem is that the internal encoding of the characters can be interpreted in different ways if the data is transferred to or processed in different programs and systems. For instance, if you use on Windows Alt-0151 to insert an em dash into a file and that file is transferred, without conversion, to a Unix system, anything may happen. Unix systems typically use some ISO 8859 encoding nowadays, and that means that the octet (byte) with value 151 in decimal is in the range reserved for control characters. Problems may occur even if you don't transfer the file to a different computer. If you use e.g. the type command on the file at the DOS level, you will see something like ù (letter u with grave accent) instead of em dash!
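
The effect is easy to reproduce by decoding the same octet under different encodings. This is a small sketch in Python, assuming that the codecs cp1252, latin-1 and cp437 shipped with Python are fair stand-ins for the Windows encoding, ISO 8859-1 and a typical DOS code page respectively; the actual DOS code page on a given machine may differ.

```python
# The single octet that Alt-0151 stores in a Windows-encoded file.
octet = bytes([151])

# The same octet, read under three different assumptions about the encoding.
interpretations = {
    "cp1252": octet.decode("cp1252"),    # Windows encoding
    "latin-1": octet.decode("latin-1"),  # ISO 8859-1: control character range
    "cp437": octet.decode("cp437"),      # one common DOS code page
}

if __name__ == "__main__":
    for codec, char in interpretations.items():
        # Show the resulting code position and a printable representation.
        print(f"{codec:8} U+{ord(char):04X} {char!r}")
```

Nothing in the file itself says which of these readings is intended, which is why the result depends on the program and system that processes it.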

On the Web, people use different browsers on different systems. Therefore, anything you put onto the Web is thereby "virtually" transferred to a huge variety of systems. Consequently, an HTML document for the Web should not contain anything that works on some operating systems only, no matter how common they are.

The problematic characters are often produced by different programs, such as HTML editors or converters. Naturally, they shouldn't behave that way, but many of them actually do. It's often a good idea to check that output from such tools does not contain any octets (bytes) in the range 128 - 159 decimal (200 - 237 octal). (A very simple C program could do that, for example.)
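
The same kind of check can be sketched in a few lines of Python. The function name find_suspect_octets is made up for this illustration, and the sketch assumes that the files are meant to be ISO 8859-1 encoded; in a multi-octet encoding, octets in this range can be perfectly legitimate.

```python
import sys


def find_suspect_octets(data: bytes):
    """Return (offset, value) pairs for octets in the range 128 - 159 decimal."""
    return [(i, b) for i, b in enumerate(data) if 128 <= b <= 159]


if __name__ == "__main__":
    # Usage: pass one or more file names on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:  # binary mode: we want octets, not characters
            for offset, value in find_suspect_octets(f.read()):
                print(f"{path}: offset {offset}: octet {value} (octal {value:o})")
```

A file that produces no output contains none of the octets in question, although it may of course still have other problems.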

Attempts to present the characters

The following table summarizes the most common attempts to present in HTML the characters we discuss here. For concreteness, the table shows examples of presenting a particular character, the em dash.

method                                     example                            problems
"raw data"                                 (octet with value 151 in decimal)  roughly speaking, works on Windows browsers only
numeric character ref. using Windows code  &#151;                             roughly speaking, works on Windows browsers only
(symbolic) entity reference                &mdash;                            increasing support, but not very wide yet
correct numeric character reference        &#8212;                            usually not supported yet
an image                                   <IMG SRC="mdash.gif" ALT="--">     does not match the size of normal characters (except by accident)

Presenting a character as "raw data" simply means that the character is presented as an octet (byte) or a sequence of octets according to the encoding used for the document. This is how most characters are actually presented in HTML documents. There is nothing mystical about it. (If you type characters from a keyboard using an editor, what normally happens is that you actually enter characters as "raw data" in some encoding; in some cases, you use some special methods for entering characters when they cannot be directly typed.) The problem with the "raw data" method is that it works only for those browsers (and other user agents) which can handle data in the specific encoding used. There is a very large number of registered character encodings (and many unregistered encodings, too). One can hardly expect Web browsers generally to handle whatever encoding an author has decided to use. In fact, the ISO 8859-1 encoding is the only encoding which can reasonably be expected to be known to any browser. Although the Windows encoding is very widely used, it is usually not understood by browsers running on systems other than Windows; there are good reasons why it should be, but in fact it isn't. On the other hand, browsers running in a Windows environment usually treat documents according to the Windows encoding if the server does not specify the encoding or if the encoding is specified to be ISO 8859-1.

In principle, if the "raw data" method is used, the server should send an HTTP header which specifies the encoding used. When octets are to be interpreted according to the Windows encoding (e.g. octet 151 means em dash), the server should send
Content-Type:text/html;charset=ISO-8859-1-Windows-3.1-Latin-1
However, for reasons explained above, such headers usually don't make browsers process the data any better than they would by default.

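In the long run, the problem will be solved when browsers support widely enough the two methods which are defined in the HTML 4.0 Specification: symbolic entity references such as &mdash;, and numeric character references based on ISO 10646 code positions.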

In an intranet where you can make sure that all browsers satisfy applicable requirements, you could use either of the above-mentioned methods even today, and expect them to work in the future on the Web too. Naturally you should carefully test that your selected method actually works on the browsers and browser settings (especially font settings) it needs to work on. It is true that intranet operability could often be achieved using the Windows-specific hacks, too. But it shouldn't be more difficult to ensure that the method used works according to specifications and across platforms; and then you wouldn't need to worry in the future when you need to put the same documents onto an extranet or the Internet, too.

There might be some other very special cases. If you need to include e.g. Greek and Cyrillic letters on one page, then any methods for using such a large character repertoire (one of which is described in my document Using national and special characters in HTML) considerably limit accessibility at present and in the near future. If you have good reasons to do so, then you might as well include "smart quotes", em and en dashes, and other characters discussed here, naturally using the method you have selected to solve the fundamental problem. (When using "the most universal way" described in that document of mine, you would use &#8212; for em dash etc., using the Unicode positions mentioned in the list of characters above, converted from hexadecimal to decimal.)
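
Since the list of characters gives ISO 10646 positions in hexadecimal, whereas a numeric character reference of the form &#nnnn; is read as decimal, the conversion is worth spelling out. The following is a minimal sketch in Python; numeric_reference is a made-up helper, and html.unescape from the Python standard library is used only as a convenient checker. It follows present-day HTML parsing rules, so it says nothing about what any particular browser supports.

```python
import html


def numeric_reference(hex_position: str) -> str:
    """Turn a hexadecimal code position such as '2014' into a decimal reference."""
    return f"&#{int(hex_position, 16)};"


if __name__ == "__main__":
    for name, position in [("em dash", "2014"), ("trade mark sign", "2122")]:
        ref = numeric_reference(position)
        # Decode the reference again to confirm it denotes the intended position.
        decoded = html.unescape(ref)
        print(f"{name}: U+{position} -> {ref} -> U+{ord(decoded):04X}")
```

Writing the hexadecimal digits directly after &# without converting them yields a reference to a quite different code position.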

For more detailed explanations of some of the problems, see ISO-8859 briefing and resources by Alan J. Flavell.

Various hacks have also been suggested, such as using a few no-break spaces within a STRIKE element to "construct" an em dash! I have prepared a small test file containing examples of and annotations on such attempts as well as the above-mentioned methods.

Suggested substitutes

Whenever you need a character and can't use it, you need to consider substitutes. For the characters discussed here, relatively good substitutes can be found:

Suggested substitutes for "special" Windows characters
Windows name of char.  substitute       comments
baseline single quote  '                apostrophe used as single quote
florin                 <I>f</I>         letter f in italics
baseline double quote  "                quotation mark (double quote)
ellipsis               ...              three dots
dagger                 &sup1;           assuming use as footnote ref.
double dagger          &sup2;           assuming use as footnote ref.
circumflex accent      ^                circumflex
permile                o/oo             usual, but somewhat illogical
S Hacek                Sh or SH         language-dependent
left single guillemet  <                "<" used as "left angle bracket"
OE ligature            Oe or OE         natural due to what "ligature" means
left single quote      '                apostrophe used as single quote
right single quote     '                apostrophe used as single quote
left double quote      "                quotation mark (double quote)
right double quote     "                quotation mark (double quote)
bullet                 * or -           asterisk or hyphen
endash                 -                hyphen
emdash                 --               two hyphens
tilde accent           ~                tilde
trademark ligature     <SUP>(TM)</SUP>  (TM) in superscript style
s Hacek                sh               language-dependent
right single guillemet >                ">" used as "right angle bracket"
oe ligature            oe               natural due to what "ligature" means
Y Dieresis             IJ or Y          depending on intended meaning
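
Applying the table mechanically could look like the following sketch in Python. It assumes the input is a sequence of octets in the Windows encoding and that the output is to be pasted into HTML source. Where the table offers alternatives (Sh or SH, * or -, IJ or Y), one of them has been picked arbitrarily, and the angle brackets are written as &lt; and &gt; because they end up in HTML source; both are choices of this illustration, not of the table. The names SUBSTITUTES and substitute are made up.

```python
# Octet value in the Windows encoding -> substitute, as octets for HTML source.
SUBSTITUTES = {
    130: b"'",       131: b"<I>f</I>", 132: b'"',       133: b"...",
    134: b"&sup1;",  135: b"&sup2;",   136: b"^",       137: b"o/oo",
    138: b"Sh",      139: b"&lt;",     140: b"OE",      145: b"'",
    146: b"'",       147: b'"',        148: b'"',       149: b"*",
    150: b"-",       151: b"--",       152: b"~",       153: b"<SUP>(TM)</SUP>",
    154: b"sh",      155: b"&gt;",     156: b"oe",      159: b"Y",
}


def substitute(data: bytes) -> bytes:
    """Replace each octet that has a listed substitute; pass all others through."""
    return b"".join(SUBSTITUTES.get(b, bytes([b])) for b in data)


if __name__ == "__main__":
    sample = b"This gas\x97argon\x97is inert. Let\x92s all use UNIX\x99!"
    print(substitute(sample).decode("ascii"))
```

A blind replacement like this cannot make the language-dependent or meaning-dependent choices the comments column refers to, so the result still needs to be read by a human.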

Notes:


Date of last update: 1998-10-13

Note to Finnish readers: This document is an extended version of my Finnish-language document Mikrojen merkistöjen aiheuttamista ongelmista Webissä (on the problems caused by microcomputer character sets on the Web).

Jukka Korpela, Jukka.Korpela@hut.fi