Soft hyphen (SHY) - a hard problem?

Summary

There seems to be a fundamental misunderstanding about the soft hyphen character (often abbreviated SHY; one HTML notation: &shy;). Although the ISO Latin 1 standard (ISO 8859-1) makes things perfectly clear, saying that it is a visible hyphen, to be used in a specific context, it is commonly regarded as a hidden hyphenation hint. These two views are incompatible.

The misunderstanding is probably caused by the strong need for hyphenation in Web documents. This need is very real, but it is incorrect to try to address it in a manner which violates the character code standards. Moreover, explicit hyphenation hints can play only a very small role in the solution of the hyphenation problem, and the (mis)use of SHY would not even be the best way of giving hyphenation hints.

On the other hand, the HTML 4.0 specification defines SHY as a hyphenation hint, although in a manner which suggests that universal support for it is not to be expected.

Thus, &shy; could be used at most as an occasional hyphenation hint in special cases, with the risk that it may not have such an effect and it may even be displayed as a normal hyphen in any context by some browsers.

What the ISO Latin 1 standard says

The ISO Latin 1 character code, also known as ISO 8859-1, and the ISO 8859 character sets in general, contain a character named soft hyphen, abbreviated SHY, code value 255 in octal (173 in decimal, AD in hexadecimal). In general, the ISO 8859 standards specify the characters and their codes only, not the use of the characters. However, soft hyphen is one of the few exceptions.
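As a quick check of the code value mentioned above, here is a minimal Python sketch. Python and the name SHY_CODE are used only for illustration; they are not part of any of the standards discussed. It shows the same number in octal, decimal and hexadecimal and confirms that the character occupies a single byte in ISO 8859-1.

```python
# The soft hyphen code value, written in octal as in the text.
SHY_CODE = 0o255

# The same value in decimal and hexadecimal notation.
assert SHY_CODE == 173 == 0xAD

# In ISO 8859-1 (Python codec name "latin-1") the character is one byte.
shy = chr(SHY_CODE)
encoded = shy.encode("latin-1")

print(oct(SHY_CODE), SHY_CODE, hex(SHY_CODE), encoded)
```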

The ISO 8859-1 standard defines, in section 6.3.3, both the graphic presentation and the usage of soft hyphen, as follows:

A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen, for use when a line break has been established within a word.

The ISO 10646 standard, which is the document character set for HTML, is an extension of ISO 8859-1. It contains no special remark about the soft hyphen.

Thus, soft hyphen is a visible (graphic) character, not an invisible hyphenation hint. Soft hyphen is not related to any word division process to be applied to the text but may indicate what has happened in such a process when the text was produced.

Such a character seems to appear in the EBCDIC character set, too, under the name "syllable hyphen", also abbreviated "SHY". It is described in a glossary by IBM as follows:

syllable hyphen
In the OfficeVision program, a hyphen used to divide a word at the end of a line; it may be removed when the OfficeVision program adjusts lines. Contrast with required hyphen.

As regards the difference between a normal hyphen and a soft hyphen, the specification says that there might or might not be some visible difference. The difference in usage is that the soft hyphen is for a particular use: in practice, at the end of a line. The specification does not prohibit the use of a normal hyphen there as well. Thus, it is possible but not mandatory to indicate whether a hyphen at the end of a line belongs to the word itself (and is to be presented even if the word is not divided across lines at that point), by using a normal hyphen in that case and a soft hyphen otherwise.

What HTML specifications HTML 2.0 and HTML 3.2 say

One might expect that SHY is pretty irrelevant in HTML documents, since normally one should not divide a word into two lines in HTML.

According to the HTML 2.0 specification (RFC 1866), the document character set of an HTML document must be ISO Latin 1 or some superset of it. (HTML 3.2 clarified this further: the document character set shall be a specific superset, namely ISO 10646.) The HTML 2.0 specification describes the processing of characters in the section Characters, Words, and Paragraphs.

The visible presentation involves reformatting, but of course any ISO Latin 1 character for which no specific treatment is mandated shall be presented as such. (Variations in sizes, font faces etc are allowed, of course.) There is no specific rule for a soft hyphen, just the following note:

Use of the non-breaking space and soft hyphen indicator characters is discouraged because support for them is not widely deployed.

Being a note, this does not introduce any additional rule. It implies that the two characters mentioned must be supported but warns authors (without relaxing requirements on browsers) that those characters may not be processed according to the specifications on all browsers.

The HTML 3.2 specification is partly less informative than HTML 2.0, so we must really assume the material in HTML 2.0 as the default when reading HTML 3.2. It mentions the soft hyphen in no other way than by introducing a quasi-symbolic notation, &shy;, for it, to be used as identical with the older &#173; notation.
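To illustrate that the symbolic notation &shy; and the numeric notations &#173; and &#xAD; stand for one and the same character, here is a minimal Python sketch. It relies on the standard library function html.unescape, which follows HTML5 entity rules rather than HTML 3.2; using it here is an assumption made for convenience, and the variable names are purely illustrative.

```python
import html

# Three notations for the same character inside a word.
named = html.unescape("much&shy;needed")
numeric = html.unescape("much&#173;needed")
hexadecimal = html.unescape("much&#xAD;needed")

# All three decode to the word with a single U+00AD character in the middle.
assert named == numeric == hexadecimal == "much\u00adneeded"

# The reference expands to one character, not to markup.
print(len(html.unescape("&shy;")))
```

Whatever a browser then does with the character (display it, hide it, or break the line there) is a separate question from how the reference is resolved.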

So what is the intended use of SHY in HTML documents? Should a line ending with SHY be treated so that effectively the SHY and the following newline are removed, restoring the integrity of a word that had been divided across lines? This would be natural, but there is no specific requirement on this and it does not logically follow from the definition of ISO Latin 1 (which does not prescribe in any way how the text should be processed in any subsequent reformatting; such issues are outside the scope of a character code standard).

The so-called internationalization activity has produced a short document on hyphenation which seems to take it for granted that SHY is a hyphenation hint. This is reflected in the document Internationalization of the Hypertext Markup Language (RFC 2070; dated January 1997) which contains the following:

NOTE - the soft hyphen character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &shy;. Its semantics are different from the plain hyphen: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not displayed at all. In operations like searching and sorting, it must always be ignored.

Strangely, this is presented in a note as if referring to a specification set up elsewhere, but without citing any source.

A review of some discussions

This section mainly reviews some discussions which the author found using InfoSeek.

In a discussion on a mailing list (HTML-WG) as early as 1994, an article suggested two different interpretations:

A reply to this quoted the ISO formulation but drew the conclusion that the latter interpretation is true! Of course, both are wrong. The statement in ISO 8859-1 says two things about SHY (emphases mine): It does not say "can be established" but "has been established". This means, in particular, that when a program formats text, it can hyphenate words so that the formatted text carries information which tells whether a hyphen at the end of a line is part of a word itself (as in corn-crake) or was introduced when hyphenating for word division. (This means that if the formatted text is to be formatted again, the program could know how to restore words which have been divided.) This could be done so that the formatting process leaves a normal hyphen in the word when dividing corn-crake into corn- (at the end of a line) and crake (at the beginning of the next line) but introduces a soft hyphen on other divisions.
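The idea of restoring divided words can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function name rejoin is hypothetical, every line-final hyphen is taken to be part of a divided word (so a dash used as punctuation at the end of a line would be handled wrongly), and no standard prescribes this processing.

```python
SHY = "\u00ad"  # soft hyphen, ISO 8859-1 code value 173


def rejoin(lines):
    """Join already formatted lines back into running text.

    A line-final soft hyphen marks a division made by the formatter,
    so it is dropped. A line-final normal hyphen belongs to the word
    itself (as in corn-crake), so it is kept.
    """
    words = []
    pending = ""  # first part of a word divided across lines
    for line in lines:
        text = pending + line.strip()
        pending = ""
        if text.endswith(SHY):
            pending = text[:-1]  # drop the hyphen added by word division
        elif text.endswith("-"):
            pending = text       # keep the hyphen that is part of the word
        else:
            words.append(text)
    if pending:
        words.append(pending)
    return " ".join(words)


formatted = ["a corn-", "crake in the hy" + SHY, "phenated text"]
print(rejoin(formatted))
```

With the sample input, the normal hyphen in corn-crake survives while the soft hyphen inside "hyphenated" disappears, which is the distinction the formatted text was meant to carry.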

Yet another interpretation of what SHY should mean was presented in an article which suggested that it might mean the following:

a real hyphen which is an allowed breakpoint, as in much&shy;needed and as opposed to a non-breaking one like X-Windows (where you don't want an X- at the end of a line)

In a news article Olle Jarnefors has commented on several character issues and raised the following question:

- - the current definition of SHY - - says nothing about how to image the character when it is within a word on a line and not at the end of the line (which should be more frequent than the other situation).

He suggested that SHY should not be shown in other positions than at the end of a line.

My interpretation of the wording of ISO 8859-1 is that SHY shall definitely be presented as a hyphen, either identical with or similar to hyphen, in all positions. This is a requirement on programs which present ISO 8859-1 characters in visible form. The occurrence of SHY in other positions than at the end of a line is a violation of the standard by the person or program which produced the text, but such a violation implies no change in the requirements of the standard. (As a remote analogue, if I violate the rules of the English language by misspelling a word, programs should not "fix" this by displaying my text differently from the requirements of the character code used. They may of course suggest changes to the text itself. Similarly, a Web browser might separately report that SHY is used incorrectly in the document, but it must still display SHY as a hyphen.)

The question of the nature of SHY seems to appear again and again in discussions about HTML. For example, the document Suggestion for hyphenation indications in HTML - <HYPH>, (May 1996) says:

The current situation in HTML is that the only possible way to specify to the client where a hyphenation can be done is by using the soft hyphen character.

Although the document then says that this does not work in practice (and suggests an alternative method for giving hyphenation hints), the statement still paints the wrong picture.

The soft hyphen issue has been discussed in the www-html list. Interestingly, an article was posted which quoted a message from the president of the Unicode consortium, saying just the following:

The Unicode character 00AD is defined to be invisible, except at the end of a line, where it may or may not be visible, depending on the script.

(Here "00AD" means the character which has code 00AD in hexadecimal, i.e. the soft hyphen character.) As I commented in my reply, such a statement, despite its authoritative appearance, cannot settle the question. (See notes on the conflict between ISO 8859-1 and Unicode below.)

The HTML 4.0 view

It should be clear from the preceding arguments that the soft hyphen character is, per se, a visible character for a specific use which can hardly be frequent in HTML. On the other hand, the specification of a markup language like HTML can naturally define specific semantics for printable characters. Just as the less than sign (<) has a very special meaning, not derivable from the normal meaning of "less than", the soft hyphen might be defined to mean something which is more or less different from its meaning in ISO 8859. (These cases are not quite comparable, since the less than sign is the concrete syntax representation of SGML abstract syntax whereas the soft hyphen character would be separately defined to have a special meaning as a data character. Thus, the space character and the non-breaking space are closer analogues.)

In a sense, the HTML 4.0 specification is more explicit here than the HTML 2.0 and HTML 3.2 specifications. It says, in section Text, subsection Lines and Paragraphs, clause Hyphenation:

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;)
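The semantics quoted above can be illustrated with a small Python sketch. It is not how any particular browser works: the function names break_lines and contains are hypothetical, the line filling is a simple greedy one over a fixed character width, and the rest of a divided word is not divided again even if it contains further soft hyphens.

```python
SHY = "\u00ad"  # soft hyphen


def break_lines(text, width):
    """Fill lines greedily, dividing words only at soft hyphens."""
    lines, current = [], ""
    for word in text.split():
        plain = word.replace(SHY, "")  # unbroken words show no hyphen
        sep = " " if current else ""
        if len(current + sep + plain) <= width:
            current = current + sep + plain
            continue
        pieces = word.split(SHY)
        # Try the longest head of the word that still fits with a hyphen.
        for k in range(len(pieces) - 1, 0, -1):
            head = "".join(pieces[:k])
            if len(current + sep + head + "-") <= width:
                lines.append(current + sep + head + "-")  # visible hyphen
                current = "".join(pieces[k:])
                break
        else:
            # No usable soft hyphen: move the whole word to a new line.
            if current:
                lines.append(current)
            current = plain
    if current:
        lines.append(current)
    return lines


def contains(haystack, needle):
    """Substring search that ignores soft hyphens on both sides."""
    return needle.replace(SHY, "") in haystack.replace(SHY, "")


sample = "a very long hy" + SHY + "phen" + SHY + "ation example"
print(break_lines(sample, 16))
print(break_lines(sample, 40))
```

With a width of 16 the word is divided after "hy" and a hyphen is shown at the end of the first line; with a width of 40 everything fits on one line and no hyphen is shown at all.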

In a sense, this clearly defines the "discretionary hyphen" semantics for the soft hyphen. (In fact, it could be clearer, since the phrase "where a line break can occur" is somewhat abstract and obscure; line breaks do not just "occur", they are produced by user agents.) And the character entity reference list even names the character as "soft hyphen = discretionary hyphen".

But notice the wording "Those browsers that interpret soft hyphens". Perhaps the intent is to refer to the fact that browsers need not divide words into lines at all, in which case they would ignore soft hyphens. But the obvious interpretation of the wording is that browsers need not "interpret soft hyphens" at all. In that case they would treat them as normal data characters having their normal meanings as set up in ISO 8859-1. This would imply that soft hyphens are displayed as hyphens.

Moreover, a simple test shows that (at least some versions of) Netscape 4.0 and Internet Explorer 4.0 basically treat soft hyphens as plain data characters which are always visible. IE may divide a word into two lines where a soft hyphen occurs, but it does the same for normal hyphens.

What the Unicode standard says

In the Unicode standard, version 2.0, soft hyphen is defined as follows:

U+00AD soft hyphen indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible).

The standard also defines U+2027 hyphenation point, but that character is irrelevant for the discussion here. (It is a raised dot, resembling a middle dot and used to indicate correct word breaking, e.g. in dictionaries. It is definitely a printable character, and the very purpose of its use implies that it must be visible. Naturally, this makes it very clear that the soft hyphen is not intended to be a visible hyphenation hint to humans; the discussion revolves around the question whether it is intended to be an invisible hyphenation hint to computer programs.)
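For readers who want to look at the characters mentioned here, the following Python sketch prints their names and general categories using the standard unicodedata module. Note the assumption involved: the module reflects the Unicode version bundled with the Python interpreter in use, which is much later than Unicode 2.0, so its output illustrates current character data and cannot settle what version 2.0 intended.

```python
import unicodedata

# Soft hyphen, hyphenation point, and the normal hyphen (hyphen-minus).
characters = ["\u00ad", "\u2027", "\u002d"]

# Map each code point to its name and general category.
info = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
    for ch in characters
}

for code, (name, category) in info.items():
    print(code, name, category)
```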

The code table in Unicode 2.0 mentions "discretionary hyphen" as an alternative name for soft hyphen. This, together with the definition quoted above, implies that the intention was to specify that soft hyphen is not a normal printable character but essentially an invisible hyphenation hint, which may cause a hyphen-like glyph to be rendered if some program actually divides the word into two lines at the suggested hyphenation point. One might say that this does not make the soft hyphen even a conditionally printable character, since any such glyph is used by the program to indicate what it has done. The appearance and form of such a glyph may or may not depend on whether hyphenation took place due to a hyphenation hint given by a soft hyphen or due to normal hyphenation rules (based on dictionary lookup or algorithmic rules or something else). Notice that the definition of soft hyphen in ISO 8859-1 strongly suggests that when a line break within a word has been established by a program applying hyphenation rules, it could use the (visible) soft hyphen character so that the situation can be distinguished (programmatically at least, perhaps visually too) from a hyphen that occurs at the end of a line for some other reason.

Thus, Unicode 2.0 seems to be in definite contradiction with ISO 8859-1 as regards the definition and use of the soft hyphen character. Since these important standards conflict and implementations are defective from both viewpoints, the practical conclusion is that authors should not use the soft hyphen character at all, for any purpose.

Concluding remarks: why did all this happen?

Hyphenation is one part of the problems of taking various (natural) languages and writing systems into account on the Web, an issue called internationalization (abbr. i18n) or localization, for some odd reason. It is a difficult issue, but not really among the most important in the area, except for languages with very long words and in contexts where text width is small.

Understandably, people want fast solutions to their problems. When we see the problems which arise from Web browsers using no hyphenation, we pay attention to the worst cases and wish we could solve at least them, and solve them now. The simple idea of giving an explicit hyphenation hint suggests itself. Several popular text processing programs allow you to enter "hidden hyphenation hints" which are normally invisible. (Cf. information about hyphenation in PC Webopædia.) This is probably how people started thinking that there must be a character for the purpose, and when one looks at the ISO Latin 1 specification, what else could one use but soft hyphen?

Moreover, people may regard the notation &shy; as comparable to HTML tags. But of course it is defined simply as a notation which stands for a single character, the soft hyphen.


Jukka Korpela, Jukka.Korpela@hut.fi
Originally written in April - May 1997
Updated in December 1997 to discuss the effect of HTML 4.0.
Updated in March 1998 to add remarks on Unicode.
Updated in August 1998 to add a remark on syllable hyphen.