Standards for Hebrew on the Web

Characters, character encodings, and character references

As of December 1997, HTML 4 is a W3C Recommendation. It adopts, with some improvements, the internationalization features introduced in RFC 2070 (January 1997). In particular, an HTML 4 document can use, simultaneously, characters from the large repertoire known as the Universal Character Set, defined in ISO 10646 (and Unicode). HTML 4 supports all characters used in Hebrew and Yiddish, including points (nikud), accents (ta'amey hamikra), and Yiddish digraphs.

Documents may be encoded in any character encoding that the server and the browser agree upon. For a browser to process a document correctly, the server must provide a charset parameter (in the HTTP Content-Type header) with every document it serves. This does not guarantee that the browser will be able to process the document, but it is necessary in order to identify the encoding.
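
As a rough illustration of why the charset parameter matters, the Python sketch below reads it from a Content-Type header value and uses it to decode the body. The header value and the body bytes are made-up examples, and the standard library's email.message.Message is used here only as a convenient header parser; it is not a model of what a browser does.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical response: a single octet, 0xE0, which is alef in ISO-8859-8.
header = "text/html; charset=ISO-8859-8"
body = b"\xe0"

charset = charset_of(header)  # the parser returns the name lower-cased
text = body.decode(charset)   # without the label, 0xE0 alone is ambiguous
```

Decoding the same octet as ISO-8859-1 would yield a Latin letter instead, which is the kind of misreading the charset parameter is there to prevent.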

Many character encodings, such as the various ISO-8859-n encodings, use a single octet to encode each character, so many characters cannot be encoded directly in them. To allow characters outside the directly encodable set, HTML provides character references, which refer to a character by its code position in the Universal Character Set. Numeric character references may use decimal or hexadecimal notation. In addition, some characters may be referred to by named entity references, which are easier to remember. These include some Latin characters, special symbols such as the copyright symbol, and mathematical symbols, among others.
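
The three kinds of reference can be tried out with Python's standard html module, as in the sketch below. The sample strings are illustrative only. The second half shows the reverse direction: encoding a mixed string into ISO-8859-1, where the characters that do not fit are replaced by decimal character references.

```python
import html

# Decimal, hexadecimal, and named references all resolve to characters.
alef_decimal = html.unescape("&#1488;")   # U+05D0 HEBREW LETTER ALEF
alef_hex = html.unescape("&#x5D0;")       # same code position, in hex
copyright_sign = html.unescape("&copy;")  # U+00A9 COPYRIGHT SIGN

# ISO-8859-1 can hold e-acute directly, but the Hebrew letters
# must be written as numeric character references.
mixed = "\u00e9 \u05e9\u05dc\u05d5\u05dd"  # e-acute, space, "shalom"
encoded = mixed.encode("iso-8859-1", errors="xmlcharrefreplace")
```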

Hebrew characters may be encoded directly in the character encodings that allow it, and by character references in all encodings. ISO-8859-8 and Windows-1255 are the most commonly used encodings for documents in languages written in the Hebrew script. In both, each of the 27 Hebrew characters (22 letters and 5 final letters) is encoded as a single octet. Windows-1255 also encodes points, Hebrew punctuation marks, and Yiddish digraphs as single octets.
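
The difference between the two encodings can be checked with a short Python sketch. It assumes Python's codec names "iso-8859-8" and "cp1255" (Python's name for Windows-1255); the helper function is hypothetical and exists only for this illustration.

```python
# The 27 letters occupy U+05D0 (alef) through U+05EA (tav).
HEBREW_LETTERS = [chr(cp) for cp in range(0x05D0, 0x05EB)]


def fits_in_one_octet(char, encoding):
    """True if `char` encodes to exactly one byte in `encoding`."""
    try:
        return len(char.encode(encoding)) == 1
    except UnicodeEncodeError:
        return False  # the encoding has no slot for this character


letters_ok = all(
    fits_in_one_octet(ch, enc)
    for ch in HEBREW_LETTERS
    for enc in ("iso-8859-8", "cp1255")
)

sheva = "\u05b0"       # a point (nikud)
double_vav = "\u05f0"  # a Yiddish digraph

points_in_1255 = fits_in_one_octet(sheva, "cp1255")
points_in_8859_8 = fits_in_one_octet(sheva, "iso-8859-8")
```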

Bi-directional text

The Hebrew script is bi-directional: some text (e.g., words) runs right-to-left and some (e.g., numbers) runs left-to-right. HTML 4 supports bi-directional text as defined by Unicode's bi-directional algorithm. All text in a document is stored in logical order, namely the order in which one would normally type it, and it is the responsibility of the displaying device (a browser or a printer, for example) to reorder the characters properly.

The bi-directional algorithm assigns bi-directional properties to each character and defines how characters in a single text block are displayed as a function of these properties and the base direction of the text block.
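
The per-character properties mentioned above can be inspected with Python's unicodedata module. The sketch below only looks up the property values; it does not implement the reordering itself, and the sample characters are arbitrary.

```python
import unicodedata

samples = {
    "alef": "\u05d0",
    "latin a": "a",
    "digit 1": "1",
    "space": " ",
}

# Map each sample to its bi-directional class:
# R = right-to-left, L = left-to-right,
# EN = European number, WS = whitespace.
bidi_classes = {
    name: unicodedata.bidirectional(ch) for name, ch in samples.items()
}
```

A Hebrew letter is strongly right-to-left and a Latin letter strongly left-to-right, while digits and spaces have weaker classes whose final direction depends on their context and on the base direction of the block.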

The base directionality of a text block is marked in HTML with the dir attribute, which applies to most elements. The dir attribute also sets the directionality of table columns. The attribute is inherited, so when writing a document in Hebrew one would normally mark the root element with <html dir="rtl"> and type all the text normally. An HTML 4 conforming browser will then display the text correctly, even if the document contains some phrases in the Latin (or another left-to-right) script.

The bi-directional algorithm also allows embedding strings of characters with a different directionality within a paragraph. This is useful when quotations are nested several levels deep. These embeddings, too, are marked with the dir attribute.
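
To make inheritance and embedding concrete, the following Python sketch walks a small, made-up document and records the dir value in effect for each run of text. The DirTracker class is hypothetical, it assumes every element is explicitly closed (no void elements such as br), and it takes left-to-right as the default when no dir is given. It tracks only the attribute; it does not reorder any characters.

```python
from html.parser import HTMLParser


class DirTracker(HTMLParser):
    """Record each non-empty text run with the dir value it inherits."""

    def __init__(self):
        super().__init__()
        self.stack = ["ltr"]  # assumed default base direction
        self.runs = []

    def handle_starttag(self, tag, attrs):
        own = dict(attrs).get("dir")
        # An element without dir inherits the value of its parent.
        self.stack.append(own.lower() if own else self.stack[-1])

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((data.strip(), self.stack[-1]))


# Hebrew "shalom" followed by an embedded left-to-right phrase.
doc = (
    '<html dir="rtl"><body><p>\u05e9\u05dc\u05d5\u05dd '
    '<span dir="ltr">Hello</span></p></body></html>'
)

tracker = DirTracker()
tracker.feed(doc)
tracker.close()
runs = tracker.runs
```

The paragraph carries no dir of its own and picks up rtl from the html element, while the span establishes a left-to-right embedding for its own content only.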

In some rare cases the bi-directional algorithm fails to give the required result. The solution is a special type of embedding called a bi-directional override. Text enclosed in an override is rendered in a fixed order, ignoring the bi-directional properties of the individual characters. In HTML, a bi-directional override is marked with the bdo element; the dir attribute on this element specifies the fixed directionality of the enclosed text.

Prior to RFC 2070 (January 1997) there was no standard for writing right-to-left scripts in HTML. As most browsers at the time had no bi-directional support, some Hebrew HTML documents were written using the so-called "visual method." In this method, characters are stored in the order in which an application that renders all characters left-to-right will display them correctly. It is possible to write visual documents in a standards-conforming way by overriding the directionality of essentially all the text in the document. As bdo is an inline element, this requires marking the override many times throughout the document. Writing visual documents without marking the override is an error.
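
The relationship between logical order, visual order, and the override can be sketched in Python as follows. Both helper functions are hypothetical. The conversion is deliberately naive: it is valid only for a single line made up entirely of Hebrew letters, with no digits, Latin text, or points. The choice of dir="ltr" for the override follows from the description above, since visual text is stored for left-to-right rendering.

```python
from html import escape


def to_visual(logical_line):
    """Naive logical-to-visual conversion for a purely Hebrew line."""
    return logical_line[::-1]


def wrap_override(visual_line):
    """Mark visually stored text with an explicit left-to-right override."""
    return '<bdo dir="ltr">' + escape(visual_line) + "</bdo>"


logical = "\u05d0\u05d1\u05d2"  # alef, bet, gimel, in typing order
visual = to_visual(logical)     # gimel, bet, alef, in storage order
markup = wrap_override(visual)
```

Rendered left-to-right, the stored sequence places alef at the right edge, which is where a reader of Hebrew expects the first letter to be.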

ISO-8859-8 and bi-directionality

During the period in which the Web existed without standards for Hebrew, a convention (based on RFC 1556) developed for identifying whether a document requires bi-directional processing or should be processed "visually". The convention applies only to ISO-8859-8. Documents that require bi-directional handling are labeled with charset=iso-8859-8-i (where i stands for "implicit directionality"), and "visual" documents are labeled with charset=iso-8859-8. In HTML, however, the character encoding does not assign directionality in any way. Directionality is assigned to characters by Unicode's bi-directional algorithm and by additional HTML markup. Thus, when writing a "visual" document one must not rely on its charset label and must override directionality explicitly, as HTML 4 requires.
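
A program that follows this convention has to separate two things the label bundles together: which octet-to-character mapping to use, and which storage order the author intended. The Python sketch below does that with a hypothetical helper. It maps both labels to Python's "iso-8859-8" codec explicitly instead of assuming the interpreter recognizes the "-i" label.

```python
def split_hebrew_charset(label):
    """Return (codec_name, storage_order) for an ISO-8859-8 family label."""
    normalized = label.strip().lower()
    if normalized == "iso-8859-8-i":
        return "iso-8859-8", "logical"  # needs bi-directional processing
    if normalized == "iso-8859-8":
        return "iso-8859-8", "visual"   # stored for left-to-right display
    raise ValueError("not an ISO-8859-8 label: " + label)


codec_i, order_i = split_hebrew_charset("ISO-8859-8-I")
codec_v, order_v = split_hebrew_charset("iso-8859-8")

# Both labels decode octets identically; only the convention differs.
same_mapping = b"\xe0".decode(codec_i) == b"\xe0".decode(codec_v)
```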

In addition, authors should follow the charset convention for backward compatibility with old and non-standard browsers that rely on the labeling.

Language

As far as HTML and HTTP are concerned, the language of a document is entirely unrelated to its characters and character encoding. HTTP can indicate the language of a document's intended audience via the Content-Language header, which may list more than one language. HTML allows marking the language of any piece of text with the lang attribute (or xml:lang in XML-based versions of HTML), which applies to most HTML elements. Both HTTP and HTML use ISO's two-letter language codes (for example, he for Hebrew and yi for Yiddish) to label languages.
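
As a small illustration of a Content-Language header that lists more than one language, the Python sketch below splits a made-up header value into its language codes. The function name and the sample value are assumptions made for this example, and the parsing is simplified to a comma split.

```python
def parse_content_language(value):
    """Split a Content-Language value into lower-cased language codes."""
    return [tag.strip().lower() for tag in value.split(",") if tag.strip()]


# Hypothetical header for a document aimed at Hebrew and English readers.
languages = parse_content_language("he, en")
```

The same codes would appear in markup as lang attribute values on the elements whose text is in the corresponding language.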
