Nir Dagan > Hebrew on the Web > Characters

Hebrew characters in XML and XHTML

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Belorussian translation provided by Web hosting rating.
Spanish translation provided by Web hosting hub.

What are Hebrew characters?

This is not a simple question. In addition to Hebrew-specific characters, languages that use the Hebrew script also use other characters, such as punctuation marks, digits, and symbols, that are shared with many other languages. To conduct a useful test of browser support, one has to make an arbitrary decision about which characters to include: for example, which currency signs and mathematical symbols should be part of the test.

The tests here include the following characters:

Characters that are specific to the Hebrew script.
All other characters that are available in the legacy Hebrew encodings ISO-8859-8 and Windows-1255.

This decision is based on two assumptions:

Legacy computer authoring systems such as word processors often support only the small set of characters of a particular legacy encoding, or at least make it difficult to include other characters. Thus, users of these systems will rarely use other characters.
The legacy encodings were designed in a sensible manner and thus include the vital non-Hebrew-specific characters that are used in languages that use the Hebrew script.
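
As a rough illustration of the inclusion rule above, the following Python sketch checks whether a single character would fall within the tested set. The function names and the code point ranges used for "Hebrew-specific" (the Unicode Hebrew block and the Hebrew presentation forms) are assumptions made for this example, not part of the original tests. The codec names `iso8859_8` and `cp1255` are the ones Python's standard library uses for ISO-8859-8 and Windows-1255.

```python
def encodable(ch, codec):
    """Return True if the character has an octet in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def in_test_set(ch):
    """Approximate the inclusion rule: Hebrew-specific or in a legacy encoding."""
    cp = ord(ch)
    # Assumed ranges: Hebrew block and Hebrew presentation forms.
    hebrew_specific = 0x0590 <= cp <= 0x05FF or 0xFB1D <= cp <= 0xFB4F
    return (
        hebrew_specific
        or encodable(ch, "iso8859_8")
        or encodable(ch, "cp1255")
    )


# Alef, new sheqel sign, a cantillation mark, and Latin e with acute.
for ch in ("\u05d0", "\u20aa", "\u0591", "\u00e9"):
    print(f"U+{ord(ch):04X}", in_test_set(ch))
```

Two caveats apply to this sketch: the assumed ranges also cover code points that Unicode leaves unassigned, and Python's `iso8859_8` codec accepts the C1 control range, so the result is only an approximation of the set actually tested.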

Organization of the characters

I have divided the characters into groups based on their appearance in legacy encodings, rather than their appearance in the Universal Character Set. This is useful for predicting support in browsers as many existing systems have support only for characters that belong to certain legacy encodings.

There are six groups. Within some groups, a finer division is introduced based on logical considerations.

There is a complete list of the characters that includes links to the relevant Unicode charts and information on how each character may be encoded in the Hebrew legacy encodings and by XML or XHTML entities.
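
To make the kinds of information in such a list concrete, here is a minimal Python sketch that gathers, for one character, its code point, its numeric character references, a named entity if one exists, and its octet in each legacy encoding. The function name `describe` and the dictionary keys are hypothetical. The named entities come from Python's `html.entities.codepoint2name` table, which is assumed here to match the XHTML 1.0 entity sets.

```python
from html.entities import codepoint2name

# Labels used in this document mapped to Python's codec names.
LEGACY_CODECS = {"ISO-8859-8": "iso8859_8", "Windows-1255": "cp1255"}


def describe(ch):
    """Collect the ways one character can be written in a test page."""
    cp = ord(ch)
    info = {
        "code point": f"U+{cp:04X}",
        "decimal reference": f"&#{cp};",
        "hex reference": f"&#x{cp:X};",
        # Named entity, if the table has one for this code point.
        "named entity": f"&{codepoint2name[cp]};" if cp in codepoint2name else None,
    }
    for label, codec in LEGACY_CODECS.items():
        try:
            info[label] = f"0x{ch.encode(codec)[0]:02X}"
        except UnicodeEncodeError:
            info[label] = None  # not available in this encoding
    return info


# Alef, multiplication sign, new sheqel sign.
for ch in ("\u05d0", "\u00d7", "\u20aa"):
    print(describe(ch))
```

A value of `None` means the character has no octet in that encoding (or no named entity), in which case a numeric reference is the remaining way to write it.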

Controls and basic Latin. This is the part of ASCII allowed in XML documents, thus excluding unused controls. They are divided into three subgroups:
1. Controls allowed in XML
2. XML markup delimiters. These characters may also appear as regular text.
3. Other ASCII
Additional Latin. These are all the characters that appear in ISO-8859-1, ISO-8859-8, and Windows-1255, are encoded with the same octet in all three encodings, and are not included in the previous set of characters.
Basic Hebrew. These are all the characters that appear both in ISO-8859-8 and Windows-1255, and are encoded with the same octet in both encodings, and are not included in the previous two sets. They are logically divided into three subgroups:
1. Hebrew letters. The twenty-two Hebrew letters and the five final letter forms.
2. Formatting characters. These are bi-directional controls that may assist in ordering characters properly.
3. Math symbols. These symbols also exist in ISO-8859-1, but are encoded with a different octet compared to the Hebrew legacy encodings.
Additional Hebrew ISO. All the characters in ISO-8859-8 that are not included in the previous three sets. Note that these characters are not included in Windows-1255; thus, contrary to a common myth, Windows-1255 is not an extension of ISO-8859-8.
Additional Hebrew Windows. All the characters in Windows-1255 that are not included in the four previous sets. These are divided into three subgroups:
1. Points and punctuation. Hebrew points and Hebrew-specific punctuation marks.
2. Yiddish digraphs.
3. Latin, symbols, and general punctuation. Some of these, such as quotation marks and the new sheqel sign, are often used in Hebrew.
Additional Hebrew Unicode. All the characters in the Universal Character Set that are identified as Hebrew characters, and are not included in the previous five sets. They are divided into three subgroups:
1. Points and punctuation.
2. Cantillation marks. Used mainly in the Hebrew Bible.
3. Presentation forms. The first subgroup consists of combinations that have an equivalent formed by combining a Hebrew letter or a Yiddish digraph with one or more points. The other three subgroups are glyphic variants, divided by language.
  1. Letter-point combinations
  2. Traditional Hebrew
  3. Modern Hebrew
  4. Judeo-Spanish
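
The octet-based definitions of the second through fifth groups can be approximated mechanically. The sketch below is one way to do so in Python, under stated assumptions: it relies on the mapping tables shipped with Python's `latin_1`, `iso8859_8`, and `cp1255` codecs, it only looks at octets 0x80 to 0xFF (so the ASCII group is left out), and it skips the C1 control range that Python's tables assign to some of those octets. The helper name `high_octets` and the set names are hypothetical.

```python
def high_octets(codec):
    """Map each assigned octet in 0x80-0xFF to the character it decodes to."""
    table = {}
    for octet in range(0x80, 0x100):
        try:
            ch = bytes([octet]).decode(codec)
        except UnicodeDecodeError:
            continue  # octet is unassigned in this encoding
        if not 0x80 <= ord(ch) <= 0x9F:  # assumption: ignore C1 controls
            table[octet] = ch
    return table


latin1 = high_octets("latin_1")
iso = high_octets("iso8859_8")
win = high_octets("cp1255")

# Same octet in all three encodings.
additional_latin = {
    ch for octet, ch in iso.items()
    if latin1.get(octet) == ch and win.get(octet) == ch
}
# Same octet in both Hebrew encodings, not already counted.
basic_hebrew = {
    ch for octet, ch in iso.items() if win.get(octet) == ch
} - additional_latin
# Whatever remains in each Hebrew encoding.
additional_iso = set(iso.values()) - additional_latin - basic_hebrew
additional_windows = (
    set(win.values()) - additional_latin - basic_hebrew - additional_iso
)

groups = {
    "Additional Latin": additional_latin,
    "Basic Hebrew": basic_hebrew,
    "Additional Hebrew ISO": additional_iso,
    "Additional Hebrew Windows": additional_windows,
}
for name, members in groups.items():
    print(name, " ".join(f"U+{ord(c):04X}" for c in sorted(members)))
```

The finer logical subgroups (letters, formatting characters, math symbols, and so on) are not derived here, since they rest on the meaning of the characters rather than on their octets.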

The test pages

The test pages are marked up in XHTML 1.0. They are all served with an HTTP header identifying them as text/html, with a charset parameter identifying the encoding. The encoding is also identified in the XML declaration at the very beginning of each document.
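
The following Python sketch shows how a page of this kind could be assembled, with the encoding named both in the HTTP header and in the XML declaration. It is an illustration only, not the script used to produce the actual test pages: the function name `build_test_page`, the choice of the Strict DOCTYPE, and the page layout are all assumptions. Characters that the chosen encoding cannot represent are written as numeric character references.

```python
from xml.sax.saxutils import escape


def build_test_page(chars, charset, codec):
    """Return (headers, body bytes) for a minimal XHTML 1.0 test page."""
    # Escape markup delimiters so that they appear as regular text.
    cells = " ".join(escape(ch) for ch in chars)
    text = (
        f'<?xml version="1.0" encoding="{charset}"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
        '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><title>Character test</title></head>\n"
        f"<body><p>{cells}</p></body>\n"
        "</html>\n"
    )
    # Characters the codec cannot represent become numeric references.
    body = text.encode(codec, errors="xmlcharrefreplace")
    headers = {"Content-Type": f"text/html; charset={charset}"}
    return headers, body


headers, body = build_test_page(["\u05d0", "\u20aa"], "ISO-8859-8", "iso8859_8")
print(headers["Content-Type"])
print(body.decode("iso8859_8"))
```

In this example the alef is stored as a single octet, while the new sheqel sign, which ISO-8859-8 lacks, is emitted as a reference. With the `ascii` codec every non-ASCII character would be emitted as a reference.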

The tests are limited to displaying individual characters. Not all combinations of letters with points and cantillation marks are included. Also, the tests do not provide any insight into bi-directional support.

Test page encoded in US-ASCII.
Test page encoded in ISO-8859-1.
Test page encoded in ISO-8859-8.
Test page encoded in ISO-8859-8-i. This encoding is identical to ISO-8859-8. The separate name was introduced by RFC 1556 to distinguish between logical and visual storage order. As far as HTML and XML are concerned, the two are identical, since in HTML all documents are stored in logical order. See overview of standards for details.
Test page encoded in Windows-1255.
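
Since ISO-8859-8 and ISO-8859-8-i assign the same characters to the same octets, a program that decodes the test pages can treat the two labels as one codec. The Python sketch below maps the labels by hand. The table `PYTHON_CODECS` and the function `decode_labelled` are hypothetical names, and the hand-written mapping reflects an assumption that Python's codec registry may not recognise the ISO-8859-8-i label by itself.

```python
# Assumed mapping from charset labels (lower-cased) to Python codec names.
PYTHON_CODECS = {
    "iso-8859-8": "iso8859_8",
    "iso-8859-8-i": "iso8859_8",
    "windows-1255": "cp1255",
}


def decode_labelled(data, label):
    """Decode bytes using the codec that corresponds to a charset label."""
    return data.decode(PYTHON_CODECS[label.lower()])


# Shin, lamed, vav, final mem, stored in logical order.
octets = bytes([0xF9, 0xEC, 0xE5, 0xED])
same = decode_labelled(octets, "ISO-8859-8") == decode_labelled(octets, "ISO-8859-8-i")
print(same)
```

The sketch only covers the octet-to-character mapping. Any reordering of visually stored text is outside its scope.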