Hebrew characters in XML and XHTML

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Belorussian translation provided by Web hosting rating.
Spanish translation provided by Web hosting hub.

Quick links

For those who don't want to bother with explanations:

What are Hebrew characters?

This is not a simple question. In addition to Hebrew-specific characters, languages that use the Hebrew script also use characters shared with many other languages, such as punctuation marks, digits, and symbols. To conduct a useful test of browser support, one has to make a somewhat arbitrary decision about which characters to include. For example, which currency signs and mathematical symbols should be part of the test?

The tests here include the following characters:

This decision is based on two assumptions:

Organization of the characters

I have divided the characters into groups based on their appearance in legacy encodings, rather than their appearance in the Universal Character Set. This is useful for predicting browser support, because many existing systems support only the characters that belong to certain legacy encodings.

There are six groups. Within some groups, a finer division is introduced based on logical considerations.

There is a complete list of the characters that includes links to the relevant Unicode charts and information on how each character may be encoded in the Hebrew legacy encodings and by XML or XHTML entities.
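
As a rough illustration of the kind of information such a list holds for one character, here is a minimal Python sketch. The function name `describe` is hypothetical, and the octets come from Python's own `iso8859_8` and `cp1255` codec tables, which I assume match the legacy encodings discussed here. It emits a numeric character reference rather than a named entity, since numeric references are valid in both XML and XHTML.

```python
def describe(char):
    """Summarize how one character may be written in each encoding."""
    info = {
        "code_point": f"U+{ord(char):04X}",
        # Numeric character reference, usable in both XML and XHTML.
        "reference": f"&#x{ord(char):X};",
    }
    codecs_to_check = (("ISO-8859-8", "iso8859_8"), ("Windows-1255", "cp1255"))
    for label, codec in codecs_to_check:
        try:
            # Both encodings are single-octet, so the result has one byte.
            info[label] = f"0x{char.encode(codec)[0]:02X}"
        except UnicodeEncodeError:
            info[label] = None  # not representable in this encoding
    return info


if __name__ == "__main__":
    print(describe("\u05d0"))  # HEBREW LETTER ALEF
    print(describe("\u20aa"))  # NEW SHEQEL SIGN
```

A value of `None` means the character has no octet in that encoding and can only appear in such a document as a character reference.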

  1. Controls and basic Latin. This is the part of ASCII that is allowed in XML documents, which excludes the unused controls. It is divided into three subgroups:
    1. Controls allowed in XML
    2. XML markup delimiters. These characters may also appear as regular text.
    3. Other ASCII
  2. Additional Latin. These are all the characters that appear in ISO-8859-1, ISO-8859-8, and Windows-1255, are encoded with the same octet in all three encodings, and are not included in the previous set of characters.
  3. Basic Hebrew. These are all the characters that appear both in ISO-8859-8 and Windows-1255, and are encoded with the same octet in both encodings, and are not included in the previous two sets. They are logically divided into three subgroups:
    1. Hebrew letters. The twenty-two Hebrew letters and the five final letter forms.
    2. Formatting characters. These are bi-directional controls that may assist in ordering characters properly.
    3. Math symbols. These symbols also exist in ISO-8859-1, but are encoded there with a different octet than in the Hebrew legacy encodings.
  4. Additional Hebrew ISO. All the characters in ISO-8859-8 that are not included in the previous three sets. Note that these characters are not included in Windows-1255; thus, contrary to a common myth, Windows-1255 is not an extension of ISO-8859-8.
  5. Additional Hebrew Windows. All the characters in Windows-1255 that are not included in the four previous sets. These are divided into three subgroups:
    1. Points and punctuation. Hebrew points and Hebrew-specific punctuation marks.
    2. Yiddish digraphs.
    3. Latin, symbols, and general punctuation. Some of these, such as quotation marks and the new sheqel sign, are often used in Hebrew.
  6. Additional Hebrew Unicode. All the characters in the Universal Character Set that are identified as Hebrew characters, and are not included in the previous five sets. They are divided into three subgroups:
    1. Points and punctuation.
    2. Cantillation marks. Used mainly in the Hebrew Bible.
    3. Presentation forms. The first subgroup consists of combinations that have an equivalent formed by combining a Hebrew letter or a Yiddish digraph with one or more points. The other three subgroups are glyph variants, divided by language.
      1. Letter-point combinations
      2. Traditional Hebrew
      3. Modern Hebrew
      4. Judeo-Spanish
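
The definitions of groups 2 to 5 above can be approximated mechanically by comparing codec tables octet by octet. The following Python sketch is only that, an approximation: it relies on Python's built-in `latin_1`, `iso8859_8`, and `cp1255` tables, the helper names `decode_table` and `same_octet` are my own, and the resulting sets may differ in detail from the complete character list (for example, Python's ISO-8859-8 table assigns the C1 controls, which end up in the fourth set here).

```python
def decode_table(codec):
    """Map each octet 0x00-0xFF to the character the codec assigns to it."""
    table = {}
    for octet in range(256):
        try:
            table[octet] = bytes([octet]).decode(codec)
        except UnicodeDecodeError:
            pass  # octet is unassigned in this encoding
    return table


def same_octet(*tables):
    """Characters that every table assigns to the same octet."""
    first, rest = tables[0], tables[1:]
    return {ch for octet, ch in first.items()
            if all(t.get(octet) == ch for t in rest)}


latin1 = decode_table("latin_1")
iso = decode_table("iso8859_8")
win = decode_table("cp1255")

# Group 1 is approximated by all of ASCII (unused controls not filtered).
ascii_chars = {chr(i) for i in range(128)}

# Group 2: same octet in all three encodings, outside ASCII.
additional_latin = same_octet(latin1, iso, win) - ascii_chars

# Group 3: same octet in both Hebrew encodings, not already covered.
basic_hebrew = same_octet(iso, win) - ascii_chars - additional_latin

# Group 4: whatever else ISO-8859-8 contains.
additional_iso = (set(iso.values()) - ascii_chars
                  - additional_latin - basic_hebrew)

# Group 5: whatever else Windows-1255 contains.
additional_windows = (set(win.values()) - ascii_chars - additional_latin
                      - basic_hebrew - additional_iso)

if __name__ == "__main__":
    for name, group in (("Additional Latin", additional_latin),
                        ("Basic Hebrew", basic_hebrew),
                        ("Additional Hebrew ISO", additional_iso),
                        ("Additional Hebrew Windows", additional_windows)):
        print(name, sorted(f"U+{ord(c):04X}" for c in group))
```

Note that the multiplication and division signs land in the Basic Hebrew set because the two Hebrew encodings agree on their octets while ISO-8859-1 places them elsewhere.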

The test pages

The test pages are marked up as XHTML 1.0. They are all served with an HTTP header that identifies them as text/html and carries a charset parameter identifying the encoding. In addition, the encoding is identified in the XML declaration at the very beginning of each document.
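
To make the double labelling concrete, here is a minimal Python sketch that produces a page of this general shape together with the matching header value. It is not the code used to generate the actual test pages: the function name `build_test_page`, the Strict doctype, and the page content are all assumptions made for illustration. Characters that the chosen encoding cannot represent are written as numeric character references.

```python
from xml.sax.saxutils import escape


def build_test_page(body_text, encoding):
    """Return (content_type, payload) for a minimal XHTML 1.0 page."""
    document = (
        # The XML declaration names the encoding at the very beginning.
        f'<?xml version="1.0" encoding="{encoding}"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
        '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="he" lang="he">\n'
        '<head><title>Test</title></head>\n'
        # escape() protects the XML markup delimiters &, < and >.
        f'<body><p>{escape(body_text)}</p></body>\n'
        '</html>\n'
    )
    # Unencodable characters become numeric references such as &#8362;.
    payload = document.encode(encoding, errors="xmlcharrefreplace")
    # The HTTP Content-Type header repeats the encoding as a charset.
    content_type = f"text/html; charset={encoding}"
    return content_type, payload


if __name__ == "__main__":
    header, page = build_test_page("\u05d0 \u20aa", "iso-8859-8")
    print("Content-Type:", header)
    print(page.decode("ascii", errors="backslashreplace"))
```

In this sketch the same `encoding` string feeds both the XML declaration and the charset parameter, so the two labels cannot drift apart.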

The tests are limited to displaying individual characters. Not all combinations of letters with points and cantillation marks are included. Also, the tests do not provide any insight into bi-directional support.
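
To illustrate what a letter-point combination is, the short Python sketch below uses the standard `unicodedata` module to show that a precomposed presentation form is canonically equivalent to a letter followed by a point, while a glyph variant has only a compatibility mapping. The helper name `decompose` is hypothetical, and the sketch says nothing about how a browser renders either spelling.

```python
import unicodedata


def decompose(char, form="NFD"):
    """Return the normalized form of char as a list of code points."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize(form, char)]


if __name__ == "__main__":
    # HEBREW LETTER SHIN WITH SHIN DOT: a letter-point combination.
    print(decompose("\ufb2a"))
    # HEBREW LIGATURE YIDDISH YOD YOD PATAH: a digraph plus a point.
    print(decompose("\ufb1f"))
    # HEBREW LETTER WIDE ALEF: a glyph variant, unchanged by NFD...
    print(decompose("\ufb21"))
    # ...but mapped to the plain letter by compatibility normalization.
    print(decompose("\ufb21", "NFKD"))
```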