SGML Declaration 

Contents

  1. The Document Character Set
    1. Data transfer
  2. The SGML Declaration

The Document Character Set 

The HTML 4.0 document character set, in the SGML sense, is the Universal Character Set (UCS) of [ISO10646]. Currently, this is code-by-code identical to the [UNICODE] standard.

Data transfer 

When HTML text is transmitted directly in UCS-2 (charset="UNICODE-1-1"), one must address the question of byte order: does the high-order byte of each two-byte character come first or second? This specification recommends that UCS-2 be transmitted in big-endian byte order (high-order byte first), which corresponds both to the established network byte order for two-byte quantities and to the Unicode ([UNICODE]) recommendation for serialized text data. Furthermore, to maximize the chances of proper interpretation, it is recommended that documents transmitted as UCS-2 always begin with a ZERO WIDTH NO-BREAK SPACE character (hexadecimal FEFF), which, when byte-reversed, becomes FFFE, a code value guaranteed never to be assigned. Thus, a user agent that receives FFFE as the first two octets of a text knows that the bytes must be swapped for the remainder of the text.
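The byte-order check described above can be sketched in Python as follows. This is only an illustration, not part of the specification: the function names `encode_ucs2` and `decode_ucs2` are hypothetical, and because Python has no dedicated UCS-2 codec, the `utf-16-be` and `utf-16-le` codecs are used as stand-ins (they produce the same octets as UCS-2 for text that contains no surrogate code values).

```python
SIGNATURE_BE = b"\xfe\xff"  # U+FEFF serialized high-order byte first
SIGNATURE_LE = b"\xff\xfe"  # the same character with its two bytes swapped


def encode_ucs2(text: str) -> bytes:
    """Serialize text big-endian, prefixed with FEFF as recommended."""
    # Assumption: utf-16-be stands in for UCS-2 (no surrogates in the text).
    return SIGNATURE_BE + text.encode("utf-16-be")


def decode_ucs2(data: bytes) -> str:
    """Decode two-byte text, using a leading FEFF or FFFE to pick the byte order."""
    if data[:2] == SIGNATURE_BE:
        return data[2:].decode("utf-16-be")
    if data[:2] == SIGNATURE_LE:
        # FFFE seen first: every remaining byte pair must be swapped.
        return data[2:].decode("utf-16-le")
    # No signature present: fall back to the recommended big-endian order.
    return data.decode("utf-16-be")
```

A receiver written this way accepts text from senders of either byte order, while a sender using `encode_ucs2` always emits the recommended big-endian form.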

The UTF-1 transformation format of [ISO10646] (registered by IANA as ISO-10646-UTF-1) should not be used.

The SGML Declaration 

   <!SGML  "ISO 8879:1986"
   --
        SGML Declaration for HyperText Markup Language version 4.0

        With support for Unicode UCS-4 and increased limits
        for tag and literal lengths etc.
   --

   CHARSET
            BASESET  "ISO Registration Number 177//CHARSET
                      ISO/IEC 10646-1:1993 UCS-4 with
                      implementation level 3//ESC 2/5 2/15 4/6"
            DESCSET  0   9     UNUSED
                     9   2     9
                     11  2     UNUSED
                     13  1     13
                     14  18    UNUSED
                     32  95    32
                     127 1     UNUSED
                     128 32    UNUSED
                     160 2147483486 160
   --
       In ISO 10646, the positions with hexadecimal
       values 0000D800 - 0000DFFF, used in the UTF-16
       encoding of UCS-4, are reserved, as well as the last
       two code values in each plane of UCS-4, i.e. all
       values of the hexadecimal form xxxxFFFE or xxxxFFFF.
       These code values or the corresponding numeric
       character references must not be included when
       generating a new HTML document, and they should be
       ignored if encountered when processing an HTML
       document.
   --

   CAPACITY        SGMLREF
                   TOTALCAP        150000
                   GRPCAP          150000
                   ENTCAP          150000

   SCOPE    DOCUMENT
   SYNTAX
            SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
              17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
            BASESET  "ISO 646IRV:1991//CHARSET
                      International Reference Version
                      (IRV)//ESC 2/8 4/2"
            DESCSET  0 128 0

            FUNCTION
                     RE            13
                     RS            10
                     SPACE         32
                     TAB SEPCHAR    9

            NAMING   LCNMSTRT ""
                     UCNMSTRT ""
                     LCNMCHAR ".-"  -- ?include "~/_" for URLs? --
                     UCNMCHAR ".-"
                     NAMECASE GENERAL YES
                              ENTITY  NO
            DELIM    GENERAL  SGMLREF
                     SHORTREF SGMLREF
            NAMES    SGMLREF
            QUANTITY SGMLREF
                     ATTSPLEN 65536   -- These are the largest values --
                     LITLEN   65536   -- permitted in the declaration --
                     NAMELEN  65536   -- Avoid fixed limits in actual --
                     PILEN    65536   -- implementations of HTML UA's --
                     TAGLVL   100
                     TAGLEN   65536
                     GRPGTCNT 150
                     GRPCNT   64

   FEATURES
     MINIMIZE
       DATATAG  NO
       OMITTAG  YES
       RANK     NO
       SHORTTAG YES
     LINK
       SIMPLE   NO
       IMPLICIT NO
       EXPLICIT NO
     OTHER
       CONCUR   NO
       SUBDOC   NO
       FORMAL   YES
   >
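
As a rough illustration of how the DESCSET ranges and the reserved-value comment in the declaration above combine, the following Python sketch tests whether a code value may appear in a generated document. The table is copied from the DESCSET lines; the function names `is_declared`, `is_reserved`, and `is_usable` are hypothetical and are not defined by SGML or by this specification.

```python
# Each entry: (first document code, number of codes, base code or None for UNUSED),
# copied from the DESCSET lines of the declaration.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]


def is_declared(code: int) -> bool:
    """True if the DESCSET maps this code to a base-set character."""
    for start, count, base in DESCSET:
        if start <= code < start + count:
            return base is not None
    return False


def is_reserved(code: int) -> bool:
    """True for D800-DFFF and for the last two code values of each plane."""
    return 0xD800 <= code <= 0xDFFF or (code & 0xFFFF) in (0xFFFE, 0xFFFF)


def is_usable(code: int) -> bool:
    """True if the code is declared and not reserved by ISO 10646."""
    return is_declared(code) and not is_reserved(code)
```

For example, `is_usable(0x41)` is true, `is_usable(11)` is false because the DESCSET marks that position UNUSED, and `is_usable(0xD800)` is false because the value is declared but reserved.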