A brief SGML tutorial

Contents

This section of the document presents introductory information about SGML and its relationship to HTML. It discusses:

HTML syntax: How to write elements, attributes, and comments.
The HTML DTD: How to read the HTML DTD.

About SGML

The Standard Generalized Markup Language (SGML, defined in [ISO8879]) is a language for defining markup languages. HTML is one such "application" of SGML.

An SGML application consists of several parts:

The SGML declaration. The SGML declaration specifies which characters and delimiters may appear in the application.
The document type definition (DTD). The DTD defines the syntax of markup constructs. The DTD may include additional definitions such as numeric and named character entities.
A specification that describes the semantics to be ascribed to the markup. This specification also imposes syntax restrictions that cannot be expressed within the DTD.
Document instances containing data (contents) and markup. Each instance contains a reference to the DTD to be used to interpret it.

The SGML declaration for HTML 4.0 and the DTD for HTML 4.0 are included in this reference manual, along with the entity sets referenced by the DTD.

HTML syntax

In this section, we discuss the syntax of HTML elements, attributes, and comments.

Entities

Character entities are numeric or symbolic names for characters that may be included in an HTML document. They are useful when your authoring tools make it difficult or impossible to enter a character directly, or when you need a character only rarely. You will see character entities throughout this document; they begin with a "&" sign and end with a semicolon (;).

We discuss HTML character entities in detail later in the section on the HTML document character set.
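
As a rough illustration of how character entities map to characters, the following Python sketch decodes a few of them with the standard library function html.unescape. Python is not part of this specification and is used here only for demonstration; the helper name decode_entities is an assumption made for this example, and html.unescape follows a larger (HTML5) entity table than the one this document defines.

```python
import html

# Each sample pairs an entity reference with the character it stands for.
SAMPLES = {
    "&#34;": '"',          # numeric entity for the double quote
    "&#39;": "'",          # numeric entity for the single quote
    "&quot;": '"',         # named entity for the double quote
    "&eacute;": "\u00e9",  # named entity for the "acute e" character
}


def decode_entities(text: str) -> str:
    """Replace numeric and named character entities with their characters."""
    return html.unescape(text)


for entity, character in SAMPLES.items():
    assert decode_entities(entity) == character
```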

Elements

An SGML application defines elements that represent structures or desired behavior. An element typically consists of three parts: a start tag, content, and an end tag.

An element's start tag is written <element-name>, where element-name is the name of the element. An element's end tag is written with a slash before the element name: </element-name>. For example,

<pre>The content of the PRE element is preformatted text.</pre>

The SGML definition of HTML specifies that some HTML elements are not required to have end tags. The definition of each element in the reference manual indicates whether it requires an end tag.

Some HTML elements have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such "empty" elements never have end tags. The definition of each element in the reference manual indicates whether it is empty (has no content) or, if it can have content, what is considered legal content.

Element names are always case-insensitive.

Elements are not tags. Some people refer incorrectly to elements as tags (e.g., "the P tag"). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.

Attributes

Elements may have associated properties, called attributes, to which authors assign values. Attribute/value pairs appear before the final ">" of an element's start tag. Any number of (legal) attribute value pairs, separated by spaces, may appear in an element's start tag. They may appear in any order.

In this example, the align attribute is set for the H1 element:

<H1 align="center">
This is a centered heading thanks to the align attribute
</H1>

By default, SGML requires you to delimit all attribute values using either double quotation marks (") or single quotation marks ('). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. You may also use numeric character entities to represent double quotes (&#34;) and single quotes (&#39;). For double quotes you can also use the named character entity &quot;.

In certain cases, it is possible in HTML to specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We suggest using quotation marks even when it is possible to eliminate them.
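
The quoting rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names may_omit_quotes and quote_attribute_value are invented for this example, and the sketch does not escape "&" characters inside values.

```python
import re

# Characters allowed in an unquoted value: letters, digits, hyphens, periods.
_UNQUOTED_OK = re.compile(r"[A-Za-z0-9.-]+")


def may_omit_quotes(value: str) -> bool:
    """Return True if the value could be written without quotation marks."""
    return _UNQUOTED_OK.fullmatch(value) is not None


def quote_attribute_value(value: str) -> str:
    """Delimit a value, choosing a quote mark that does not occur in it."""
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        return "'" + value + "'"
    # Both kinds of quote occur: fall back to the named entity for '"'.
    return '"' + value.replace('"', "&quot;") + '"'
```

For instance, may_omit_quotes("center") is true while may_omit_quotes("50%") is false, which is one reason to quote every value as suggested above.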

Attribute names are always case-insensitive.

Attribute values are generally case-insensitive. The definition of each attribute in the reference manual indicates whether its value is case-insensitive.
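
To see case-insensitive names in practice, the sketch below feeds markup to Python's standard html.parser.HTMLParser, which reports element and attribute names in lower case and leaves attribute values untouched. The class and function names are assumptions made for this example; the parser is a general-purpose tool, not a validating SGML parser.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag as (element name, attribute dictionary)."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes element and attribute names already lowercased.
        self.start_tags.append((tag, dict(attrs)))


def start_tags(markup: str):
    """Return the start tags found in the given markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


# Upper-case and lower-case spellings of the names are reported identically.
assert start_tags('<H1 ALIGN="center">x</H1>') == start_tags('<h1 align="center">x</h1>')
```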

Note: HTML documents may compress better if you use lower case letters for element and attribute names. The reason is that the compression algorithms do a better job for more frequently repeated patterns, and lower case letters are more frequent than upper case ones.

HTML comments

HTML comments have the following syntax:

    
<!-- this is a comment -->
<!-- and so is this one,
    which occupies more than one line -->

Comments must not be rendered by user agents as part of a document. Similarly, user agents must not render SGML processing instructions (e.g., <?full volume>).
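
A minimal sketch of this rule, using Python's standard html.parser.HTMLParser: comments and processing instructions are routed to their own handlers, so only character data ends up in the text to be rendered. The class name RenderableText and the function name split_markup are assumptions for this example.

```python
from html.parser import HTMLParser


class RenderableText(HTMLParser):
    """Separate character data from comments and processing instructions."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.comments = []
        self.instructions = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_comment(self, data):
        # Receives the text between "<!--" and "-->"; never rendered.
        self.comments.append(data)

    def handle_pi(self, data):
        # Receives the text between "<?" and ">"; never rendered.
        self.instructions.append(data)


def split_markup(markup: str):
    """Return (renderable text, comments, processing instructions)."""
    parser = RenderableText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.comments, parser.instructions
```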

How to read the HTML DTD

This specification presents pertinent fragments of the DTD each time an element or attribute is defined. Though cryptic and dissuasive at first, the DTD fragment gives concise information about an element and its attributes. We have chosen to include the DTD fragments in the specification rather than seek a more approachable, but longer and less precise means of describing an element. While almost all of the definitions include enough English text to make them comprehensible, for those who require definitive information, we complete this specification with a brief tutorial on reading the HTML DTD.

Block level and Inline elements

Certain HTML elements are said to be "block level" while others are "inline" (also known as "text level"). The distinction is founded on several notions:

Content model: Generally, block level elements may contain inline elements and other block level elements. Generally, inline elements may contain only data and other inline elements. Inherent in this structural distinction is the idea that block elements create "larger" structures than inline elements.
Formatting: By default, block level elements are formatted differently than inline elements. Block level elements generally begin on new lines; inline elements generally do not. Block level elements end an unterminated paragraph element. This enables you to omit end-tags for paragraphs in many cases.
Directionality: For technical reasons involving the [UNICODE] bidirectional text algorithm, block level and inline elements differ in how they inherit directionality information. For details, see the section on inheritance of text direction.

Style sheets provide the means to specify the rendering of arbitrary elements, including whether an element is rendered as block or inline. In some cases, such as an inline style for list elements, this may be appropriate, but generally speaking, authors are discouraged from overriding the conventional interpretation of HTML elements in this way.

The alteration of the traditional presentation idioms for block level and inline elements also has an impact on the bidirectional text algorithm. See the section on the effect of style sheets on bidirectionality for more information.

DTD Comments

In DTDs, comments may spread over one or more lines. In the DTD, comments are delimited by a pair of "--" marks, e.g.

<!ELEMENT PARAM - O EMPTY       -- named property value -->

Here, the comment "named property value" explains the use of the PARAM element. DTD comments for HTML do not have normative value.

Entity Definitions

The HTML DTD begins with a series of entity definitions. An entity definition (not to be confused with an SGML entity) defines a kind of macro that may be expanded elsewhere in the DTD. When the macro is referred to by name in the DTD, it is expanded into a string.

An entity definition begins with the keyword <!ENTITY % followed by the entity name, the quoted string the entity expands to, and finally a closing >. The following example defines the string that the %font entity will expand to.

<!ENTITY % font "TT | I | B | U | S | BIG | SMALL">

The string the entity expands to may contain other entity names. These names are expanded recursively. In the following example, the %inline entity is defined to include the %font, %phrase, %special and %formctrl entities.

<!ENTITY % inline "#PCDATA | %font | %phrase | %special | %formctrl">

You will encounter two DTD entities frequently in the HTML DTD: %inline and %block. They are used when the content model includes inline and block level elements respectively.
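
The recursive expansion described above can be sketched as follows. This is an illustration only: the function name expand is invented for this example, and the replacement strings given for phrase, special and formctrl are shortened, hypothetical stand-ins rather than the definitions in the HTML DTD. Only the font and inline strings are taken from the examples above.

```python
import re

# A reference is "%" plus a name, optionally closed by a semicolon.
_REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")

ENTITIES = {
    "font": "TT | I | B | U | S | BIG | SMALL",
    "phrase": "EM | STRONG",       # hypothetical, shortened for the example
    "special": "A | IMG",          # hypothetical, shortened for the example
    "formctrl": "INPUT | SELECT",  # hypothetical, shortened for the example
    "inline": "#PCDATA | %font | %phrase | %special | %formctrl",
}


def expand(text: str, entities: dict, depth: int = 0) -> str:
    """Replace every entity reference in text, expanding nested references."""
    if depth > 20:
        raise ValueError("entity definitions appear to be circular")

    def replace(match):
        name = match.group(1)
        if name not in entities:
            raise KeyError("undefined entity: " + name)
        return expand(entities[name], entities, depth + 1)

    return _REFERENCE.sub(replace, text)
```

With these definitions, expand("%inline", ENTITIES) yields a single flat list of alternatives that begins with "#PCDATA | TT | I | B".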

Element definitions

The bulk of the HTML DTD consists of the definitions of elements and their attributes. The <!ELEMENT> keyword begins an element definition and the > character ends it. Between these are specified:

The element's name.
Whether the element's start and end tags are optional. Two hyphens that appear after the element name mean that the start and end tags are mandatory. One hyphen followed by the letter "O" (not zero) indicates that the end tag can be omitted. A pair of letter "O"s indicates that both the start and end tags can be omitted.
The element's content, if any. The allowed content for an element is called its content model. Elements with no content are called empty elements. Empty elements are defined with the keyword "EMPTY".

In this example:

    <!ELEMENT UL - - (LI)+>

The element being defined is UL.
The two hyphens indicate that both the start tag and the end tag for this element are required.
The content model for this element is defined to be "at least one LI element". We describe content models in detail below.

This example illustrates the definition of an empty element:

    <!ELEMENT IMG - O EMPTY>

The element being defined is IMG.
The hyphen and the following "O" indicate that the end tag can be omitted, but together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted.
The "EMPTY" keyword means the element must not have content.
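
As a reading aid, the three parts of an element definition can be pulled apart mechanically. The Python sketch below is a simplified illustration, not a DTD parser: the names ElementDecl and parse_element_decl are assumptions for this example, and it handles only single declarations of the shape shown in this section.

```python
import re
from typing import NamedTuple


class ElementDecl(NamedTuple):
    name: str
    start_tag_optional: bool
    end_tag_optional: bool
    content_model: str


# DTD comments are delimited by a pair of "--" marks.
_COMMENT = re.compile(r"--.*?--", re.S)
# Name, two tag-omission flags ("-" or "O"), then the content model.
_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([-O])\s+([-O])\s+(.+?)\s*>", re.S)


def parse_element_decl(declaration: str) -> ElementDecl:
    """Split one element definition into its name, flags and content model."""
    match = _DECL.fullmatch(_COMMENT.sub("", declaration).strip())
    if match is None:
        raise ValueError("not a recognized element definition")
    name, start_flag, end_flag, model = match.groups()
    # "O" means the tag may be omitted; "-" means it is required.
    return ElementDecl(name, start_flag == "O", end_flag == "O", model)
```

Applied to the IMG definition above, the result records that the start tag is required, the end tag may be omitted, and the content model is EMPTY.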

Content model definitions

The content model describes what may be contained by an element. Content definitions may include:

The names of allowed or forbidden elements (e.g., the UL element includes instances of the LI element).
DTD entities (e.g., the LABEL element includes instances of the %inline entity).
Document text (indicated by the SGML construct "#PCDATA"). Text may contain numeric and named character entities. Recall that these begin with & and end with a semicolon (e.g., "Herg&eacute;'s adventures of Tintin" includes the named entity for the "acute e" character).

The content model uses the following syntax to define what markup is allowed for the content of the element:

( ... ): Specifies a group.
A | B: Either A or B must occur, but not both.
A , B: A must occur before B.
A & B: A and B must both occur once, but may do so in any order.
A?: A can occur zero or one times.
A*: A can occur zero or more times.
A+: A can occur one or more times.

Here are some examples from the HTML DTD:

<!ELEMENT SELECT - - (OPTION+)>: The SELECT element must contain one or more OPTION elements.
<!ELEMENT DL - - (DT|DD)+>: The DL element must contain one or more DT or DD elements in any order.
<!ELEMENT OPTION - O (#PCDATA)*>: The OPTION element may only contain text and entities, such as &amp;.
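
One way to get a feel for content models is to translate them into regular expressions over the sequence of child element names. The Python sketch below does this under several assumptions: the names model_to_regex and matches are invented for this example, entity references must already be expanded, the "&" connector is not supported, and inclusions and exclusions are ignored.

```python
import re

# A token is a name (optionally starting with "#") or one syntax character.
_TOKEN = re.compile(r"\s*(#?[A-Za-z][A-Za-z0-9.-]*|[()|,&?*+])")


def model_to_regex(model: str):
    """Translate a content model into a regex over space-terminated names."""
    model = model.strip()
    if model.upper() == "EMPTY":
        return re.compile("")  # matches only an empty sequence of children
    parts, position = [], 0
    while position < len(model):
        match = _TOKEN.match(model, position)
        if match is None:
            raise ValueError("unexpected text in content model")
        token, position = match.group(1), match.end()
        if token == "(":
            parts.append("(?:")
        elif token in {")", "|", "?", "*", "+"}:
            parts.append(token)
        elif token == ",":
            continue  # "A , B" is plain concatenation in a regex
        elif token == "&":
            raise NotImplementedError("the & connector is not handled here")
        else:
            parts.append("(?:" + re.escape(token.upper()) + " )")
    return re.compile("".join(parts))


def matches(model: str, children) -> bool:
    """Return True if the child names satisfy the content model."""
    sequence = "".join(name.upper() + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None
```

For example, matches("(DT|DD)+", ["DT", "DD", "DD"]) is true, while matches("(LI)+", []) is false because at least one LI element is required.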

A few HTML elements use an additional SGML feature to exclude certain elements from a content model. Excluded elements are preceded by a hyphen. Explicit exclusions override inclusions.

In this example, the -(A) signifies that the element A cannot be included in another A element (i.e., anchors may not be nested).

   <!ELEMENT A - - (%inline)* -(A)>

Note that the A element is part of the DTD entity %inline, but is excluded explicitly because of -(A).

Similarly, the following element definition for FORM prohibits nested forms:

   <!ELEMENT FORM - - %block -(FORM)>
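
Because an exclusion applies to everything inside the element, not only to its direct children, checking it amounts to looking at all currently open ancestors. The following Python sketch illustrates that idea; the table EXCLUSIONS and the function violates_exclusion are assumptions made for this example and cover only the two definitions shown above.

```python
# Exclusions taken from the two element definitions shown above.
EXCLUSIONS = {
    "A": {"A"},        # -(A): anchors may not be nested
    "FORM": {"FORM"},  # -(FORM): forms may not be nested
}


def violates_exclusion(open_elements, new_element: str) -> bool:
    """Return True if any open ancestor excludes the element about to open."""
    candidate = new_element.upper()
    return any(
        candidate in EXCLUSIONS.get(ancestor.upper(), set())
        for ancestor in open_elements
    )
```

Here open_elements is the list of elements that are open at the point where new_element would start, outermost first.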

Attribute definitions

The <!ATTLIST> keyword begins the definition of attributes that an element may take. It is followed by the name of the element in question and a list of attribute definitions. An attribute definition is a triplet that defines:

The name of an attribute.
The type of the attribute's value or an explicit set of possible values. Values defined explicitly by the DTD are case-insensitive.
Whether the default value of the attribute is implicit (keyword "#IMPLIED"), in which case the default value must be supplied by the user agent (in some cases via inheritance from parent elements); always required (keyword "#REQUIRED"); or fixed to the given value (keyword "#FIXED"). Some attributes explicitly specify a default value for the attribute.

In this example, the name attribute is defined for the MAP element. The attribute is optional for this element.

<!ATTLIST MAP
  name        CDATA     #IMPLIED
  >

The type of values permitted for the attribute is given as CDATA, an SGML data type. CDATA is text that may include character entities.

For more information about "CDATA", "NAME", "ID", and other data types, please consult the section on HTML data types.

The following examples illustrate possible attribute definitions:

rowspan     NUMBER     1         -- number of rows spanned by cell --
http-equiv  NAME       #IMPLIED  -- HTTP response header name  --
id          ID         #IMPLIED  -- document-wide unique id -- 
valign      (top|middle|bottom|baseline) #IMPLIED

The rowspan attribute requires values of type NUMBER. The default value is given explicitly as "1". The optional http-equiv attribute requires values of type NAME. The optional id attribute requires values of type ID. The optional valign attribute is constrained to take values from the set {top, middle, bottom, baseline}.
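
The triplet structure of an attribute definition can be made explicit with a small parser. The Python sketch below is a simplified illustration: the names AttrDef and parse_attr_def are assumptions for this example, and it expects one definition per line in the layout used above.

```python
import re
from typing import NamedTuple, Optional, Tuple


class AttrDef(NamedTuple):
    name: str
    value_type: Optional[str]                 # e.g. "NUMBER", or None
    allowed_values: Optional[Tuple[str, ...]]  # explicit set, or None
    default: str                              # keyword or explicit default


_COMMENT = re.compile(r"--.*?--", re.S)
# Name, then either a parenthesized value set or a type name, then default.
_ATTR = re.compile(r"(\S+)\s+(\([^)]*\)|\S+)\s+(\S+)")


def parse_attr_def(line: str) -> AttrDef:
    """Split one attribute definition into name, type or values, and default."""
    match = _ATTR.fullmatch(_COMMENT.sub("", line).strip())
    if match is None:
        raise ValueError("not a recognized attribute definition")
    name, declared, default = match.groups()
    if declared.startswith("("):
        values = tuple(v.strip() for v in declared[1:-1].split("|"))
        return AttrDef(name, None, values, default)
    return AttrDef(name, declared, None, default)
```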

DTD entities in attribute definitions

Attribute definitions may also include DTD entities.

In this example, we see that the attribute definition list for the LINK element begins with the %attrs entity.

<!ATTLIST LINK
  %attrs;                          -- id, class, style, lang, dir, title --
  href        %URL       #IMPLIED  -- URL for linked resource --
  ...more of the definition...
  >

The %attrs entity expands to:

<!ATTLIST P
  id          ID         #IMPLIED  -- document-wide unique id --
  class       CDATA      #IMPLIED  -- comma list of class values --
  style       CDATA      #IMPLIED  -- associated style info --
  title       CDATA      #IMPLIED  -- advisory title/amplification -- 
  lang        NAME       #IMPLIED  -- [RFC1766] language value --
  dir         (ltr|rtl)  #IMPLIED  -- direction for weak/neutral text --
  align (left|center|right|justify)  #IMPLIED
  >

The %attrs entity has been defined for convenience since these seven attributes are defined for most HTML elements.

Similarly, the DTD defines the %URL entity as expanding into the string CDATA.

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Uniform Resource Locator,
           See [RFC1808] and [RFC1738]
        -->

As this example illustrates, the entity %URL provides readers of the DTD with more information as to the type of data expected for an attribute. Similar entities have been defined for %color, %Content-Type, %Length, %Pixels, etc.

Boolean attributes

Some attributes play the role of boolean variables (e.g., selected). Their appearance in the start tag of an element implies that the value of the attribute is "true". Their absence implies a value of "false".

Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").

This example defines the selected attribute to be a boolean attribute.

selected     (selected)  #IMPLIED

The attribute is set to "true" by appearing in the element's start tag:

<OPTION selected="selected">
...contents...
</OPTION>

Minimized boolean attributes. In HTML, boolean attributes may appear in "minimized form", in which the attribute's value appears alone in the element's start tag. Thus:

<OPTION selected>

instead of

<OPTION selected="selected">

Authors should be aware that many user agents only recognize the minimized form and not the full form.
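
A consumer of HTML therefore has to treat both spellings as "true". The sketch below shows one way to do so with Python's standard html.parser.HTMLParser, which reports a minimized attribute with the value None. The class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser


class _FirstStartTag(HTMLParser):
    """Remember the attributes of the first start tag that is parsed."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            # A minimized attribute arrives as (name, None).
            self.attrs = dict(attrs)


def first_attrs(start_tag: str) -> dict:
    """Return the attributes of the first start tag in the markup."""
    parser = _FirstStartTag()
    parser.feed(start_tag)
    parser.close()
    return parser.attrs or {}


def boolean_attribute_is_set(start_tag: str, name: str) -> bool:
    """True if the attribute appears at all, minimized or in full form."""
    return name.lower() in first_attrs(start_tag)
```

Presence alone decides the result, so the minimized form and the full form give the same answer.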