The Standard Generalized Markup Language (SGML, defined in [ISO8879]) is a language for defining markup languages. HTML is one such "application" of SGML.
An SGML application consists of several parts:
The SGML declaration for HTML 4.0 and the DTD for HTML 4.0 are included in this reference manual, along with the entity sets referenced by the DTD.
In this section, we discuss the syntax of HTML elements, attributes, and comments.
Character entities are numeric or symbolic names for characters that may be included in an HTML document. They are useful when your authoring tools make it difficult or impossible to enter a character you may not enter often. You will see character entities throughout this document; they begin with a "&" sign and end with a semi-colon (;).
We discuss HTML character entities in detail later in the section on the HTML document character set.
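As an illustration from outside HTML itself, the following Python sketch decodes a few character entities with the standard library function html.unescape. The helper name decode and the sample strings are assumptions chosen for this example; they do not come from this specification.

```python
from html import unescape

# Each sample pairs a character entity with the character it stands for.
SAMPLES = {
    "&lt;": "<",           # symbolic (named) entity
    "&amp;": "&",          # the ampersand itself is written as an entity
    "&#34;": '"',          # numeric (decimal) entity
    "&eacute;": "\u00e9",  # e with acute accent
}


def decode(text: str) -> str:
    """Replace the character entities in text with the characters they denote."""
    return unescape(text)


for reference, character in SAMPLES.items():
    assert decode(reference) == character
```

Every reference starts with "&" and ends with ";", so a decoder can locate references without knowing the full entity set in advance.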
An SGML application defines elements that represent structures or desired behavior. An element typically consists of three parts: a start tag, content, and an end tag.
An element's start tag is written <element-name>, where element-name is the name of the element. An element's end tag is written with a slash before the element name: </element-name>. For example,
<pre>The content of the PRE element is preformatted text.</pre>
The SGML definition of HTML specifies that some HTML elements are not required to have end tags. The definition of each element in the reference manual indicates whether it requires an end tag.
Some HTML elements have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such "empty" elements never have end tags. The definition of each element in the reference manual indicates whether it is empty (has no content) or, if it can have content, what is considered legal content.
Element names are always case-insensitive.
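One way to observe case-insensitive element names is to run markup through a parser that normalizes them. The sketch below uses Python's html.parser.HTMLParser, which reports every element name in lower case; the class name TagNameCollector and the helper tag_names are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record the element name reported for every start tag and end tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the element name already converted to lower case.
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def tag_names(markup):
    """Return (start tag names, end tag names) found in markup."""
    collector = TagNameCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags, collector.end_tags
```

For instance, tag_names("<PRE>a</pre><Br>b") reports the names pre and br regardless of how they were capitalized, and reports no end tag for the empty BR element.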
Elements are not tags. Some people refer incorrectly to elements as tags (e.g., "the P tag"). Remember that the element is one thing, and the tag (be it start or end tag) is another. For instance, the HEAD element is always present, even though both start and end HEAD tags may be missing in the markup.
Elements may have associated properties, called attributes, to which authors assign values. Attribute/value pairs appear before the final ">" of an element's start tag. Any number of (legal) attribute value pairs, separated by spaces, may appear in an element's start tag. They may appear in any order.
In this example, the align attribute is set for the H1 element:
<H1 align="center"> This is a centered heading thanks to the align attribute </H1>
By default, SGML requires you to delimit all attribute values using either double quotation marks (") or single quotation marks ('). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. You may also use numeric character entities to represent double quotes (&#34;) and single quotes (&#39;). For double quotes you can also use the named character entity &quot;.
In certain cases, it is possible in HTML to specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), and periods (ASCII decimal 46). We suggest using quotation marks even when it is possible to eliminate them.
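The character rule above can be expressed as a single pattern. The following Python sketch is only an illustration of that rule; the function name may_omit_quotes is an assumption, and the check treats an empty value as requiring quotes.

```python
import re

# Letters, digits, hyphens, and periods are the only characters allowed
# in an attribute value written without quotation marks.
UNQUOTED_VALUE = re.compile(r"[A-Za-z0-9.-]+")


def may_omit_quotes(value: str) -> bool:
    """Return True if value consists solely of characters allowed unquoted."""
    return UNQUOTED_VALUE.fullmatch(value) is not None
```

Under this rule a value such as center may appear unquoted, while a value containing a space or a slash may not.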
Attribute names are always case-insensitive.
Attribute values are generally case-insensitive. The definition of each attribute in the reference manual indicates whether its value is case-insensitive.
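The quoting and case rules for attributes can be seen by parsing a start tag. This sketch relies on Python's html.parser.HTMLParser, which lower-cases attribute names, strips the delimiting quotes, and replaces character entities in values; AttributeCollector and attributes_of are hypothetical names, and the sample markup is an assumption made for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the (name, value) pairs of every start tag in the input."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples in source order.
        self.attributes.extend(attrs)


def attributes_of(markup):
    """Return all attribute/value pairs found in markup."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.attributes


# Double-quoted, single-quoted, and unquoted values in one start tag.
SAMPLE = "<H1 ALIGN=\"center\" title='say \"hi\"' class=intro>"
```

Parsing SAMPLE yields the names align, title, and class in lower case, while each value keeps the characters the author wrote between the delimiters.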
HTML comments have the following syntax:
<!-- this is a comment --> <!-- and so is this one, which occupies more than one line -->
Comments must not be rendered by user agents as part of a document. Similarly, user agents must not render SGML processing instructions (e.g., <?full volume>).
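A minimal sketch of how a program might separate renderable text from comments and processing instructions is given below. It uses Python's html.parser.HTMLParser, which reports the three kinds of content through separate callbacks; the names MarkupSplitter and split_markup are assumptions introduced for this example, and the sketch makes no claim about how any particular user agent is implemented.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Keep character data apart from comments and processing instructions."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.comments = []
        self.instructions = []

    def handle_data(self, data):
        self.text.append(data)          # content that may be rendered

    def handle_comment(self, data):
        self.comments.append(data)      # text between <!-- and -->

    def handle_pi(self, data):
        self.instructions.append(data)  # text between <? and >


def split_markup(markup):
    """Return (renderable text, comments, processing instructions)."""
    splitter = MarkupSplitter()
    splitter.feed(markup)
    splitter.close()
    return "".join(splitter.text), splitter.comments, splitter.instructions
```

Only the first item of the returned tuple is a candidate for rendering; the comments and processing instructions are available to the program but never reach the rendered text.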
This specification presents pertinent fragments of the DTD each time an element or attribute is defined. Though cryptic and dissuasive at first, the DTD fragment gives concise information about an element and its attributes. We have chosen to include the DTD fragments in the specification rather than seek a more approachable, but longer and less precise means of describing an element. While almost all of the definitions include enough English text to make them comprehensible, for those who require definitive information, we complete this specification with a brief tutorial on reading the HTML DTD.
Certain HTML elements are said to be "block level" while others are "inline" (also known as "text level"). The distinction is founded on several notions:
Style sheets provide the means to specify the rendering of arbitrary elements, including whether an element is rendered as block or inline. In some cases, such as an inline style for list elements, this may be appropriate, but generally speaking, authors are discouraged from overriding the conventional interpretation of HTML elements in this way.
The alteration of the traditional presentation idioms for block level and inline elements also has an impact on the bidirectional text algorithm. See the section on the effect of style sheets on bidirectionality for more information.
In DTDs, comments may spread over one or more lines. In the DTD, comments are delimited by a pair of "--" marks, e.g.
<!ELEMENT PARAM - O EMPTY -- named property value -->
Here, the comment "named property value" explains the use of the PARAM element. DTD comments for HTML have no normative value.
The HTML DTD begins with a series of entity definitions. An entity definition (not to be confused with an SGML entity) defines a kind of macro that may be expanded elsewhere in the DTD. When the macro is referred to by name in the DTD, it is expanded into a string.
An entity definition begins with the keyword <!ENTITY % followed by the entity name, the quoted string the entity expands to, and finally a closing >. The following example defines the string that the %font entity will expand to.
<!ENTITY % font "TT | I | B | U | S | BIG | SMALL">
The string the entity expands to may contain other entity names. These names are expanded recursively. In the following example, the %inline entity is defined to include the %font, %phrase, %special and %formctrl entities.
<!ENTITY % inline "#PCDATA | %font | %phrase | %special | %formctrl">
You will encounter two DTD entities frequently in the HTML DTD: %inline and %block. They are used when the content model includes inline and block level elements respectively.
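The recursive expansion described above can be sketched in a few lines of Python. The definition of %font and the shape of %inline are taken from the examples in this section; the values given for %phrase, %special, and %formctrl are shortened stand-ins assumed for illustration and are not the DTD's actual definitions. The function name expand is likewise hypothetical.

```python
import re

ENTITIES = {
    "font": "TT | I | B | U | S | BIG | SMALL",  # as defined in the example above
    "phrase": "EM | STRONG",                     # assumed, shortened stand-in
    "special": "A | IMG",                        # assumed, shortened stand-in
    "formctrl": "INPUT | SELECT",                # assumed, shortened stand-in
    "inline": "#PCDATA | %font | %phrase | %special | %formctrl",
}

# An entity reference is "%" followed by a name and an optional ";".
REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def expand(text, entities):
    """Replace every %name reference in text, expanding nested references too."""
    def replace(match):
        # Expand the replacement string itself before substituting it.
        return expand(entities[match.group(1)], entities)
    return REFERENCE.sub(replace, text)
```

With these stand-in definitions, expand("%inline", ENTITIES) produces a single string in which no "%" references remain. A reference to an undefined entity raises KeyError in this sketch.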
The bulk of the HTML DTD consists of the definitions of elements and their attributes. The <!ELEMENT> keyword begins an element definition and the > character ends it. Between these are specified:
In this example:
<!ELEMENT UL - - (LI)+>
This example illustrates the definition of an empty element:
<!ELEMENT IMG - O EMPTY>
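To make the parts of an element definition concrete, the following Python sketch splits a declaration into the element name, the two tag indicators, and the content model. It assumes the usual SGML reading of the indicators, in which "-" means the tag is required and "O" means it may be omitted. The function name parse_element_declaration is hypothetical, and the sketch handles only single-element declarations of the form shown in this section.

```python
import re


def parse_element_declaration(declaration):
    """Split an <!ELEMENT ...> declaration into its main parts."""
    text = declaration.strip()
    if not (text.startswith("<!ELEMENT") and text.endswith(">")):
        raise ValueError("not an element declaration")
    body = text[len("<!ELEMENT"):-1]
    # Drop DTD comments, which are delimited by a pair of "--" marks.
    body = re.sub(r"--.*?--", "", body, flags=re.DOTALL).strip()
    name, start, end, content = body.split(None, 3)
    return {
        "name": name,
        "start_tag_required": start == "-",
        "end_tag_required": end == "-",
        "content": content,
        "empty": content == "EMPTY",
    }
```

Applied to the two declarations above, the sketch reports that UL requires both tags and contains (LI)+, while IMG is empty and its end tag is not required.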
The content model describes what may be contained by an element. Content definitions may include:
The content model uses the following syntax to define what markup is allowed for the content of the element:
Here are some examples from the HTML DTD:
A few HTML elements use an additional SGML feature to exclude certain elements from the content model. Excluded elements are preceded by a hyphen. Explicit exclusions override inclusions.
In this example, the -(A) signifies that the element A cannot be included in another A element (i.e., anchors may not be nested).
<!ELEMENT A - - (%inline)* -(A)>
Note that the A element is part of the DTD entity %inline, but is excluded explicitly because of -(A).
Similarly, the following element definition for FORM prohibits nested forms:
<!ELEMENT FORM - - %block -(FORM)>
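The effect of an exclusion can be sketched as a nesting check. The Python example below tracks open elements with html.parser.HTMLParser and reports any element that appears inside an ancestor that excludes it. The table EXCLUSIONS covers only the two cases shown above, the names ExclusionChecker and find_violations are hypothetical, and the sketch assumes that end tags are written explicitly; it is not a validating SGML parser.

```python
from html.parser import HTMLParser

# Excluded descendants per element, limited to the two examples above.
EXCLUSIONS = {"a": {"a"}, "form": {"form"}}


class ExclusionChecker(HTMLParser):
    """Report elements nested inside an ancestor that excludes them."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # An exclusion applies at any depth, so check every open ancestor.
        for ancestor in self.open_elements:
            if tag in EXCLUSIONS.get(ancestor, ()):
                self.violations.append((ancestor, tag))
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching element and anything left open in it.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def find_violations(markup):
    """Return a list of (ancestor, excluded element) pairs found in markup."""
    checker = ExclusionChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations
```

Because the check walks all open ancestors, an anchor nested inside another anchor is reported even when other elements sit between the two.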
The <!ATTLIST> keyword begins the definition of attributes that an element may take. It is followed by the name of the element in question and a list of attribute definitions. An attribute definition is a triplet that defines:
In this example, the name attribute is defined for the MAP element. The attribute is optional for this element.
<!ATTLIST MAP name CDATA #IMPLIED >
The type of values permitted for the attribute is given as CDATA, an SGML data type. CDATA is text that may include character entities.
For more information about "CDATA", "NAME", "ID", and other data types, please consult the section on HTML data types.
The following examples illustrate possible attribute definitions:
rowspan     NUMBER     1         -- number of rows spanned by cell --
http-equiv  NAME       #IMPLIED  -- HTTP response header name --
id          ID         #IMPLIED  -- document-wide unique id --
valign      (top|middle|bottom|baseline) #IMPLIED
The rowspan attribute requires values of type NUMBER. The default value is given explicitly as "1". The optional http-equiv attribute requires values of type NAME. The optional id attribute requires values of type ID. The optional valign attribute is constrained to take values from the set {top, middle, bottom, baseline}.
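The triplet structure of an attribute definition lends itself to a small parser. The Python sketch below reads one definition line, removes the DTD comment, and separates the name, the type, and the default. The function name parse_attribute_definition is hypothetical, and the sketch assumes that an enumerated type is written without spaces, as in the valign example.

```python
import re


def parse_attribute_definition(line):
    """Split one attribute definition into name, type, and default."""
    # Remove the DTD comment, if any, before splitting into three fields.
    text = re.sub(r"--.*?--", "", line).strip()
    name, value_type, default = text.split()
    allowed = None
    if value_type.startswith("(") and value_type.endswith(")"):
        # Enumerated type: the value must be one of the listed tokens.
        allowed = value_type[1:-1].split("|")
    return {
        "name": name,
        "type": value_type,
        "allowed": allowed,
        # Keywords such as #IMPLIED are not default values themselves.
        "default": None if default.startswith("#") else default,
        "optional": default == "#IMPLIED",
    }
```

For the rowspan line this yields the explicit default "1"; for valign it yields the four permitted values and no default.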
Attribute definitions may also include DTD entities.
In this example, we see that the attribute definition list for the LINK element begins with the %attrs entity.
<!ATTLIST LINK
  %attrs;                     -- id, class, style, lang, dir, title --
  href    %URL    #IMPLIED    -- URL for linked resource --
  ...more of the definition...
  >
The %attrs entity expands to:
<!ATTLIST P
  id      ID      #IMPLIED    -- document-wide unique id --
  class   CDATA   #IMPLIED    -- comma list of class values --
  style   CDATA   #IMPLIED    -- associated style info --
  title   CDATA   #IMPLIED    -- advisory title/amplification --
  lang    NAME    #IMPLIED    -- [RFC1766] language value --
  dir     (ltr|rtl) #IMPLIED  -- direction for weak/neutral text --
  align   (left|center|right|justify) #IMPLIED
  >
The %attrs entity has been defined for convenience since these seven attributes are defined for most HTML elements.
Similarly, the DTD defines the %URL entity as expanding into the string CDATA.
<!ENTITY % URL "CDATA" -- The term URL means a CDATA attribute whose value is a Uniform Resource Locator, See [RFC1808] and [RFC1738] -->
As this example illustrates, the entity %URL provides readers of the DTD with more information as to the type of data expected for an attribute. Similar entities have been defined for %color, %Content-Type, %Length, %Pixels, etc.
Some attributes play the role of boolean variables (e.g., selected). Their appearance in the start tag of an element implies that the value of the attribute is "true". Their absence implies a value of "false".
Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").
This example defines the selected attribute to be a boolean attribute.
selected (selected) #IMPLIED -- reduced interitem spacing --
The attribute is set to "true" by appearing in the element's start tag:
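<OPTION selected="selected"> ...contents... </OPTION>
In HTML, boolean attributes may appear in minimized form, in which the attribute's value appears alone in the element's start tag. Thus, selected may be set by writing: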
<OPTION selected>
instead of
<OPTION selected="selected">
Authors should be aware that many user agents only recognize the minimized form and not the full form.
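A program that reads HTML can treat both spellings of a boolean attribute alike by testing for the attribute's presence rather than its value. The Python sketch below does this with html.parser.HTMLParser, which reports a minimized attribute with the value None; the names OptionCollector, option_attributes, and is_selected are assumptions introduced for this example.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the attributes of every OPTION start tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.options.append(dict(attrs))


def option_attributes(markup):
    """Return one attribute dictionary per OPTION element in markup."""
    collector = OptionCollector()
    collector.feed(markup)
    collector.close()
    return collector.options


def is_selected(attributes):
    """A boolean attribute is true when present, whatever its value."""
    return "selected" in attributes
```

With this test, <OPTION selected> and <OPTION selected="selected"> are both treated as selected, while an OPTION without the attribute is not.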