Computer Programs used

at the Sternwarte Bonn


HTML TIDY - Release Notes

Dave Raggett dsr@w3.org

Public Email List for Tidy: <html-tidy@w3.org>

I have set up an archived mailing list devoted to Tidy. To subscribe send an email to html-tidy-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for this list is accessible online.

15th April 1999

Another minor release. Jacob Sparre Andersen reports a bug with &quot; in attribute values. Now fixed. Francisco Guardiola reports problems when a body element follows the frameset end tag. I have fixed this with a patch to ParseHTML, ParseNoFrames and ParseFrameset in parser.c. Chris Nappin wrote in with the suggestion for a config file option for enabling wrapping script attributes within embedded string literals. You can now do this using "wrap-script-strings: yes".

14th April 1999

Added check for Asp tags on line 2674 in parser.c so that Asp tags are not forcibly moved inside an HTML element. My thanks to Stuart Updegrave for this. Fixed problem with & entities. Bede McCall spotted that &amp; was being written out as &amp;amp;. The fix alters ParseEntity() in lexer.c

12th April 1999

Added a missing "else" on line 241 in config.c (thanks to Keith Blakemore-Noble for spotting this). Added config.c and .o to the Makefile (an oversight in the release on the 8th April).

8th April 1999

Localization:

All the message text is now defined in localize.c which should make it a tad easier to localize Tidy for different languages.

Config file support:

I have added support for configuring tidy via a configuration file. The new code is in config.h which provides a table driven parser for RFC822 style headers. The new command line option -config <filename> can be used to identify the config file. The environment variable "HTML_TIDY" may be used to name the config file. If defined, it is parsed before scanning the command line. You are advised to use an absolute path for the variable to avoid problems when running tidy in different directories.
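The "name: value" shape of an RFC822-style config line can be illustrated with a minimal sketch in C. This is not Tidy's table-driven parser; the function name parse_config_line and its buffer-based interface are assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Tidy: split one "name: value" line.
   Returns 1 on success, 0 if there is no colon, the name is empty,
   or a buffer is too small. */
int parse_config_line(const char *line, char *name, size_t nsize,
                      char *value, size_t vsize)
{
    const char *colon = strchr(line, ':');
    const char *start, *end;
    size_t len;

    if (colon == NULL || nsize == 0 || vsize == 0)
        return 0;

    /* Option name: text before the colon, surrounding blanks trimmed. */
    start = line;
    while (start < colon && isspace((unsigned char)*start))
        ++start;
    end = colon;
    while (end > start && isspace((unsigned char)end[-1]))
        --end;
    len = (size_t)(end - start);
    if (len == 0 || len >= nsize)
        return 0;
    memcpy(name, start, len);
    name[len] = '\0';

    /* Option value: text after the colon, surrounding blanks trimmed. */
    start = colon + 1;
    while (isspace((unsigned char)*start))
        ++start;
    end = start + strlen(start);
    while (end > start && isspace((unsigned char)end[-1]))
        --end;
    len = (size_t)(end - start);
    if (len >= vsize)
        return 0;
    memcpy(value, start, len);
    value[len] = '\0';
    return 1;
}
```

Given the line "wrap-script-strings: yes", the sketch yields the name "wrap-script-strings" and the value "yes"; a real parser would then look the name up in its option table.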

Allan Kuchinsky:

Reports that the XML DOM parser by Eduard Derksen screws up on &nbsp;, naked & and % in URLs as well as having problems with newlines after the '=' before attribute values.

I have tweaked PrintChar when generating XML to output &#160; in place of &nbsp; and &amp; in place of &. In general XHTML when parsed as well-formed XML shouldn't use named entities other than those defined in XML 1.0. Note that this isn't a problem if the parser uses the XHTML DTDs which import the entity definitions.
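The XML-safe output described here can be sketched in C as follows. The function xml_print_char is a hypothetical stand-in, not Tidy's PrintChar, and writing every other character above 127 in numeric form is an assumption of the sketch rather than something these notes state.

```c
#include <stdio.h>

/* Hypothetical stand-in for the XML branch of a character printer.
   Writes one character (given as a code point) into buf and returns
   the number of characters written, not counting the terminator. */
int xml_print_char(unsigned int c, char *buf, size_t size)
{
    if (c == '&')
        return snprintf(buf, size, "&amp;");   /* naked ampersand */
    if (c == 160)
        return snprintf(buf, size, "&#160;");  /* non-breaking space */
    if (c > 127)
        return snprintf(buf, size, "&#%u;", c); /* assumption: numeric form */
    return snprintf(buf, size, "%c", (int)c);
}
```

Only entities predeclared by XML 1.0 or numeric references are produced, so the output does not depend on a DTD being read.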

Allan Odgaard:

When tidy encounters entities without a terminating semi-colon (e.g. "&copy") then it correctly outputs "©", but it doesn't report an error.

I have added a ReportEntityError procedure to localize.c and updated ParseEntity to call this for missing semicolons and unknown entities.

Andreas Buchholz:

Tidy warns if a table element is missing the summary attribute. This is incorrect for HTML 3.2 which doesn't define this attribute.

The summary attribute was introduced in HTML 4.0 as an aid for accessibility. I have modified CheckTABLE to suppress the warning when the document type explicitly designates the document as being HTML 2.0 or HTML 3.2.

Andy Brown:

I have renamed the field from class to tag_class, as "class" is a reserved word in C++. The goal is to allow tidy to be compiled as C++, e.g. when part of a larger program.

I have switched to Bool and the values yes and no to avoid problems with detecting which compilers define bool and those that don't.

Andy would prefer a return code or C++ exception rather than an exit. I have removed the calls to exit from pprint.c and used a long jump from FatalError() back to main() followed by returning 2. It should be easy to adapt this to generate a C++ exception.
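The long jump arrangement can be sketched in C along the following lines. All names here (fatal_error, pretty_print, run_tidy) are hypothetical; run_tidy stands in for main() so that the sketch can be called more than once.

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf fatal_jump;   /* where fatal_error() returns to */

/* Hypothetical fatal error routine: reports, then unwinds. Never returns. */
static void fatal_error(const char *msg)
{
    fprintf(stderr, "Fatal error: %s\n", msg);
    longjmp(fatal_jump, 1);
}

/* Stand-in for work deep inside the pretty printer. */
static void pretty_print(int simulate_failure)
{
    if (simulate_failure)
        fatal_error("out of memory");
}

/* Stand-in for main(): returns 2 after a fatal error, otherwise 0. */
int run_tidy(int simulate_failure)
{
    if (setjmp(fatal_jump) != 0)
        return 2;            /* arrived here via longjmp */
    pretty_print(simulate_failure);
    return 0;
}
```

Because the only exit point is the return in the top-level routine, a C++ build could replace the longjmp call with a throw and the setjmp test with a catch block.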

Sometimes the prev links are inconsistent with next links. I have fixed some tree operations which might have caused this. Let me know if any inconsistencies remain.

Ann Navarro:

Would like to be able to use:

   tidy file.html | more

to pause the screen output, and/or full output passing to file as with

   tidy file.html > output.txt

Tidy writes markup to stdout and errors to stderr. 'More' only works for stdout so that the errors fly by. My compromise is to write errors to stdout when the markup is suppressed using the command line option -e or "markup: no" in the config file.

html-kit@chamisplace.com

Writes asking for a single output routine for Tidy. Acting on his suggestion, I have added a new routine tidy_out() which should make it easier to embed HTML Tidy in a GUI application such as HTML-Kit. The new routine is in localize.c. All input takes place via ReadCharFromStream() in tidy.c, excepting command line arguments and the new config file mechanism.

Chami also asks for single routines for initializing and de-initializing Tidy, something that happens often from the GUI environment of HTML-Kit. I have added InitTidy() and DeInitTidy() in tidy.c to try to satisfy this need. Chami now supports an online interface for Tidy at the URL:

   http://www.chamisplace.com/asp/hk.asp

He further asks for Tidy to optionally output a length parameter whenever possible. This could represent the length of the element, attribute or code block related to the error. An online validator could then highlight the starting and ending columns which may be easier for beginners to understand, rather than pointing to a single character column. I will investigate this for a future release.

Chang Hyun Baek:

Reports a problem when generating XML using -iso2022. Tidy inserts ?/p< rather than </p>. I tried Chang's test file but it worked fine with </p> in all the right places. Please let me know if this problem persists.

Christian Ruetgers:

When using the -indent option Tidy emits a newline before </td> which alters the layout of some tables.

I note that browsers aren't conforming to the SGML spec on generally ignoring a newline immediately after start tags and immediately before end tags. Netscape does this for pre elements but not for other tags! My work around is to avoid additional newlines for the content of th and td elements, except where their content starts with a block level element. This kind of thing is getting really hairy!

Christian Pantel:

Would like the servlet tag added to tidy. This looks very similar to applet and is used for preprocessing document content before delivery. Servlet acts as a container for param elements and fallback content to be shown if the server doesn't support servlet. I have added it as a proprietary tag and parse it in the same way as applet.

Christian also reports that <td><hr/></td> caused Tidy to discard the <hr/> element. I have fixed the associated bug in ParseBlock.

Chuck Baslock:

Points out that an isolated & is converted to &amp; in element content and in attribute values. This is in fact correct and in agreement with the recommendations for HTML 2.0 onwards.

Craig Horman:

Reports that Tidy loops indefinitely if a naked LI is found in a table cell. I have patched ParseBlock to fix this, and now successfully deal with naked list items appearing in table cells, clothing them in a ul.

Craig Johnson:

Reports that Tidy gets confused by </comment> before the doctype. This is apparently inserted by some authoring tool or other. I have patched Tidy to safely recover from the unrecognized and unexpected end tag without moving the parse state into the head or body.

Daniel Vogelheim:

Asks for Tidy to recognize obsolete elements such as LISTING and to replace them by more modern equivalents, in this case pre. I have added code to issue a warning and replace such elements as xmp, listing, plaintext by pre, and dir and menu by ul. Daniel also asks for a means of suppressing warnings, i.e. to only report errors. I have added the boolean "show-warnings" to the config file support to deal with this and split off warnings to ReportWarnings().

Dan Rudman:

Would love a version of Tidy written in Java. This is a big job. I am working on a completely new implementation of Tidy, this time using an object-oriented approach but I don't expect to have this done until later this year. DEFERRED

David Brooke:

Reports that when tidying an XML file with characters above 127 Tidy is outputting the numeric entity followed by the character. I have fixed this by a patch to PPrintChar() for XmlTags.

David Getchell:

Reports that Tidy thinks an ol list is HTML 4.0 when you use the type attribute. I have fixed an error in attrs.c so that this attribute is now recorded as first appearing in HTML 3.2.

Drew Adams:

Reported problems when using comments to hide the contents of script elements from ancient browsers. I wasn't able to reproduce the problem, and guess I fixed it earlier.

Drew also reported a problem which on further investigation is caused by the very weird syntax for comments in SGML and XML. The syntax for comments is really error prone:

 <!--[text excluding --]--[[whitespace]*--[text excluding --]--]*>

This means that <!----> is a complete comment but <!------> is not since the parser is expecting a matching terminating -- and as it doesn't find the -- it ploughs on and on treating the rest of the markup as a comment unless it finds another end comment. I have added a rule of thumb (a heuristic) for detecting this situation. Basically I count the number of comment groups without other characters and if the count is > 2 and a '>' is seen, a warning is generated.

Drew goes on to comment on the -clean option. This made me take another look at the relative font sizes I am using for the absolute font sizes for 0 through 6. I have tweaked them to get a reasonable match before/after applying -clean as viewed on NS4 and IE4. Font size=3 is taken as the normal body font size and as such the font element is silently dropped unless it also defines a color.

I have also added InlineStyle to deal with the cases where an inline element has as its only child a font element. A further possibility would be to promote style properties common to all children of an element to the element. I will have to leave this for future work.

Drew asks why </ is not allowed in script content. The answer is that SGML treats </ as delimiting the end of CDATA element content, so that it ends prematurely before the </script> end tag. Browsers tend not to follow the SGML standard in this respect, but Tidy is designed to help you do so.
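To show where an SGML parser would consider script content to end, here is a minimal sketch in C. The function name find_cdata_end is hypothetical, and requiring a letter after "</" follows the HTML 4.0 description of CDATA content rather than anything stated in these notes.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns a pointer to the first "</" that is followed by a letter,
   which is where an SGML parser ends CDATA element content,
   or NULL if the content contains no such sequence. */
const char *find_cdata_end(const char *content)
{
    const char *p;

    for (p = content; *p != '\0'; ++p) {
        /* Short-circuit evaluation keeps p[1] and p[2] within the string. */
        if (p[0] == '<' && p[1] == '/' && isalpha((unsigned char)p[2]))
            return p;
    }
    return NULL;
}
```

For script content such as document.write("<b>hi</b>"); the function points at the </b> inside the string, which is why such content is usually written with the sequence split or escaped, e.g. as <\/b>.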

Guus Goos:

Notes that tidy *.html doesn't work under DOS. This is because DOS unlike Unix doesn't expand names with wildcards to the list of matching file names. This is a right nuisance and one more reason why Linux is gaining popularity. I plan to provide a work around in a future release of Tidy. Are there any free drop-in replacements for the DOS shell that fix this problem?

Jack Horsfield:

Like a number of others would like list items and table cells to be output compactly where possible. I have added a flag to avoid indentation of content to tags.c that avoids further indentation when the content is inline, e.g.

 <ul>
   <li>some text</li>
   <li>
     <p>
        a new paragraph
     </p>
   </li>
 </ul>

This behavior is enabled via "smart-indent: yes" and overrides "indent: no". Use "indent-spaces: 5" to set the number of spaces used for each level of indentation.

Jeff Young:

Has a few suggestions that will make Tidy work with XSL. Thanks, I have incorporated all of them into the new release.

Jelks Cabaniss:

Reports that Tidy thinks the end tag is missing if the script element has no content. I have patched ParseScript to fix this. Jelks also asks for a way to ask Tidy to hide the contents of script and style elements; a way to avoid promoting inline styles with -clean to style rules as a work around for a bug in IE for URLs with relative URLs; finally, a way to avoid empty elements being discarded, especially if they define an ID for scripting. Very reasonable, but I would prefer to leave these to a future release. (This release is big enough right now!).

One thing I can satisfy right away is a mailing list for Tidy. html-tidy@w3.org has been created for discussing Tidy and I have placed the details for subscribing and accessing the Web archive on the Tidy overview page.

Johannes Koch:

Reports that Tidy isn't quite right about when it reports the doctype as inconsistent or not. I have tweaked HTMLVersion() to fix this. Let me know if any further problems arise.

John Tobler:

Wants to know how to get Tidy to preserve his explicit entities, e.g. &quot; and &nbsp;. Currently Tidy interprets all entities as character values and as such has no way to distinguish whether these were derived from entities or not. To help John, with this release you can use "quote-marks: yes" in the config file if you want all " marks to appear as &quot; and "quote-nbsp: yes" if you want non-breaking spaces to be shown as entities. Note that for XML in general &nbsp; is not predeclared, so you should also use "numeric-entities: yes". This doesn't apply to XHTML though.

John also reports that the weirdly complex URLs using the javascript: scheme as used by www.bookmarklets.com can cause Tidy indigestion. I have made Tidy aware of which attributes are using Javascript and disabled the missing quotemark heuristic for these. I have also tweaked the way unknown entities are reported to say that the markup may contain unescaped ampersands.

Mathew Cepl:

Notes that dir and menu are deprecated and not allowed in HTML4 strict. I have updated the entry in the tags table for these two. I also now coerce them automatically to ul when -clean is set.

Maurice Buxton:

Reports that some implementations of gcc don't work with the current compiler directive Tidy uses to avoid duplicate typedefs for uint and ulong. I don't have a truly platform independent solution for this, so you may need to edit platform.h if the code doesn't compile out of the box on your platform.

Osma Ahvenlampi:

Found that Tidy is confused by map elements in the head. Tidy knows that map is only allowed in the body and thinks the author has left out the start tag. Thereafter elements which it knows only belong in the head are moved to the head, so things should work out ok.

Osma also reports having difficulties with non-breaking spaces, but I was unable to reproduce these with the new release of Tidy, so perhaps the problems have been fixed.

Paul Ward:

Reports that Tidy caused Javascript errors when it introduced linebreaks in Javascript attributes. Tidy goes to some efforts to avoid this and I am interested in any reports of further problems with the new release.

Rafi Stern:

Would like Tidy to warn when a tag has an extra quote mark, as in <a href="xxxxxx"">. I have patched ParseAttribute to do this.

Rene Fritz:

Reported a space being inserted at the end of lines when the text is wrapped at the start of hypertext links. This isn't occurring with this release, so I guess the problem was solved a while back. Rene also suggests that Tidy could be used to add and remove metadata and attributes etc. for a group of files, e.g. to add a link to a style sheet or to assert attribution. This sounds like a good idea for work in the future.

Shane McCarron:

Reports that Tidy sometimes wraps text within markup that occurs in the context of a pre element. I am only able to repeat this when the markup wraps within start tags, e.g. between attribute values. This is perfectly legitimate and doesn't affect rendering.

Steven Lobo:

Notes that Tidy doesn't remove entities such as &nbsp; or &copy; which aren't defined by XML 1.0. That is true - these entities are fine if you are using XHTML. If you want to generate generic XML then you need to use the -n option or to set "numeric-entities: yes" in the config file. This will then output all such entities in their numeric form or as direct character values according to the character encoding flags.

Steven Pemberton:

Comments that he would like Tidy to replace naked & in URLs by &amp;. You can now use "quote-ampersands: yes" in the config file to ensure this. Note that this is always done when outputting to XML where naked '&' characters are illegal.

Steven also asks for a way to allow Tidy to proceed after finding unknown elements. The issue is how to parse them, e.g. to treat them as inline or block level elements? The latter would terminate the current paragraph whereas the former would not.

If treated as inline, presumably, unknown tags should be treated specially, for instance, normal inline end tags close the currently open inline element, but this doesn't feel right for unknown tags. What should the content model for unknown tags be - flow? Again it's far from obvious. One way to avoid these difficulties would be to provide a means for authors to declare unknown tags in the config file.

You can now declare new inline and block-level tags in the config file, e.g.:

define-inline-tags: foo, bar
define-blocklevel-tags: blob

The content model for new tags allows for block or inline content. Steven further comments that some authors use ul without an li to indent content. Tidy currently coerces these to wrap the content within an li which alters the rendering. He suggests using blockquote instead. I have done this, and if you use the -clean option at the same time, it gets replaced by a div element with a class and style rule for indenting the content.

Stuart Updegrave:

Would like to be able to coerce attributes to uppercase. I have added support for "uppercase-attributes: yes" for this. Stuart also asks for Tidy to support Microsoft's ASP tags. These are part of Microsoft's server-side scripting model (similar to CGI). I have treated ASP tags in the same way as processing instructions, and they don't affect the version of HTML as they are assumed to have been interpreted before delivery to the client.

Stuart is also interested in having Tidy reading from and writing back to the Windows clipboard. This sounds interesting but I have to leave this to a future release.

Terry Cassidy:

Points out that Tidy doesn't like "top" or "bottom" for the align attribute on the caption element. I have added a new routine to check the align attribute for the caption element and cleaned up the code for checking the document type.

Xavier Plantefeve:

Suggests that I should ensure that the options are self consistent, e.g. if -asxml is set, then this should imply lower case and override any instruction to omit optional end tags. Accordingly, I have introduced a new routine AdjustConfig() that is applied after reading the command line and config files and before tidying any files.
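The kind of consistency pass described here might look like the following sketch in C. The Options structure, its field names and the function adjust_config are hypothetical illustrations, not Tidy's AdjustConfig() or its real option variables.

```c
/* Hypothetical option flags; the names are illustrative only. */
typedef struct {
    int xml_output;      /* set by -asxml */
    int uppercase_tags;  /* write element names in upper case */
    int hide_end_tags;   /* omit optional end tags */
} Options;

/* Make the options self-consistent once all of them have been read
   from the command line and the config file. */
void adjust_config(Options *opt)
{
    if (opt->xml_output) {
        opt->uppercase_tags = 0;  /* XML output implies lower case */
        opt->hide_end_tags = 0;   /* XML needs every end tag */
    }
}
```

Running the pass once, after all sources of options have been read, means the order in which options were given does not matter.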

Xavier wonders whether name attributes should be replaced or supplemented by id attributes when translating HTML anchors to XHTML. This is something I am thinking about for a future release along with supplementing lang attributes by xml:lang attributes.

Zdenek Kabelac:

Asks for headings and paragraphs to be treated specially when other tags are indented. I have dealt with this via the new smart-indent mechanism.

22nd February 1999

Tidy can now fix up XML empty tags for which the attribute values are unquoted, e.g. <br clear=all/>. Care is taken to avoid this being applied to tags with URLs, e.g. <a href=http://acme.com/> where the / is part of the attribute value and doesn't signify an empty tag. Authors are advised to always quote attribute values to avoid such problems!

22nd January 1999

Tidy no longer complains about a missing </tr> before a <tbody>. Added link to a free win32 GUI for tidy.

11th January 1999

Added a link to the OS/2 distribution of Tidy made available by Kaz SHiMZ. No changes to Tidy's source code.

7th January 1999

Fixed bug in ParseBlock that resulted in nested table cells.

Fixed clean.c to add the style property "text-align:" rather than "align:".

Disabled line wrapping within HTML alt, content and value attribute values. Wrapping will still occur when output as XML.

16th December 1998

This release fixes a problem with missing quotemarks in attribute values introduced in the December 14th release. It also fixes problems with parsing tables when the table cells include naked list items and when unexpected end tags are encountered for td and tr cells. Warnings are now generated for unknown entities (those not defined by HTML 4.0). It may be worth thinking about a new option to determine how to handle these, especially for XML.

14th December 1998

Rewrote parser for elements with CDATA content to fix problems with tags in script content.

New pretty printer for XML mode. I have also modified the XML parser to recognize xml:space attributes appropriately. I have yet to add support for CDATA marked sections though.

script and noscript are now allowed in inline content.

To make it easier to drive tidy from scripts, it now returns 2 if any errors are found, 1 if any warnings are found, otherwise it returns 0. Note tidy doesn't generate the cleaned up markup if it finds errors other than warnings.
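The mapping from counts to return value can be written down as a small C sketch. The function name exit_status and the idea of separate error and warning counters are assumptions made for illustration.

```c
/* Map error and warning counts to the process return value:
   2 if any errors were found, 1 if only warnings, otherwise 0. */
int exit_status(unsigned int errors, unsigned int warnings)
{
    if (errors > 0)
        return 2;
    if (warnings > 0)
        return 1;
    return 0;
}
```

A calling script can therefore treat any value of 2 as "no cleaned up markup was produced" and a value of 1 as "markup produced, but worth a look".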

Fixed bug causing the column to be reported incorrectly when there are inline tags early on the same line.

Added -numeric option to force character entities to be written as numeric rather than as named character entities. Hexadecimal character entities are never generated since Netscape 4 doesn't support them.

Entities which aren't part of HTML 4.0 are now passed through unchanged, e.g. &precompiler-entity; This means that an isolated & will be passed through unchanged since there is no way to distinguish this from an unknown entity.

Tidy now detects malformed comments, where something other than whitespace or '--' is found when '>' is expected at the end of a comment.

The <br> tags are now positioned at the start of a blank line to make their presence easier to spot.

The -asxml mode now inserts the appropriate Voyager html namespace on the html element and strips the doctype. The html namespace will be usable for rigorous validation as soon as W3C finishes work on formalizing the definition of document profiles, see: WD-html-in-xml.

13th November 1998 and earlier releases

Fixed bug wherein <style type=text/css> was written out as <style type="text/ss">.

Tidy now handles wrapping of attributes containing JavaScript text strings, inserting the line continuation marker as needed, for instance:

onmouseover="window.status='Mission Statement, \
Our goals and why they matter.'; return true"
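Wrapping of this kind can be sketched in C as below. This is not Tidy's pretty printer: the function name wrap_script_string is hypothetical, only single-quoted literals are tracked, and escaped quotes inside a literal are not handled.

```c
#include <stddef.h>

/* Copy src to dst. Inside a single-quoted string literal, once the
   current line has reached width columns, break after the next space
   by emitting a backslash and a newline (the line continuation).
   Returns 1 on success, 0 if dst is too small. */
int wrap_script_string(const char *src, size_t width,
                       char *dst, size_t dsize)
{
    size_t col = 0, n = 0;
    int in_string = 0;

    if (dsize == 0)
        return 0;
    for (; *src != '\0'; ++src) {
        if (n + 3 >= dsize)   /* room for char, "\\\n" and terminator */
            return 0;
        if (*src == '\'')
            in_string = !in_string;
        dst[n++] = *src;
        ++col;
        if (in_string && *src == ' ' && col >= width) {
            dst[n++] = '\\';
            dst[n++] = '\n';
            col = 0;
        }
    }
    dst[n] = '\0';
    return 1;
}
```

With a width of 30, the attribute value shown above is broken once, after "Statement, ", and the text after the closing quote is left on one line because a continuation marker is only valid inside a string literal.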

You can now set the wrap margin with the -wrap option.

When the output is XML, tidy now ensures the content starts with <?xml version="1.0"?>.

The Document type for HTML 2.0 is now "-//IETF//DTD HTML 2.0//". In previous versions of tidy, it was incorrectly set to "-//W3C//DTD HTML 2.0//".

When using the -clean option isolated FONT elements are now mapped to SPAN elements. Previously these FONT elements were simply dropped.

NOFRAMES now works fine with BODY element in frameset documents.

Future releases may address:


Last update: 17th April, 1999
Jochen M. Braun   (responsible for slight adaptions, not for the content of this document)
 (E-Mail: jbraun@astro.uni-bonn.de)