Implementation Notes

Bugs

  • Improperly sequenced hn (for example h1 followed by h3, instead of h2) will result in duplicate text.
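
Until this is fixed, one workaround is to check the source for skipped heading levels before running the stylesheet. The following is a minimal sketch in Python, not part of html2db.xsl. The function name find_heading_skips and the sample document are made up for illustration, and the sketch assumes that the source is well-formed XHTML and that the first heading may start at any level.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Matches h1..h6, with or without the XHTML namespace in ElementTree's "{ns}name" form.
HEADING = re.compile(r"^(?:\{%s\})?h([1-6])$" % re.escape(XHTML_NS))


def find_heading_skips(xhtml_text):
    """Return (previous_level, level, text) for each heading that skips a level."""
    root = ET.fromstring(xhtml_text)
    skips = []
    previous = 0  # 0 means "no heading seen yet"
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # ignore comments and processing instructions, if present
        match = HEADING.match(element.tag)
        if not match:
            continue
        level = int(match.group(1))
        # Assumption: the first heading is accepted at any level.
        if previous != 0 and level > previous + 1:
            text = "".join(element.itertext()).strip()
            skips.append((previous, level, text))
        previous = level
    return skips


# Hypothetical sample: the h3 directly after the h1 is the problem case.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1>Title</h1>
<h3>Skipped a level</h3>
<h2>Fine</h2>
<h3>Also fine</h3>
</body></html>"""

skips = find_heading_skips(SAMPLE)
for previous, level, text in skips:
    print("h%d follows h%d: %s" % (level, previous, text))
```

Moving back up to a higher-level heading (h3 followed by h2) is not reported; only a jump of more than one level downward is.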

Limitations

  • The id attribute is only preserved for certain elements (at least hn, images, paragraphs, and tables). It ought to be preserved for all of them.

  • Only the very simplest table format is implemented.

  • Always uses compact lists.

  • The string matching for <?html2db class="classname"?> requires an exact match (spaces and all).

  • The implicit blocks code is easily confused, as documented in that section. This is easy to fix now that I understand the difference between block and inline elements (I didn't when I was implementing this), but I probably won't do so until I run into the problem again.

Wishlist

  • Allow <?html2db attribute-name="name" value="value"?> at any position, to set arbitrary Docbook attributes on the generated element.

  • Use a different technique from the fake namespace prefix to name Docbook elements in the source, one that preserves the XHTML validity of the source file. For example, add an option to transform <div class="db:footnote"> into <footnote>, or use a processing instruction (<div><?html2db classname="footnote"?>).

  • Parse DC metadata from XHTML html/head/meta.

  • Add an option to use html/head/title instead of html/body/h1[1] for top title.

  • Allow an id on every element.

  • Add an option to translate the XHTML class into a Docbook role.

  • Preserve more of the whitespace from the source document, especially within lists and tables, to make the output document easier to debug.

Design Notes

The Docbook Namespace

html2db.xsl accepts elements in the "Docbook namespace" in XHTML source. This namespace is urn:docbook.

This isn't technically correct. Docbook doesn't really have a namespace, and if it did, it wouldn't be this one. RFC 3151 suggests urn:publicid:-:OASIS:DTD+DocBook+XML+V4.1.2:EN as the Docbook namespace.

There are two problems with the RFC 3151 namespace. First, it's long and hard to remember. Second, it's limited to Docbook v4.1.2, but html2db.xsl works with other versions of Docbook too, which would presumably have other namespaces. I think it's more useful to underspecify the Docbook version in the spec for this tool. Docbook itself underspecifies the version completely, by not using a namespace at all, but when mixing Docbook and XHTML elements I find it useful to be more specific than that.
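
To make the namespace binding concrete, the sketch below parses a small XHTML source that mixes in one element from urn:docbook, and lists the elements that a namespace-aware processor sees in that namespace. It is written in Python rather than XSLT. The db prefix, the helper split_tag, and the sample markup are assumptions chosen for illustration, and the sketch says nothing about which Docbook elements html2db.xsl actually handles.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "urn:docbook"

# Hypothetical source: XHTML by default, with the "db" prefix bound to urn:docbook.
SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:db="urn:docbook">
<body>
<h1>Title</h1>
<p>Body text.<db:footnote><p>A note.</p></db:footnote></p>
</body>
</html>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


root = ET.fromstring(SOURCE)

# Collect the local names of all elements in the urn:docbook namespace.
docbook_names = [
    split_tag(element.tag)[1]
    for element in root.iter()
    if split_tag(element.tag)[0] == DOCBOOK_NS
]
print(docbook_names)
```

Only the namespace URI matters here. The prefix can be anything, as long as it is bound to urn:docbook.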

History

The original version of html2db.xsl was written by Oliver Steele, as part of the Laszlo Systems, Inc. documentation effort. We had a set of custom stylesheets that formatted and added linking information to programming-language elements such as classname and tagname, and added tables of contents to chapter documentation and numbers to examples.

As the documentation set grew, the doc team (John Sundman) requested features such as inter-chapter navigation, callouts, and index and glossary elements. I was able to beat all of these back except for navigation, which seemed critical. After a few days trying to implement this, I decided it would be simpler to convert the subset of XHTML that we used into a subset of Docbook, and use the latter to add navigation. (Once this was done, the other features came for free.)

During my August 2004 "sabbatical", I factored the general html2db code out from the Laszlo-specific code, refactored and otherwise cleaned it up, and wrote this documentation.

Credits

html2db.xsl was written by Oliver Steele, as part of the Laszlo Systems, Inc. documentation effort.