Design Notes

This section attempts to outline the history of docbook2X and explain the decisions made during its development. [1] I welcome any criticisms and suggestions.

docbook2man-spec.pl

docbook2man (now called docbook2man-spec.pl to distinguish it from the other XML-based tools) is a rewrite of the first tool used to generate man pages for the GGI Project's documentation. However, at the time of the development of the first tool, GGI documentation was written using Linuxdoc. There were a number of solutions:

  1. Hack sgml-tools to support it. However, I was discouraged from this solution since I didn't really like the Linuxdoc DTD at all for its lack of semantic markup.

  2. Write and maintain the man pages separately. Given the fact that most GGI hackers don't have the initiative to write documentation, this solution was not considered feasible.

  3. Pick up the DocBook DTD, which has man-page-like markup.



Obviously, I picked the third option. I tried using Fred Dalrymple's docbook-to-man at first, but it fails on DocBook markup with extra whitespace: the roff output preserves the whitespace even for elements where it should not be significant, and the result looks all wrong. Whitespace collapsing (as well as certain escaping rules) was impossible to implement without hacking the already convoluted C code. [2]

For the first tool, I used Perl with David Megginson's SGMLS module and the included sgmlspl script, which offers a simple syntax for matching element names. Unlike docbook-to-man, transformation with sgmlspl is sequential rather than based on trees. I criticise instant (the C portion of docbook-to-man) for generating an in-memory tree of the source document but having an extremely weak and incapable transformation language, cursing users with the disadvantages of both techniques.

At the time of the rewrite, I also investigated DSSSL and Python (supposedly cleaner than Perl), but gave up because the learning curve was too steep for me and there were no easy-to-use Python modules for processing SGML.

As docbook2man-spec.pl progressed I had to come up with kludges to the following problems:

(Here is an instance of the last problem. This sentence comes after the itemizedlist, but both the list and this sentence are contained in the one DocBook para.)

I also placed an additional constraint on the docbook2X tools: they must generate human-readable man page output. This is useful both for debugging the tools and for other developers who want to make small updates to the documentation without access to the DocBook tools. To achieve this, newlines in the source are preserved so that the output does not consist of excessively long lines, even though other whitespace is collapsed.
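As a rough illustration of that constraint (not the actual Perl code), the following Python sketch collapses runs of spaces and tabs while keeping the source line breaks. The function name `collapse_whitespace` is made up for this example, and dropping blank lines is an assumption based on blank lines being significant in roff input.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs on each line but keep the line breaks."""
    lines = []
    for line in text.splitlines():
        # Squeeze interior runs of blanks and trim both ends of the line.
        squeezed = re.sub(r"[ \t]+", " ", line).strip()
        # Assumption: blank lines are dropped, since they are significant in roff.
        if squeezed:
            lines.append(squeezed)
    return "\n".join(lines)

source = "  The   quick brown \n\n   fox\tjumps  \n"
collapsed = collapse_whitespace(source)
print(collapsed)
```

Because each source line maps to at most one output line, the generated page stays about as readable as the DocBook source it came from.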

For docbook2man I also used the SGMLS::Output module, which sgmlspl stylesheets use by default. While there is nothing wrong with the functions themselves, using them as a base for automatic whitespace collapsing resulted in unclean code. Because any whitespace collapsing code has to keep state on the output stream, implementing whitespace collapsing on top of SGMLS::Output together with output saving (SGMLS::Output allows stackable output streams, ideal for SGML-to-SGML transformation) never worked properly. (texi_xml now supports only limited output saving to guarantee that the output will never get out of sync.)

docbook2texi-spec.pl

Although there was no urgent demand for an Info version of the GGI documentation, a DocBook-to-Texinfo converter was an interesting project to eliminate one of the perceived deficiencies of DocBook.

docbook2texi-spec.pl followed the same structure as docbook2man-spec.pl. However, its sequential transformation meant that it could not easily support some DocBook features like endterm and xreflabel on links.

I pretty much abandoned this work to try tree-based transformation instead. The move to XML tools also occurred because XML::DOM was readily available but there was no SGML equivalent.

docbook2texi

I believed that using a tree transformation would also eliminate the docbook2man kludges above (Texinfo, while more lax than man, is not whitespace-insensitive like most SGML/XML document types). This is partly true: tree-based approaches are allowed to look ahead and collapse the whitespace of a whole text node (called “folding” in the code). Certain elements can have full control over how their children are processed.

In the process of writing docbook2texi, I hacked up the XML::DOM::Map module, which does simple pattern matching on DOM nodes in the spirit of sgmlspl. The transformation was driven by recursive “templates”: Perl hashes that associate a node pattern with the subroutine used to process it.
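The template idea can be sketched in Python with `xml.dom.minidom`. This is only an analogy, not the XML::DOM::Map interface: the dictionary `TEMPLATES`, the handler names, and matching on the element name alone are all assumptions made for the example, and Texinfo escaping is deliberately left out.

```python
from xml.dom import Node, minidom

def apply_templates(node, templates):
    """Process each child of `node` with the handler registered for it."""
    out = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            out.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Fall back to plain recursion when no template matches.
            handler = templates.get(child.tagName, apply_templates)
            out.append(handler(child, templates))
    return "".join(out)

def emphasis(node, templates):
    # Each handler decides how its own children are processed.
    return "@emph{" + apply_templates(node, templates) + "}"

def literal(node, templates):
    return "@code{" + apply_templates(node, templates) + "}"

# Hypothetical template table: element name -> handler.
TEMPLATES = {"emphasis": emphasis, "literal": literal}

doc = minidom.parseString(
    "<para>Run <literal>make</literal> <emphasis>twice</emphasis>.</para>"
)
result = apply_templates(doc.documentElement, TEMPLATES)
print(result)
```

Since every handler calls back into `apply_templates`, an element keeps full control over whether and how its children are transformed.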

docbook2texi transforms directly to Texinfo markup on standard output instead of building a result tree like some SGML/XML stylesheet languages or using functions that wrap Texinfo functionality. I considered the extra layer unnecessarily inefficient and did not see the utility of it, since it cannot possibly hide all of the weirdness of the Texinfo markup language.

This decision proved to be a poor one. Although the later versions of docbook2texi solved the problem better than docbook2texi-spec.pl, its code also turned into a mess with tr_cdata_fold and friends everywhere. In hindsight, not using result trees might be a feasible approach with smaller document types, but certainly not DocBook.

From a stylesheet design standpoint, modifying the source tree during the transformation is generally a bad idea. However, I found that global attribute handling (particularly mapping IDs to Texinfo anchors) is easy to implement by modifying the source tree: traverse the tree once and insert elements into it, and during the actual transformation those elements can be matched independently, without explicit function calls in every template. The main disadvantage is that inserting different elements at arbitrary points may not work, and document validity guarantees are partially lost.
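A minimal sketch of such a pre-pass, again in Python with `xml.dom.minidom` rather than the original Perl: the element name `texi-anchor` and the function `insert_anchors` are hypothetical, chosen only to show the shape of the technique.

```python
from xml.dom import Node, minidom

def insert_anchors(doc, node):
    """Pre-pass: add an anchor child to every element that carries an id."""
    # Copy the child list first, because the tree is modified while walking it.
    for child in list(node.childNodes):
        if child.nodeType == Node.ELEMENT_NODE:
            insert_anchors(doc, child)
    if node.nodeType == Node.ELEMENT_NODE and node.hasAttribute("id"):
        anchor = doc.createElement("texi-anchor")  # hypothetical element name
        anchor.setAttribute("name", node.getAttribute("id"))
        # With no existing children, insertBefore(..., None) simply appends.
        node.insertBefore(anchor, node.firstChild)

doc = minidom.parseString(
    '<chapter id="intro"><title>Intro</title><para id="p1">Text</para></chapter>'
)
insert_anchors(doc, doc.documentElement)
anchor_names = [a.getAttribute("name") for a in doc.getElementsByTagName("texi-anchor")]
print(anchor_names)
```

After this pass an ordinary template for the inserted element can emit the anchor, so no other template needs to know about IDs. The example also shows the cost: the anchor placed before the chapter's title would not be allowed by the DocBook content model.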

docbook2man

Naturally, docbook2man-spec.pl was rewritten using XML::DOM like docbook2texi. It also transformed non-refentry markup, instead of skipping it. In practice, converting whole chapters (for example) does not yield optimal results.

One approach that seems ideal would transform non-refentry markup to refentry markup, and then use specialized tools to transform that to man pages. However, that means the stylesheet has to do extra work to transform from one DocBook content model to another, and given that the refentry model is deliberately limiting to start with, there would likely be information loss.

docbook2texixml and texi_xml

Around the same time I encountered the above problem, I received notice of Mark Burton's dbtotexi XSLT stylesheets. I was skeptical of his approach at first because XSLT was designed for XML-to-XML transformation and not really suited to Texinfo transformation. On the other hand, XSLT stylesheets are easily made modular and in terms of certain features his stylesheets were better than docbook2texi. I was then intrigued by the idea of having a common layer under his stylesheets and my Perl tools exclusively for handling Texinfo escaping and whitespace issues.

I proposed an XML document type for this purpose. When I changed docbook2texi to docbook2texixml to output Texi-XML instead, the code was immediately much easier to understand. Most transformation code can just put XML tags around character data successfully. When texi_xml was written, it was also much easier to see how inline-level elements, block-level elements, whitespace and newlines are supposed to interact. Texinfo idiosyncrasies are unavoidable in the code, but I have strived to keep code size in texi_xml (and man_xml, the equivalent for man page transformations) to an absolute minimum.
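To show why a common final layer simplifies the stylesheets, here is a small Python sketch of a back end that does all Texinfo escaping in one place. It is not texi_xml and does not use the real Texi-XML vocabulary: treating every element name as a Texinfo command name is an assumption made for the example. The only Texinfo fact relied on is that `@`, `{` and `}` must be written as `@@`, `@{` and `@}` in ordinary text.

```python
from xml.dom import Node, minidom

def escape_texinfo(text):
    """Escape the three characters that are special in Texinfo text."""
    # '@' must be handled first so the '@' added for braces is not doubled.
    return text.replace("@", "@@").replace("{", "@{").replace("}", "@}")

def render_inline(node):
    """Render children of `node`, escaping character data along the way."""
    out = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            out.append(escape_texinfo(child.data))
        elif child.nodeType == Node.ELEMENT_NODE:
            # Assumption: the element name doubles as the Texinfo command name.
            out.append("@" + child.tagName + "{" + render_inline(child) + "}")
    return "".join(out)

doc = minidom.parseString(
    "<para>Mail <code>user@example.org</code> about {braces}.</para>"
)
rendered = render_inline(doc.documentElement)
print(rendered)
```

With this split, the code that produces the intermediate XML never has to think about escaping; it only wraps character data in tags.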

Another benefit of Texi-XML is that it enables other stylesheets, customizations of stylesheets, and stylesheets for other document types to transform to Texinfo easily as well.

[The Texi-XML documentation [not yet written] further describes the goals and rationales of Texi-XML.]

docbook2texi-xslt

After completing most of docbook2texixml and texi_xml, I tried to modify Mark Burton's stylesheets to use the Texi-XML document type. Since they use text nodes for both actual text content and pieces of Texinfo markup, modifying them seemed to mean rewriting the whole thing. Instead I looked at the XSL stylesheets by Norman Walsh, who has a reputation for writing good stylesheets. They are quite modular and support almost all of DocBook. I copied his templates and gradually rewrote them to transform to Texi-XML instead. Actually, it probably took the same amount of work as reworking Mark Burton's stylesheets, or even more, but I am satisfied with the results. The stylesheet is well documented, and the Java dependency is hidden from the rest of the stylesheet using some templates.

Notes

[1]

I wish more software projects would take time to describe their design and architecture instead of using superficial marketing-speak. It really helps people determine if the software package is a suitable solution for their problems.

[2]

docbook-to-man with the ANS modifications by David Bolen claims to rectify these problems, but docbook2man had already progressed significantly by the time it became publicly available.