
Messing about with Unicode, XML, XSL, DSSSL, TeX, Omega, FOP and the rest of the mess.

August, 2002 (And that's important, because next year the situation might be completely different)

(Update of December 2002. Yes, the situation is quite different: FOP is getting quite good, and even generates RTF nowadays. Without any formatting yet, and the PDF has gotten quite ugly, but things are starting to work.)

Boudewijn Rempt

Disclaimer: this is a story without a happy end.

The problem at hand

Let's say you have a source of data that is going to be published. The data comprises text from many, many languages. English. Dutch. Chinese, sure. Malayalam, perhaps. Tibetan, of course. And it contains some pretty weird symbols. Like a schwa -- a topsy-turvy e: ə. In order to subsume all this data in one character encoding, everything is encoded in Unicode: the current standard for multi-lingual, unified text encoding.

You would like to print your text, publish your text on the web, and perhaps also to prepare it for further editing in a word-processing package. And you don't want to lose your Chinese characters, your IPA signs, or your mathematical symbols.

At this point, the temptation is great to use MS Word, and just print to PDF, print to paper, save to HTML. However, generating Word documents from data is a chore. And besides, Word documents are non-portable, Word's HTML conversion is bad, and Word documents can't be used by other applications. No option.

There exists a system, or rather two, meant especially to encode texts that should be presented in many formats, and that should be accessible to other programs. The two systems are SGML and XML. This article concerns itself with using SGML and XML to publish data.

So, in a nutshell: you have a lot of data, encoded in Unicode, that you want to use to generate a document that can be presented on paper and on the web, and that can be used as the basis for further editing.

The solution

XML is big. XML is hot. XML is cool. XML is also the little brother of SGML. Encoding your data in XML means you can use all those cool, free tools around to prepare your data for publication. In this particular case, the choice between XML and SGML only becomes important a little later on. For now, we have to choose a standard to encode our documents.

You will use XML to encode your document logically: making clear that this part of your document is a bibliography, that part an appendix, this a screenshot, and that an ordinary paragraph. You wouldn't use XML to encode that this bit of text is 12-point Verdana, or that that picture should be twelve pixels from the margin. Just the logical structure of your document.
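
As a small, purely illustrative sketch -- the element names follow Docbook conventions, the file name and content are made up -- logical markup looks like this:

# Create a hypothetical test document; note that it records structure,
# not fonts or margins. The schwa is written as a numeric character reference.
# A real Docbook document would also carry a DOCTYPE declaration.
cat > example.xml <<'EOF'
<article>
  <title>A schwa in print: &#x0259;</title>
  <para>This is a paragraph. Nothing here says 12-point Verdana;
  it only says "paragraph".</para>
</article>
EOF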

TEI vs Docbook -- other DTDs

Two viable standards exist: Docbook and TEI. The first is a DTD -- a document type definition -- that contains most of the definitions you'd need to write an O'Reilly book. And that means it contains almost enough definitions to write pretty much any kind of non-fiction book. TEI, on the other hand, is geared towards encoding texts for linguistic purposes. If you want to write a dictionary, or to prepare a scholarly edition of an old Malayalam text, TEI should be your friend.

O'Reilly has published a lot of books; not many critical editions of old Malayalam texts are being published. O'Reilly understands open source; none of the established academic publishers understand open source. That means, inevitably, that there are many more tools available to massage Docbook-encoded texts into publishable documents than there are tools to handle TEI-encoded texts. In fact, I haven't found anything mature that will let me handle TEI-encoded texts. The best I found was Boris Tobotras' TeiTools, which looks good, but isn't as mature as the current Docbook tools, and only works on Linux. So, let's drop TEI for the moment.

Other DTDs exist, but are proprietary, little used, or very small. Not worth spending a lot of time on, unless you're paid by the hour.

Docbook, therefore

Docbook is a mature DTD, and there are mature tools to transform Docbook into something publishable. There are two (or maybe three) routes you can take when working with Docbook: the first is DSSSL, with OpenJade as its processor; the second is XSL, with an XSLT processor plus a formatter for the result. Both are discussed below.

The third option is to write something yourself: use, say, Python, and its XML-handling libraries to massage your text into something else. That would be a lot of work, and most of it would duplicate work that has already been done.

Of the two available tools to transform your Docbook source into something else, DSSSL is the oldest. It uses OpenJade to parse your Docbook sources, which can be encoded in either XML or SGML, and to transform them into HTML, RTF, text or TeX. From TeX you can, of course, go to DVI, PS and PDF. DSSSL is a kind of Scheme (which is a kind of Lisp), which you embed in XML, and which transforms XML into something else: XML, LaTeX or HTML. Or RTF.

DSSSL has never become really popular, probably because of the amount of Lispiness involved, which tends to scare people away, or because the tools never became very mature. More about that maturity later. Anyway, even before DSSSL really came into its own, something new was thought up. XSLT.

XSL is relatively new. It uses XSLT to transform XML to XML. That's right: your original XML, conforming to, say, Docbook, changes into XML that conforms to another DTD. HTML, for instance. Or FO. FO stands for Formatting Objects -- an XML DTD that describes the layout of a page instead of the structure of a document, as Docbook does.
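
For instance -- and this is a sketch, not a recipe: xsltproc is just one of several XSLT processors, and the stylesheet paths below are assumptions that depend on where your distribution installs docbook-xsl -- the transformation step looks like this:

# Docbook -> FO (for print) and Docbook -> HTML, with an XSLT processor
xsltproc /usr/share/xml/docbook/xsl-stylesheets/fo/docbook.xsl example.xml > example.fo
xsltproc /usr/share/xml/docbook/xsl-stylesheets/html/docbook.xsl example.xml > example.html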

If your goal is to produce HTML, then there's not much to choose between the two. OpenJade can handle both SGML and XML, while XSL handles only XML source, but that's about the only difference. The results are comparable, and very useful.

If you want to produce postscript, PDF or RTF, then there's a big difference. Using DSSSL and Jade, you'd first produce a TeX file, which is then converted to DVI, then to postscript, and finally to PDF. Or you can generate RTF directly.
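
A sketch of that route -- the stylesheet path is an assumption, and jadetex may need to be run more than once to resolve cross-references:

# Docbook -> TeX -> DVI -> postscript -> PDF
openjade -t tex -o example.tex -d /usr/share/sgml/docbook/dsssl-stylesheets/print/docbook.dsl example.xml
jadetex example.tex
dvips -o example.ps example.dvi
ps2pdf example.ps example.pdf

# Or let OpenJade write RTF directly
openjade -t rtf -o example.rtf -d /usr/share/sgml/docbook/dsssl-stylesheets/print/docbook.dsl example.xml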

If you use XSL, then you need two separate applications: FOP and JFOR. FOP produces postscript and PDF, without the intervention of TeX, and JFOR produces RTF.
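
Roughly -- and treat both command lines as assumptions, since the exact invocation depends on the version and on how the Java classpath is set up:

# FO -> PDF with FOP, using the wrapper script from the FOP distribution
fop -fo example.fo -pdf example.pdf

# FO -> RTF with JFOR; the jar name and entry point are assumptions
java -jar jfor.jar example.fo example.rtf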

Finally, there's a weird hybrid: PassiveTex. This uses the FO files created by Docbook-XSL to generate TeX - DVI - postscript - PDF.

The theory is fine...

DSSSL

The only serious DSSSL tool is OpenJade. Together with JadeTex, it promises to enable you to publish your documents in all the formats you like.

However, OpenJade was originally designed to work with SGML, not XML. For our purposes -- the publication of Unicode-encoded text -- XML is the more natural choice, because the default encoding of XML is utf-8, a Unicode encoding.

If you handle XML with OpenJade, OpenJade will by default complain bitterly about any character outside the basic ASCII range. That can be avoided by setting the following three environment variables:

SP_CHARSET_FIXED=YES
SP_ENCODING=XML
SGML_CATALOG_FILES=/usr/share/sgml/jade_dsl/xml.soc

This makes sure that sp, the XML-parser OpenJade uses, knows it should expect XML, not the weird SGML encoding of Unicode. It means you can use utf-8 to prepare your texts.
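
In a shell session that might look like this (the stylesheet path is the same hypothetical one as before):

export SP_CHARSET_FIXED=YES
export SP_ENCODING=XML
export SGML_CATALOG_FILES=/usr/share/sgml/jade_dsl/xml.soc
openjade -t tex -o example.tex -d /usr/share/sgml/docbook/dsssl-stylesheets/print/docbook.dsl example.xml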

However! The RTF that OpenJade generates has a header that declares it to be encoded in the ANSI character set. Despite the nice Unicode characters, it won't be usable in your word processor. You might even end up with only half the document.

JadeTex

Furthermore, when you want to generate postscript or PDF, OpenJade will use JadeTex to typeset your text using TeX. TeX cannot handle Unicode decently. The default distribution of JadeTex contains a few definitions that will let you use some characters -- most of latin-1, and a lot of mathematics -- but if you want to use more than that, you will have to extend JadeTex, and install many, many additional TeX packages. Let's take the International Phonetic Alphabet as an example.

A clever person, wise in the ways of computers, will have started wondering whether this hasn't been done before. Yes, it has. A quite comprehensive TeX package to handle Unicode exists: the Unicode Support for LaTeX package, ucs.sty. You only need to locate and install all the special packages with all the special fonts to use utf-8 encoded LaTeX files.

Unfortunately, the output of OpenJade is not utf-8 encoded, and won't work with the Unicode Support for LaTeX package. Besides, locating and installing all the relevant font packages is a real job.

Omega

The aforementioned Unicode package works with ordinary LaTeX, by substituting LaTeX macros for the utf-8 codes. The result is printed output that has all the flexibility of Unicode, without having to unicodify TeX itself.

Unicodifying TeX itself is exactly what the Omega project promised. However, Omega is very much a failure. Its creators have been guided by Principles. They were conscious of the Desirability of Flexibility. They Knew about the Demands of Fine Typesetting of Complex Scripts. The result is that they have created a sixteen-bit version of TeX that can use user-defined transformation tables to transform unicode-encoded TeX sources.

Previous paragraph gibberish? One phrase should have alarmed even the most lay of men and women: user-defined. Yes, that's right. Omega doesn't know anything about scripts, fonts or text. You have to define the transformations between your input and your desired output yourself. You cannot simply feed Omega your source and expect output. If you feed Omega a LaTeX file containing a schwa, it will produce a white hole. For fun, read the discussion on Usenet on this topic.

You cannot run to the manual, because the manual is a very interesting piece of academic prose about the difficulty of the task, but useless for a mere user.

There's some help at hand, though: Bruno Haible has produced the utf8-tex package, currently in version 0.1. If you use that, you can use utf-8 encoded source and produce decent output. Or, you should be able to. I found I couldn't even run the demonstration files without dropping characters. Something is wrong there, and there isn't any documentation. Also, note the version number: 0.1. Released in 1999.

Finally, JadeTex cannot produce something that is useful for Omega either. Take a look in the source: file jadetex.dtx:

% Eventually we want Unicode input working, with Omega.

That means it isn't there yet.

What would be needed to get OpenJade working with Unicode

It's tantalizing. It's so close -- and it is possible to produce something nice with OpenJade. But it's not enough. What's needed is an OpenJade that writes genuinely Unicode-encoded output (including its RTF), a JadeTex that either knows far more Unicode characters or hands the job over to ucs.sty or Omega, and a TeX installation with all the font packages those characters require.

Conclusion: TeX has failed our goals. It can produce beautiful results -- even if by default a TeX document looks like the by-product of a fifties academic mathematics journal from the States rather than a piece of typography -- but it's far too messy to hope that it will ever become usable for multi-lingual document production.

XSL

After DSSSL came XSL -- with XSLT. This technology is definitely not mature, and that's what causes the problems here. Plus, it's certainly a case of re-re-re-re-inventing the wheel. After all, how often have we created something that prints a text in a particular format by now? Why yet another page description language?

Right, docbook-xsl contains by default two modules: one that translates docbook to HTML. That works perfectly, as far as I can see. And one that translates docbook to Formatting Objects. Might also work perfectly, but it is in such a state of flux that the poor applications that work with the Formatting Objects are always lagging behind, and failing miserably.

There's a third-party XSL module, one that transforms Docbook directly to LaTeX: db2latex, but unfortunately, it doesn't work at all with recent versions of Docbook. Nice idea, though.

All use of XSL needs an XSLT processor, such as Xalan (Java) or 4Suite (Python), for the initial transformation to HTML or FO. Then you use something else to transform FO to output.
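
With the two processors mentioned above, that first step might look like this (the driver class, the 4xslt command and the stylesheet path are all from memory; double-check them against your installation):

# Xalan (Java): the command-line driver class
java org.apache.xalan.xslt.Process -IN example.xml -XSL fo/docbook.xsl -OUT example.fo

# 4Suite (Python): the 4xslt command-line tool writes to stdout
4xslt example.xml fo/docbook.xsl > example.fo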

Currently, there are three applications that can handle FO files: FOP, JFOR and PassiveTex.

You can use both FOP and PassiveTex to produce postscript and ultimately PDF. JFOR generates RTF files.

PassiveTex suffers from all the deficiencies of JadeTex, plus an additional deficiency: it's a pig to install. I like installing difficult things for a challenge, but PassiveTex almost defeated me. And then it didn't work correctly. Its one redeeming feature is the fact that it originates with the TEI crowd, so it has presumably been used to format TEI-encoded texts. But you can't develop an application that depends on PassiveTex. That would be cruelty to users.

JFOR is a Java application, and as such quite slow. It's also young, immature and spotty. I haven't seen the output, because I couldn't create one simple docbook file that JFOR could transform to RTF. But, at least, it exists, and the code looks clean.

FOP, finally, is the hope of the Free Software World. It can generate PDF files from moderately complex documents, PDF files that can include Unicode characters. There are two big problems with FOP: it can't handle really long documents, because it tends to run out of memory, and there are quite a few FO constructs it cannot handle at all, like linebreaks in table cells -- or linebreaks at all. And it tries to right-justify text in verbatim environments.
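
The memory problem can at least sometimes be postponed by giving the Java VM a bigger heap; the -Xmx option is standard Java, though the FOP class name below matches the 0.20.x releases and is otherwise an assumption:

# Run FOP's command-line class with a 512 MB heap
# (assumes FOP's jars are already on the CLASSPATH)
java -Xmx512m org.apache.fop.apps.Fop -fo example.fo -pdf example.pdf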

And, since it generates postscript from the ground up, it uses quite a simple font algorithm. That is to say, text is published in a certain font. If that font doesn't contain a certain character, then the character will not be printed. It won't try to find the character in another font it knows about. That's very limiting, but it's also a difficult problem to solve. Trolltech has only now got it almost right in its Qt GUI toolkit. White holes in the text are unacceptable, and using a font that has all the characters but no bold or italic, like Arial Unicode, is equally unacceptable.

What's necessary to fix FO and Unicode

Let's for the moment forget about PassiveTex. Most of my remarks on OpenJade hold for PassiveTex, too. Let's concentrate on JFOR and FOP.

Obviously both tools need to mature a lot. That's first. All of FO must be supported. But I am beset by a nagging suspicion that FO suffers from the second-system effect: too complex to implement completely, and moving forward too quickly.

If FOP includes proper font substitution, and JFOR can handle common Docbook-xsl things like % units, I am sure it will become a good solution for decent document production.

Conclusion

I have failed. During the past three weeks I have investigated ways of producing a generic document, encoded in Unicode, using the full range of Unicode characters, and publishing it in HTML, PDF and RTF.

Only HTML is possible today: PDF is possible to a very, very limited extent, and demands a lot of work from the poor document producer. RTF is a complete loss.

This is a very bad situation. I am not happy. I also do not know what to do now: go on extending JadeTex with new Unicode characters? Fix OpenJade and JadeTex to use ucs.sty? Fix OpenJade and JadeTex to use Omega (which needs fixing, too), and produce a complete unicode-enabled distribution of LaTeX en route? Hack JFOR and FOP? Give up and create LaTeX that uses ucs.sty next to XML? Go to the pub and drown my sorrows?


Resources

In the course of this investigation I have used the following websites, resources and bits and bobs of software.

The projects that caused this quest are:

Docbook, XML, XSL and relations

TeX, LaTeX, Omega and relations

Diverse alarums


Last modified: Sun Dec 22 22:10:57 CET 2002