# Docbook

by Seth Kenlon

Computers do math really well, and that's what they were used for when they were first invented. But it didn't take long for users to repurpose their futuristic calculators into fancy, dynamic typewriters. Now human-readable text drives computing, so choosing the right format for the text you write is an important decision.

Docbook is an XML schema. XML is an extensible markup language a lot like HTML. It's truly ubiquitous, and you may know it from RSS or Atom, the OpenDocument formats of LibreOffice and Apache OpenOffice, Inkscape and the SVG file format, and much more. In fact, it's safe to say that if you own a computer or mobile device, there's XML on it. This is what it looks like in its raw form:

My title goes here Paragraph text goes here.
A section title More paragraph text. Some in italics.
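
As a hedged illustration, here is a minimal sketch of how the sample text above might be marked up. The file name `sample.xml`, the Docbook 5 namespace declaration, and the choice of `article`, `section`, `para`, and `emphasis` tags are assumptions made for this sketch, not necessarily the exact markup of the original sample.

```shell
# Write a small Docbook 5 article to sample.xml (hypothetical file name).
# The tag choices below are assumptions for illustration.
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>My title goes here</title>
  <para>Paragraph text goes here.</para>
  <section>
    <title>A section title</title>
    <para>More paragraph text. Some in <emphasis>italics</emphasis>.</para>
  </section>
</article>
EOF

# Print the raw markup to the terminal.
cat sample.xml
```

Every element is opened and closed explicitly, and the tag names describe what the content is rather than how it should look.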

Docbook itself is easy to learn and easy to write, and it's also one of the most flexible formats available. What other formats, like Markdown and reStructuredText, lack, Docbook provides. And what Docbook doesn't provide is made possible through generic XML.

But why bother learning Docbook in a time when simpler alternatives exist? Why bother with a markup language at all when you can instead impose a little structure on your otherwise plain text and end up with highly portable data that both computers and humans can read?

Settle in. All will be revealed.

## Fail faster

A distinct difference between working in simpler formats and working in Docbook is that when you get something wrong in Docbook, you're told about it. Many other formats, like Markdown and HTML, fail silently. And usually that feels good, because the end result is that your document is rendered. You press the Enter key and your document gets processed by whatever parser or processor it requires for conversion, and you're done. What a great feeling.

The reality of failing silently, though, is that it has still failed. You might have gotten output, and most of it might look just fine, but what about the error that didn't get caught? Maybe the error causes something to render incorrectly, but it's buried on page 42 of a 200-page document. When will you notice? Or maybe the erroneous markup rendered correctly in the web version of your document but incorrectly in the print version.

Docbook, like all XML, is famously strict. If you, for instance, place a `<para>` after you've closed your `<chapter>`, then your document build fails, and it generally fails verbosely. Since Docbook is XML, you can even run your source through xmllint to find errors early.

Experiencing errors is never easy. It's not fun to watch your work fizzle out in a pool of illegal tags and syntax errors instead of building into a beautifully rendered EPUB, web page, or PDF.
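
The strictness described above can be tried out with a small sketch. It assumes `xmllint` (part of the libxml2 tools) is installed, and it uses a hypothetical file name, `broken.xml`. It only checks that the XML is well-formed; validating against the Docbook schema itself additionally requires a schema file passed through an option such as xmllint's `--relaxng`, which is not shown here.

```shell
# Write a deliberately broken file: the <para> is never closed.
cat > broken.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Broken on purpose</title>
  <para>This paragraph is never closed.
</article>
EOF

# xmllint --noout parses the file and prints only errors.
# A non-zero exit status means the document is not well-formed.
if command -v xmllint >/dev/null 2>&1; then
  if xmllint --noout broken.xml; then
    echo "broken.xml is well-formed"
  else
    echo "broken.xml failed the check"
  fi
else
  echo "xmllint not found; install the libxml2 tools to run the check"
fi
```

The error is reported at the moment of checking, with a line number, rather than surfacing later as a rendering glitch.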
To get around that disappointment, most processors accept an option to temporarily ignore errors, such as `--skip-validation`, and there's a significant difference between a fatal ERROR and a mere WARNING, but ultimately failure is important. It identifies imperfections in your source and protects you from unpleasant surprises in your end product.

## Easier than it looks

Docbook sometimes has a reputation for being hard to learn, but I have found that more often it's not Docbook that's difficult, but the unique tool chains people build around it that have the learning curve.

Compared to HTML, Docbook's tags are self-describing. Do you want to write an article or a book? Start with either the `<article>` or `<book>` tag, respectively. Start a new chapter in a book or a new section in an article with `<chapter>` or `<section>`, respectively. Start a paragraph with `<para>`, an ordered list with `<orderedlist>`, enter a list item with `<listitem>`, and so on.

Compared to Markdown and AsciiDoc, Docbook appears complex, but if you consider all the rules that aren't intuitive in structured text, then Docbook's rules don't seem so bad. Learning syntax from the original Markdown "spec" was often a process of trial and error, followed by a series of desperate Internet searches, which meant wading through all the different Markdown flavors and parsers to find the most applicable answer. CommonMark, a project dedicated to defining a more rigorous and strict specification, has helped, but users are often lulled into a false sense of security by how easy it is to learn the basics, only to find that achieving advanced results introduces a surprise learning curve. Luckily, Markdown accepts HTML as a fallback markup option, and there are several tools and Markdown variants out there to make up for what the original spec lacks.
Even so, if you're writing complex documents for several different output targets, it may not be as easy as it looks in all the *learn Markdown in just 15 minutes*-style blurbs.

The flow of logic to learn something new in Docbook tends to be consistently simple:

1. Go to the Docbook site
2. Find an appropriate tag in the master list
3. Refer to the tag's documentation to find out how to correctly use it

That's all there is to it. It's about the same as learning HTML; learn the basics in the first few minutes, and keep a reference handy to learn more as needed. Depending on how much you know about XML, there can be a few surprises, but the Docbook website clearly defines valid parent and child relationships for each and every tag, and each entry for each tag provides big blocks of examples.

## Semantics

Finally, Docbook is important because it provides data about your data. Docbook tags aren't meant to dictate a style over your content, but to classify the information you are trying to convey. As with HTML and CSS, styling Docbook comes later, and it's completely malleable. Docbook tags provide semantic meaning to your words.

Semantics might not seem that important to you now, but here are two great examples of times that metadata became truly important in the real world:

1. Before mobile phones existed, nobody on the Internet would have ever thought that a telephone number would ever need its own markup, such as a `tel:` link. If anything, surely an `<em>` or `<strong>` tag would do. And then mobile phones happened, and people all over the world were browsing the Internet on the same device that they used to make phone calls, and it was a downright inconvenience not to be able to look up a company's phone number and then click on it to make the call.
2. A major phone company in New Zealand had been called Telecom for years. When they rebranded as Spark, throughout their entire online documentation, the word *telecommunication* appeared as *sparkmunication*.
This glitch was live on their website for several days before the obvious find/replace error was noticed and corrected. A better regex would have helped, but it wouldn't have happened at all with Docbook entities or the `<trademark>` tag.

Classifying the information you write is important now, and it becomes more important as technology develops.

## Create your first Docbook document the easy way

Here's a quick and easy way to get started with Docbook. This method emphasizes learning Docbook tags and syntax rather than building a complex and flexible tool chain.

1. First, open a text editor. Use whatever text editor you are most comfortable with, as long as it can save plain text files. All the good ones do: [Gedit](https://wiki.gnome.org/Apps/Gedit), [Geany](https://www.geany.org/Download/Releases), [Kate](https://kate-editor.org/get-it/), [Nano](https://www.nano-editor.org/), [Jove](https://opensource.com/article/17/1/jove-lightweight-alternative-vim), [Emacs](http://gnu.org/software/emacs), [Atom](https://atom.io/), and many others.
2. Open a web browser to [tdg.docbook.org/tdg/5.2](http://tdg.docbook.org/tdg/5.2) for reference.
3. Open another tab in your web browser to [tdg.docbook.org/tdg/5.2/article.html](http://tdg.docbook.org/tdg/5.2/article.html) and scroll to the bottom of the page. Copy the text in the example box and paste it into your text editor.
4. Use the example text as a template, and write something. Some of the example's header is more verbose than you probably need, so in my example I've trimmed off some of the excess.
My first docbook document Seth Kenlon opensource.com 2017
Introduction Introductory text goes here.
Section with a title Main body text goes here.
Conclusion Exciting and inspiring conclusion goes here.
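
A trimmed article along those lines could look like the following sketch. The element choices are assumptions made for illustration (`info`, `author`, `personname`, `affiliation`, `orgname`, and `pubdate` are all real Docbook 5 elements, but the original example may have used different ones). The file name `myDocbook.xml` matches the one used in the rendering commands later in this article.

```shell
# Write a short Docbook 5 article to myDocbook.xml.
# The metadata tags inside <info> are assumptions for illustration.
cat > myDocbook.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" xml:lang="en">
  <info>
    <title>My first docbook document</title>
    <author>
      <personname>
        <firstname>Seth</firstname>
        <surname>Kenlon</surname>
      </personname>
      <affiliation>
        <orgname>opensource.com</orgname>
      </affiliation>
    </author>
    <pubdate>2017</pubdate>
  </info>
  <section>
    <title>Introduction</title>
    <para>Introductory text goes here.</para>
  </section>
  <section>
    <title>Section with a title</title>
    <para>Main body text goes here.</para>
  </section>
  <section>
    <title>Conclusion</title>
    <para>Exciting and inspiring conclusion goes here.</para>
  </section>
</article>
EOF

# Count the sections as a quick sanity check.
grep -c '<section>' myDocbook.xml
```

Each `<section>` carries its own `<title>`, followed by ordinary `<para>` content.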

If you are ever in doubt over whether a tag is required or not, just refer to the tag's documentation. The synopsis section tells you what is required and what is optional. For example, the `<section>` element specifies that one or more title-related elements are required, but that all other tags are optional.

5. Once you've finished writing, it's time to render your document. There are several XML processors available, but the easiest for beginners is [Pandoc](http://pandoc.org/). It's one of those "Swiss army knife" applications that converts almost any kind of text into almost any other kind of text. What makes it especially nice for Docbook is that it has attractive stylesheets by default, while most other processors render very generic output under the assumption that you intend to apply your own XSL stylesheet. There are all kinds of potential targets, but the commands are all basically the same:

        $ pandoc --from docbook --to epub3 --output myDocbook.epub myDocbook.xml
        $ pandoc --from docbook --to markdown --output myDocbook.md myDocbook.xml
        $ pandoc --from docbook --to html --output myDocbook.html myDocbook.xml
        $ pandoc --from docbook --to latex --output myDocbook.pdf myDocbook.xml

And that's all there is to it. The more you write in Docbook, the more tags and attributes you learn, and eventually you'll probably find it hard to go back to a less explicit format.

![ PDF render ](render1.png)

## Advanced Docbook, with style

Pandoc makes Docbook as easy as HTML, but XML is flexible, so if you need to, you can customize how you build your Docbook documents.

The default Docbook render from most processors aside from Pandoc looks a little something like this:

![ Default PDF render ](renderdefault.png){width="6in"}

It's professional, but painfully so. Still, it's an important foundation upon which additional styles can be applied.

### HTML and EPUB output

If your target involves HTML, you can continue to use Pandoc, instructing it to use your custom CSS:

    $ pandoc --from docbook --to html \
      --css=myStyle.css \
      --output myDocbook.html myDocbook.xml
    $ pandoc --from docbook --to epub3 \
      --epub-stylesheet=myStyle.css --epub-cover-image=cover.jpg \
      --epub-embed-font=fonts/foo.ttf --epub-embed-font=fonts/bar.ttf \
      --output myDocbook.epub myDocbook.xml

In Pandoc 2.0 and later, use `--css` in place of `--epub-stylesheet`. The end result is dynamic, lightweight, modern, and as attractive as you make it.

### PDF and print output

Rendering to PDF for either digital distribution or for printing relies either on LaTeX or XSL. I don't know or use LaTeX, so I choose XSL, but if you're a LaTeX user, you can [use Pandoc with custom templates](http://pandoc.org/MANUAL.html#templates). Otherwise, here's a brief introduction to XSL and the [xsltproc](http://xmlsoft.org/XSLT/xsltproc2.html) command.

XSL is the eXtensible Stylesheet Language and is the CSS of the XML world. If you install Docbook from your Linux distribution or from the Docbook web site, then you are installing all the default Docbook stylesheets. These serve as the fallback styles whenever you use a tool like xsltproc or xmlto. If you cannot, or choose not to, install Docbook, you can point to the stylesheets manually in your xsltproc command.

Building a PDF with xsltproc is a two-step process. First, you must generate the `.fo` file, which is a combination of your XML and your XSL, translated into XSL-FO (Formatting Objects) markup. Then you process the `.fo` file with Apache FOP, a Java application that converts Formatting Objects to PDF.

    $ xsltproc --output tmp.fo myDocbook.xml
    $ fop tmp.fo myDocbook.pdf

An easy modification to make when just getting started with styling Docbook is your choice of font. Fonts are easy to change but make a noticeable difference in your end product.

1. The first step in adding to the default style is editing an external stylesheet. For font detection, create a directory called `fonts` in your working directory.
Then create a file called `fonts.xml` and enter this text:

./fonts

This registers all TTF fonts found in the `fonts` directory. Put whatever fonts you want to use in your PDF in that directory.

2. The next step when modifying style is to set your new style option so that your processor knows what it is. There are two ways to make a change to XSL parameters. You can set a parameter dynamically as part of your xsltproc command, or you can make the change in an additional stylesheet. I use both methods, depending on the gravity of the change.

For simple styles that I find myself changing often, I pass parameters as part of my command. That way, I can change them quickly and easily and independently of my custom stylesheets. I can even set them to change based on a Makefile. To set fonts:

    $ xsltproc --stringparam body.font.family "League Gothic" --output tmp.fo myDocbook.xml

A list of valid parameters can be found at [docbook.sourceforge.net/release/xsl/1.77.1/doc/param.html](http://docbook.sourceforge.net/release/xsl/1.77.1/doc/param.html).

To output to PDF, tell FOP to register your fonts with your `fonts.xml` file:

    $ fop -c fonts.xml tmp.fo myDocbook.pdf

### XSL stylesheet

For styles less likely to change depending on printer requirements, page size, or mood, I place rules in a custom XSL template. XSL templates can get very complex, so making minor adjustments and learning over time is a good approach.

A common visual cue in printed books is the idea of an admonition, like a note, tip, or warning, which gets a background color to let the reader know that a topic is separate from the current narrative but still important to the topic. Admonitions are distinct elements in Docbook, so they're relatively simple to style.

The process is similar to styling fonts. First, create a new file called `mystyle.xsl` in your working directory.
Edit it so that it contains this heading:

The `xsl:import` line must point to the stylesheet on your system, whether you have installed it or you are using it from a nonstandard location in your home directory.

In this same file, enter some style rules:

This creates a template in your stylesheet for all elements that match `note`. Whenever the XSL processor finds a `<note>` tag, it drops in the XSL-FO blocks to describe how elements are to be printed (whether the paper is digital or physical).

Apply the styles with xsltproc and output to PDF with fop:

    $ xsltproc --stringparam body.font.family "League Gothic" \
      mystyle.xsl --output tmp.fo \
      myDocbook.xml
    $ fop -c fonts.xml tmp.fo myDocbook.pdf

And the output:

![ Styled PDF render ](note.png)

The syntax is nowhere near as terse or simple as CSS syntax. However, simple styles all follow the same format:

1. Create an `<xsl:template>` block for the tag that you want to affect.
2. Look up the available XSL attributes at [docbook.sourceforge.net/release/xsl/current/doc/fo/index.html](http://docbook.sourceforge.net/release/xsl/current/doc/fo/index.html)
3. Set the attributes you want to apply in a `<fo:block>`.

Like CSS, it takes time and practise to get to know all of your options, but once you get the hang of it, it's simple. More complex XML gets you more complex rules with dependencies, variables, conditionals, and more. For an exhaustive overview, see the definitive [sagehill.net/docbookxsl](http://www.sagehill.net/docbookxsl/) web site.

## Using Docbook

Docbook was invented for tech writers, and many of its tags reflect that. However, I use Docbook for everything, whether it's tech writing, fiction, or [RPG design](http://www.dmsguild.com/product/219635/Adventure-Template-For-Docbook-XML). It's a powerful, industrial-strength system.

This doesn't mean that there's no place in the world for Markdown or org-mode or other text formats.
If I'm writing a `README` file or a short note to myself, Docbook is overkill, because the source document is also meant to be the final delivery format of the document. In other words, where I would historically have used plain text, I use Markdown because Markdown's structure is a vast improvement over unstructured text.

I also use Markdown as an intermediate format. I usually write [opensource.com](http://opensource.com) articles in Docbook and then output to Markdown so that a site editor can easily review and convert my work. Going directly from Docbook to HTML is great if you're running your own site and can govern what tags, classes, and IDs get used, but Markdown serves as an excellent intermediate step when you temporarily want to ignore your source metadata and just deliver the written words.

For everything else, Docbook is a great solution. Give it a try, and you'll never look at word processors, text, or XML the same way again.