
XML: What Do Help Authors Need to Know? Part 1

By Scott Boggan


This article contains links to sample pages that require an XML viewer, such as Microsoft Internet Explorer 5. Screen captures of the sample pages are also shown for the benefit of those readers who do not have an XML viewer.

It has received more hype than the return of Star Wars and Austin Powers combined. If you believe what you hear, XML—or Extensible Markup Language—will suddenly make the Web incredibly fast and oh-so-easy to use. XML has been called the ASCII of the future, an Esperanto for the computing world. Like most hype-ridden things, XML is very difficult to comprehend; ask 10 Web jockeys to explain it and you'll get 10 different stories.

But if you sort through the hype, you'll see that XML has great promise for hypertext authors, especially for use in structured documents such as online Help. This two-part article will give you an overview of XML and its helper technologies, Extensible Stylesheet Language (XSL) and XLink, and predict what this exciting new area holds for Help authors.

This article contains samples that will require an XML viewer. I've focused on the XML implementation in Microsoft Internet Explorer 5, which is available at www.microsoft.com/windows/ie/default.htm. If you don't have IE5, see www.xmlsoftware.com/browsers/ for a list of alternative browsers.

What is XML?

XML is a set of rules that let you create custom tags describing the meaning and structure of a document. Such tags are often called "metadata," or data about data.

To help put this in perspective, it's useful to think about how we use markup languages. As Help authors, most of us are very familiar with two prominent examples—HTML and RTF—both of which use markup tags to format a document. A review of HTML's 80-odd tags reveals that with but a few exceptions (<code> and <address> are two obvious examples) they all control document presentation.

Let's look at a simple HTML document that lists a book. HTML tags such as <h1> and <b> describe the appearance of the document, but don't tell us anything about what kind of data it contains.

<!DOCTYPE HTML PUBLIC 
    "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
        <title>A Book</title>
</head>

<body>
<h1>Book</h1>
<p><b>Title:</b>
 The Autobiography of Benjamin Franklin <br> 
<b>Author:</b> Benjamin Franklin<br> 
<b>Price:</b> $8.99<br> 
</body>
</html>

HTML is known as a specific markup language because it was developed for use with a specific processor: a Web browser. Likewise, RTF was specifically designed to format text in a word processor. Markup languages that format a document for a specific reader are fine in some cases, but are not very flexible. A big problem is that re-using the document on another system requires you to convert it and sometimes perform manual cleanup.

Another problem with specific markup languages is that their tag set is not extensible. In the case of HTML, this limitation forces authors to find a workaround or wait for a standards body like the World Wide Web Consortium (W3C) to invent a new tag. Consider an example from the world of Microsoft HTML Help: wouldn't it be easier for authors if there were a simple HTML tag for adding an A-keyword? Instead, Microsoft implemented A-keywords using awkward ActiveX <OBJECT> code; in XML, it will be easy to define a new tag called <a-keyword>.

Finally, specific markup languages make it difficult for software developers to write programs that process data. We humans quickly recognize our HTML document as a book, but a computer program won't have any idea what it describes. Once our data is in XML, Help applications can do a much better job of delivering our content. This will not only provide the user with more targeted information but also simplify the authoring process.

Where specific markup languages describe the presentation of a document, generalized markup languages use tags to describe a document's structure or meaning. The most popular example is the Standard Generalized Markup Language, or SGML. SGML is powerful but complicated, so in 1996 Jon Bosak of Sun Microsystems formed a W3C working group to marry the flexibility of SGML with the simplicity of HTML; in effect, to create an SGML "lite."

Now let's look at our book listing example in a generalized markup language. We won't get into the syntax details just yet, but here's what it might look like in XML.

<?xml version="1.0"?>
<BOOK>
      <TITLE>The Autobiography 
          of Benjamin Franklin</TITLE>
      <AUTHOR>
         <FIRST-NAME>Benjamin</FIRST-NAME>
         <LAST-NAME>Franklin</LAST-NAME>
      </AUTHOR>
      <PRICE>8.99</PRICE>
</BOOK>

Because the tags describe the meaning (or "semantics") of the data, any application equipped with an XML parser can read this document and pick out the title, author, and price. This ability to share data in a standard format will enable the next generation of Web-based applications. And unlike most file formats, even humans can read it! No wonder that in just a short time XML has captured the attention of so many; it is now a W3C "recommendation" (see www.w3.org/xml).
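To make the point about programs reading the data concrete, here is a minimal sketch in Python (a language this article does not otherwise use), relying only on the standard library's xml.etree.ElementTree parser. The function name describe_book is my own invention for illustration; the XML is the book listing shown above.

```python
import xml.etree.ElementTree as ET

# The book listing from the example above, held in a string for convenience.
BOOK_XML = """<?xml version="1.0"?>
<BOOK>
      <TITLE>The Autobiography
          of Benjamin Franklin</TITLE>
      <AUTHOR>
         <FIRST-NAME>Benjamin</FIRST-NAME>
         <LAST-NAME>Franklin</LAST-NAME>
      </AUTHOR>
      <PRICE>8.99</PRICE>
</BOOK>"""


def describe_book(xml_text):
    """Pull the title, author, and price out of a BOOK document (hypothetical helper)."""
    book = ET.fromstring(xml_text)
    # The title spans two lines in the source, so collapse the whitespace.
    title = " ".join(book.findtext("TITLE").split())
    first = book.findtext("AUTHOR/FIRST-NAME")
    last = book.findtext("AUTHOR/LAST-NAME")
    price = float(book.findtext("PRICE"))
    return {"title": title, "author": f"{first} {last}", "price": price}


print(describe_book(BOOK_XML))
```

Notice that the program asks for data by name (TITLE, PRICE) rather than guessing from formatting, which is exactly what the HTML version of the listing would have forced it to do.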

Domain-Specific Markup Languages

Because it allows authors to create their own tags, XML has spawned a variety of "domain-specific" markup languages: tag sets that are unique to a particular profession. Before considering how XML might be used in Help, let's look at how other industries are using it.

  • Open Financial Exchange (OFX) is a data interchange language used to describe financial transactions. For example, both Intuit Quicken and Microsoft Money use OFX to exchange information with a bank.
  • Mathematical Markup Language (MathML) provides a framework for including equations and formulas in Web pages, something mathematicians have long needed.
  • Microsoft's Channel Definition Format (CDF) is used to define channels that automatically transmit information to the desktop. This so-called "push" technology essentially provides a way to broadcast Web data.
  • Synchronized Multimedia Integration Language (SMIL), pronounced "smile," provides a way to produce TV-style multimedia for the Web. SMIL documents don't actually contain multimedia (such as video and sound), but instead tell a Web browser how to sequence and synchronize the display of text and multimedia elements.
  • Resource Description Framework (RDF) is a general application that may prove to be one of the most important uses for XML. RDF essentially provides a way to store metadata, including sitemaps, content ratings, and the data that search engines collect (Web crawling). For Help authors, one of its most interesting features is its ability to describe collections of pages that represent a single logical document.

Potential Uses for XML in Help

As you can see, plenty of other industries have latched on to XML as an answer to their publishing needs. How might we use XML in Help? Here are five possibilities I've dreamed up; certainly not all of these ideas will bear fruit, but perhaps a few will get your wheels turning.

  1. User profiling. An important part of building smarter documents is to tailor them to the reader's needs. Using XML, you could mark individual sections of a Help topic based on different criteria—whether experience level, job description, or product configuration—and allow the user to filter the Help system to display only information that is pertinent to his or her needs. This concept is nothing new: it is part of SGML and is also the idea behind Microsoft's half-baked HTML Help "information types."
  2. Improved searching. Search engines are not very discriminating: they return all topics containing the specified text. For instance, a search for the word "printing" or for "document.write" will more often than not produce a very long list of topics. XML (and especially RDF) allows search engines to do a much better job of retrieving documents, so that a user could search for all procedures describing printing, or find all code samples containing the document.write method. This kind of structured query should bring data searching to the masses.
  3. Advanced linking. We can already create "one to many" links in HTML Help using ALinks—XML does that plus a whole lot more. For example, XLL (Extensible Linking Language) supports bi-directional links: a single link that allows the user to choose forward or back. This is much like WinHelp style browse sequences, only more powerful and much easier to author. Another innovation is extensible links, which might allow an author to link to "the first four paragraphs of topic A" or to "an image with a caption named 'Marsupial.'" This will allow authors to more easily re-use content and provide a much richer user experience.
  4. Topic templates. Another potential use for XML is to define topic templates. Embedding XML tags in Help topics could be used to instruct a compiler or an authoring tool to perform certain actions. For example, metadata could be used to indicate that the current topic is a procedure and should always appear in a small window, or that "show me" topics should automatically run a script when displayed. This could save authors from repetitiously coding redundant information in each topic.
  5. A markup language for Help. Defining a set of tags for Help—perhaps called HelpML?—would go a long way toward standardizing the navigation formats and tag extensions used in HTML-based Help. As a start, HelpML could easily define a standard format for TOC entries and search keywords—a replacement for Microsoft's "sitemap" format. It could also define a standard <help> tag for providing context-sensitive Help on standard Web pages.

    A Help markup language might even be expanded to define a tag set for standard Help elements, such as section title, subtitle, conceptual overview, steps, tips and notes, and code samples. A universal Help markup may not be practical, but a documentation group could easily define its own tag set and use it to enforce internal standards. This would be an online equivalent to an ongoing initiative called DocBook (www.oasis-open.org/docbook/intro.html). DocBook is an SGML standard for producing printed computer hardware and software documentation, but plans are underway to produce an XML version as well.

    An XML standard for Help would greatly improve our ability to create content from a single source and easily publish it on any XML-enabled platform—from HTML Help to JavaHelp, standard Web, print, or PDF—without performing any conversion. WinWriters has organized a group from the Help authoring community that is working on standards for Web-based documentation; for more information, visit www.help4web.org/index.html.
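As a rough illustration of the user profiling idea in item 1, here is a sketch in Python using the standard library's xml.etree.ElementTree. The TOPIC and SECTION tags, the AUDIENCE attribute, and the sections_for function are all invented for this example; no Help format defines them.

```python
import xml.etree.ElementTree as ET

# Hypothetical Help topic: TOPIC, SECTION, and AUDIENCE are made-up names.
TOPIC_XML = """<TOPIC TITLE="Printing a document">
   <SECTION AUDIENCE="all">Choose Print from the File menu.</SECTION>
   <SECTION AUDIENCE="novice">The File menu is at the top left of the window.</SECTION>
   <SECTION AUDIENCE="expert">Press CTRL+P to skip the menu.</SECTION>
</TOPIC>"""


def sections_for(xml_text, audience):
    """Return the text of the sections meant for everyone or for the given audience."""
    topic = ET.fromstring(xml_text)
    return [
        section.text
        for section in topic.findall("SECTION")
        if section.get("AUDIENCE") in ("all", audience)
    ]


for line in sections_for(TOPIC_XML, "expert"):
    print(line)
```

The same attribute test could just as easily drive the search scenario in item 2: a search tool could restrict its results to sections carrying a particular label instead of matching raw text.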

What Does XML Look Like?

Let's look at an expanded version of our XML book sample. Once again, notice that each tag describes the data it contains; for example, <LAST-NAME> is the author's last name.

<?xml version="1.0"?>
<?xml-stylesheet href="books2.xsl"
    type="text/xsl" ?>

<!-- This file represents a fragment 
    of a book store inventory database -->
<BOOKSTORE>
   <BOOK GENRE="autobiography">
      <TITLE>The Autobiography
          of Benjamin Franklin</TITLE>
      <AUTHOR>
         <FIRST-NAME>Benjamin</FIRST-NAME>
         <LAST-NAME>Franklin</LAST-NAME>
      </AUTHOR>
      <PRICE>8.99</PRICE>
   </BOOK>
   <BOOK GENRE="novel">
      <TITLE>The Confidence Man</TITLE>
      <AUTHOR>
         <FIRST-NAME>Herman</FIRST-NAME>
         <LAST-NAME>Melville</LAST-NAME>
      </AUTHOR>
      <PRICE>11.99</PRICE>
   </BOOK>
   <BOOK GENRE="philosophy">
      <TITLE>The Gorgias</TITLE>
      <AUTHOR>
         <NAME>Plato</NAME>
      </AUTHOR>
      <PRICE>9.99</PRICE>
   </BOOK>
</BOOKSTORE>

A programmer might look at our XML document and think of it as a tree. The root element is the BOOKSTORE element, which contains three BOOK elements. Each BOOK carries a GENRE attribute and contains elements for title, author, and price. The tree structure begins at the root and gradually branches out to the other elements.

Representing an XML document as a "tree"

This tree structure isn't important to us as Help authors or to our readers, but it makes it easy for developers to write software programs that process XML documents. Also, thinking of your XML documents in terms of a tree structure should highlight XML's appeal for producing structured documents such as online reference manuals.
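For readers curious about how a developer sees that tree, here is a small Python sketch using the standard library's xml.etree.ElementTree. It walks a shortened copy of the bookstore document (one book instead of three) from the root down and prints one indented line per element. The outline function is a name I made up for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the bookstore sample: one BOOK instead of three.
BOOKSTORE_XML = """<BOOKSTORE>
   <BOOK GENRE="novel">
      <TITLE>The Confidence Man</TITLE>
      <AUTHOR>
         <FIRST-NAME>Herman</FIRST-NAME>
         <LAST-NAME>Melville</LAST-NAME>
      </AUTHOR>
      <PRICE>11.99</PRICE>
   </BOOK>
</BOOKSTORE>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree from the root down."""
    attributes = "".join(f' {name}="{value}"' for name, value in element.attrib.items())
    lines = ["  " * depth + element.tag + attributes]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(BOOKSTORE_XML))
print("\n".join(tree_lines))
```

Each level of indentation in the output corresponds to one level of nesting in the document, with BOOKSTORE at the root and FIRST-NAME and LAST-NAME at the deepest branch.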

An Overview of XML Syntax

Let's briefly look at some of the rules for creating XML documents. First of all, you'll notice that our sample begins with a declaration: <?xml version="1.0"?>. The document itself is built from custom XML tags, which are more formally called elements; for example, <BOOK>. Elements can also have values assigned to them using an optional attribute; for example, <BOOK GENRE="novel">.

If you're familiar with HTML, XML's syntax has a few twists that will take some getting used to.

  • XML tags are case sensitive. For example, closing an <AUTHOR> element with </Author> or </author> will break the document.
  • All XML tags must be closed. For example, each <p> must have a corresponding </p>.
  • Empty tags (<IMG>, <BR>, <HR>) must have a slash in front of the closing bracket: />. For example, here's the tag used to insert an image: <IMG SRC="tazdevil.jpg"/>
  • All attribute values must be quoted. For example, the URL in this tag must include quotation marks: <a href="http://www.yahoo.com">

Another difference between HTML and XML is that while Web browsers are very forgiving about bad HTML code, XML processors are not. A single invalid tag will result in an error message and prevent your document from appearing at all, as in the following example:

Error message from an XML processor

If you code your HTML pages manually, now's a good time to build XML-friendly habits: close every tag that can be closed, put quotation marks around attribute values, and use consistent capitalization.
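The strictness described above is not peculiar to one browser. As a sketch, the Python standard library's XML parser (xml.etree.ElementTree) rejects the same mistakes listed in the syntax rules. The is_well_formed function and the sample strings are my own, written only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


GOOD = '<BOOK GENRE="philosophy"><AUTHOR>Plato</AUTHOR></BOOK>'
BAD_CASE = '<BOOK GENRE="philosophy"><AUTHOR>Plato</Author></BOOK>'  # closing tag in the wrong case
UNCLOSED = '<BOOK GENRE="philosophy"><AUTHOR>Plato</BOOK>'           # AUTHOR is never closed
UNQUOTED = '<BOOK GENRE=philosophy><AUTHOR>Plato</AUTHOR></BOOK>'    # attribute value not quoted

for label, text in [("good", GOOD), ("bad case", BAD_CASE),
                    ("unclosed", UNCLOSED), ("unquoted", UNQUOTED)]:
    print(label, is_well_formed(text))
```

Only the first string parses; each of the other three stops the parser with an error rather than being silently repaired, which is the behavior HTML authors will need to get used to.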

Adding an XSL Stylesheet

As it sits, our XML document is very boring: opening it in Internet Explorer 5 displays its structure, a view that you'd never want to inflict on your readers. (Netscape users are currently unable to view this document, but the upcoming "Gecko" release is scheduled to support XML. For more information, see www.mozilla.org.)

Click here to view the XML document. For those readers without an XML processor, the XML document appears in Internet Explorer 5 as:

Representation of an XML document in Internet Explorer 5

Linking to an XSL stylesheet lets us format our document into something more usable. We won't get into the particulars of XSL just yet, but we'll point out a few things about our stylesheet. First, you'll discover that the stylesheet begins with an XML declaration, since XSL stylesheets are themselves XML documents.

<?xml version='1.0'?>

<!-- The root is a literal HTML element; the xsl namespace is the
     working-draft one that Internet Explorer 5 recognizes. -->
<body bgcolor="ivory"
    xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <style> body {font-size: medium; font-family:
      Verdana;} </style>

  <table border="2"
      cellpadding="5">
    <tr>
      <th>Author</th>
      <th>Title</th>
      <th>Price</th>
    </tr>
    <!-- One table row per BOOK; cell order matches the headers -->
    <xsl:for-each select="BOOKSTORE/BOOK">
      <tr>
        <td><xsl:value-of select="AUTHOR"/></td>
        <td><xsl:value-of select="TITLE"/></td>
        <td><xsl:value-of select="PRICE"/></td>
      </tr>
    </xsl:for-each>
  </table>
</body>

Notice also that our stylesheet contains HTML tags such as <body> and <table>. In effect, the stylesheet is converting our XML document into HTML that can be displayed by the browser. This is probably a good time to mention that XML is not going to replace HTML. As our example shows, HTML, XML, and CSS will all work together to create Web documents.
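To show what "converting XML into HTML" amounts to, here is a hand-written Python sketch that mimics the stylesheet's loop. It is not XSL and does not use an XSL processor (the Python standard library does not include one); it simply performs the same steps with xml.etree.ElementTree on a two-book copy of the sample. The function names text_of and bookstore_to_html are assumptions of mine.

```python
import xml.etree.ElementTree as ET

# A two-book copy of the bookstore sample.
BOOKSTORE_XML = """<BOOKSTORE>
   <BOOK GENRE="novel">
      <TITLE>The Confidence Man</TITLE>
      <AUTHOR>
         <FIRST-NAME>Herman</FIRST-NAME>
         <LAST-NAME>Melville</LAST-NAME>
      </AUTHOR>
      <PRICE>11.99</PRICE>
   </BOOK>
   <BOOK GENRE="philosophy">
      <TITLE>The Gorgias</TITLE>
      <AUTHOR>
         <NAME>Plato</NAME>
      </AUTHOR>
      <PRICE>9.99</PRICE>
   </BOOK>
</BOOKSTORE>"""


def text_of(element):
    """All text inside an element, with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def bookstore_to_html(xml_text):
    """Build an HTML table with one row per BOOK, as the stylesheet's loop does."""
    store = ET.fromstring(xml_text)
    table = ET.Element("table", border="2", cellpadding="5")
    header = ET.SubElement(table, "tr")
    for label in ("Author", "Title", "Price"):
        ET.SubElement(header, "th").text = label
    for book in store.findall("BOOK"):             # plays the role of xsl:for-each
        row = ET.SubElement(table, "tr")
        for name in ("AUTHOR", "TITLE", "PRICE"):  # plays the role of xsl:value-of
            ET.SubElement(row, "td").text = text_of(book.find(name))
    return ET.tostring(table, encoding="unicode")


html = bookstore_to_html(BOOKSTORE_XML)
print(html)
```

The appeal of XSL is that the same result comes from a declarative template rather than a program, so an author can change the presentation without touching code or the XML data.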

Here's what our stylesheet-enabled XML document looks like. For those readers without an XML processor, the XML document appears in Internet Explorer 5 as:

Representation of a stylesheet-enabled XML document in Internet Explorer 5

Coming in Part 2

In Part 2 of our article, we'll explore XML syntax in greater detail. We'll also examine tools for creating XML documents and look at the power of XSL and XLink in transforming XML documents.


Scott Boggan is co-author of the award-winning Developing Online Help for Windows and a forthcoming book on HTML Help. He is a popular speaker at numerous conferences throughout the world and also teaches through the University of Washington. Scott is principal of HelpCraft (www.helpcraft.com), a training and consulting company.



Copyright © WinWriters. All Rights Reserved. sharon@winwriters.com