The Two Perspectives on XML

I have been working with XML since it was a glimmer in the eye of Jon Bosak. In fact, before XML was conceived, there was SGML; going from SGML to XML represented a streamlining for the web, but at its core there was not much functional difference; in fact XML is a subset of SGML. The key concept of semantic markup is central to the core value of SGML/XML.

The two main perspectives I have seen are Document-centric XML and Data-centric XML. SGML initially appeared in support of document-centric work: managing all the technical documents or contracts of IBM or Boeing, for example. Charles Goldfarb has maintained that “SGML literally makes the infrastructure of modern society possible” and I think he’s right – hmm, should we blame him for the lengths to which humans have gone to destroy the earth?

The document-centric XML world is really a direct continuation of SGML. When XML came out as a standard in 1998, those of us working with document-centric XML became giddy with excitement, anticipating that the standards being proposed at the time (notably XML itself, XLink, XML Schema, RDF, XSL and precursors to SVG) would finally facilitate tools that made publishing work for organizations that weren’t quite as big as IBM or the Department of Defense. The vision of a semantic web and ubiquitous XML multi-channel publishing seemed to be growing a foundation in theories gaining critical mass, with the apparent support of software companies. It appeared these vendors might actually adopt the standards of the committees they were sitting on. “Throw away Xyvision!” I told my boss at Bertelsmann, “this XSL-FO will completely revolutionize database publishing!”

We were sorely disappointed over the next five years. In the years before 1998, W3C standards seemed magical; concepts from the standards were implemented relatively quickly, without perfection but with steady progress: browser updates would reflect CSS and HTML advances; even Microsoft was shamed into some level of compliance. But the monopolistic tendencies of some committee members, coupled with the academic approach of some of the committees themselves, made it less and less likely that a given standard would find a functional implementation.

And there was that other perspective – the data-centric side of things. For many reasons, XML was at the right place at the right time in terms of data management and information exchange. In fact, the very year that XML became a standard, it also became the dominant way that machines (servers) talked to each other around the world. It was highly convenient for exchanging information: firewalls tended to block anything but text over HTTP, XML markup allowed any sort of data structure to be specified, and validation tools could ensure that no information was lost.

In 1998, when you asked a programming candidate “what do you know about XML?” only the document-centric people would know anything. By 2000, everyone doing any serious programming “knew” about XML. Trouble was, they typically knew about “XML” only in the much easier-to-use, irrelevant-to-publishing sense.

And the standards now had to accommodate two crowds. The work of the W3C XML Schema Working Group, in particular, showed the disconnect. Should a schema be easily human readable? What was the primary purpose of Schema? Goals were not shared by the document- and data-centric sides, and the data-centric side won out, as it has tended to dominate the XML space ever since. RELAX NG came about as an alternative, and if you compare RELAX NG with W3C Schema, you will see the contrast between the power of a few brilliant individuals aligned in purity of purpose and the impotence of a committee with questionable motives and conflicting goals. Concurrent with a decline in the altruism of committee participants was the huge advance of data-centric XML and the disproportionate representation of that perspective.

Ten years later, we find in the document-centric world that toolsets related to XML in a data sense – parsing, transforming, exchanging info – have made great leaps forward, but we are in many ways still stuck in the 1990s in terms of core authoring and publishing technologies. It is telling that descendants of the three great SGML authoring tools as of 1995 (FrameMaker+SGML, Arbortext Epic, and SoftQuad’s Author/Editor) are, lo and behold, the leading three XML authoring tools in 2009.

There have been some slow-paced advances in document-centric XML standards and tool chains as well, especially the single bright light out there for us, the Darwin Information Typing Architecture (DITA), which, like SGML’s predecessor GML, came out of IBM. Yet standards for rendition, XSL-FO and SVG especially, have not advanced along with core proprietary rendition technologies such as InDesign, Flash, or Silverlight, though all of these enjoy nicely copied underpinnings pillaged from the standards. More important, nothing has stepped in to replace the three core authoring tools: the “XML support” of Microsoft Word and Adobe InDesign, for example, does not approach the capabilities of a true XML authoring application. There is a proliferation of XML “editors,” but most of the new ones are appropriate for editing a WSDL file or an XML message (the data-centric forms of XML), not a full-fledged document.

Meanwhile, on the data-centric front, XML has simply permeated every aspect of computing. There are XML data types in database systems, XML features in most programming languages, XML configuration files at the heart of most applications, and XML-based Web Services available in countless flavors.

Document-centric XML is simply a deep challenge that will take more time (and probably more of a commercial incentive) to tackle. For the time being, structured authoring managed the XML way is still implemented mainly by very large organizations: such an approach has “trickled down” from organizations the size of IBM to organizations the size of Adobe (which does, in fact, use DITA now), but there are not yet tool chains available that will bring it down much further. The failure of the W3C XML Schema Working Group to provide a functional specification supporting document-centric XML can hardly be overestimated.

As long as content is not easily authored in a semantically rich, structured fashion, the vision of the semantic web will remain an illusion. When and if document-centric XML gets more attention from standards bodies and software vendors, human communications will become far more efficient and effective.

  1. loarabia says:

    Interesting read. I’d love to hear more. I’d be particularly interested in hearing your definition of document-centric XML and data-centric XML along with some examples to get a feel for the nuances you see in the two models and I’d also love to hear your telling of some more of the history and evolution of the standards.

    thanks for a good read

  2. Bill Trippe says:

    Good thoughts. One question–did you ever get rid of Xyvision? I still see it in a lot of scientific and technical publishing.

  3. admin says:

    Thank you both for the comments! Yes, I should have described more of just what differentiates the document- and data-perspectives… will post again on this I’m sure, but in general the XML I see in the data world tends to be message-oriented: often a very flat structure, usually much smaller files, relational data wrapped in tags, SOAP messages, WSDL, etc. Developers usually love tools like XML Spy: they vastly prefer schema to DTD (Microsoft even went so far as to say “DTDs are a security risk” LOL) as strong data types do make a ton of sense for them, and as structures are often defined automatically they don’t see much value in human-readable schema. Document-centric XML has to handle, um, documents, with characteristics such as component reuse, hyperlinking, and cross-references; the things DITA handles are generally relevant only to this side of XML. Many in the document-centric world still hand-code DTDs. Of course XML is XML, so it is a continuum without a strongly identifiable border.
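
    To make the contrast concrete, here is a rough sketch in Python using the standard library's xml.etree.ElementTree. The element names (order, topic, cmd, xref) are made up for illustration and are not taken from any real schema or from DITA itself. The data-centric sample is a flat record that maps straight onto typed fields; the document-centric sample has mixed content (text interleaved with inline elements) and a cross-reference, which is roughly where the two worlds part ways.

```python
import xml.etree.ElementTree as ET

# Data-centric: a flat, message-like record (hypothetical element names).
DATA_XML = """<order id="1001">
  <customer>Acme</customer>
  <quantity>3</quantity>
  <price>9.50</price>
</order>"""

# Document-centric: mixed content plus a cross-reference
# (hypothetical, loosely DITA-flavored element names).
DOC_XML = """<topic id="install">
  <title>Installing</title>
  <p>Run the <cmd>setup</cmd> tool, then see <xref href="#config"/> for options.</p>
</topic>"""


def has_mixed_content(root):
    """Return True if any element holds real text alongside child elements."""
    for node in root.iter():
        children = list(node)
        if not children:
            continue
        # Text before the first child, plus text trailing each child.
        own_text = (node.text or "").strip()
        tail_text = "".join(child.tail or "" for child in children).strip()
        if own_text or tail_text:
            return True
    return False


data_root = ET.fromstring(DATA_XML)
doc_root = ET.fromstring(DOC_XML)

# The data-centric record collapses into a dictionary of fields,
# which can then be given strong types.
record = {child.tag: child.text for child in data_root}
total = int(record["quantity"]) * float(record["price"])

# The document-centric paragraph only makes sense read in order,
# and its links have to be resolved against other content.
paragraph_text = "".join(doc_root.find("p").itertext())
xref_targets = [x.get("href") for x in doc_root.iter("xref")]

print(has_mixed_content(data_root), has_mixed_content(doc_root))
print(total, xref_targets)
```

    Real documents sit anywhere along the continuum, so a check like has_mixed_content is only a crude heuristic, not a definition of either camp.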

    When we started Silicon Publishing in 2000, we stopped using Xyvision. At first this was not our preference but an economic reality. However, being forced into early adoption of InDesign automation turned out to be a very good thing. We built apps that opened up the desktop app and would run for 22 hours to produce, for example, 100 healthcare directories from a database, driven by a rules table. We begged Adobe for an InDesign Server for 5 years, and were in the beta before the first one came out: my request to join the beta was something like “I am probably the only person on the planet who literally dreams about an InDesign Server.”

    Xyvision was (is?) great! Very fast, great composition, one of the first/best tools to automate typographical craft. It will probably stay around a long time because many automated document generation programs work with it – we find people trying to move away from it, yet typically there is significant rework. I hope SDL keeps it going; they sure have acquired many companies. I still like InDesign Server better as it surpasses Xyvision in typography/graphics, but I miss the speed.
