Saturday, August 12, 2006

 

XML Isn't "Self-Describing"

I am so sick and tired of reading that XML is "self-describing." It isn't. I could link to 100 web articles or blog posts that proclaim that it is, and even the popular "Learning XML" book by Eric Ray that I've used to teach XML says it is ("Creating self-describing data" is on the book cover). But I was working with XML for 10 years before it even existed, back when it was a 4-letter word (that's a joke about SGML that I credit to Bob DuCharme), and it wasn't self-describing then and never will be. And that's a feature, not a bug.

Let me try to be charitable and assume that when people say XML is "self-describing," they really mean "compared to something else that clearly isn't." The least "self-describing" information is just a stream of alphanumeric characters in some text format, as they might appear on a punch card. This delimiter-less encoding doesn't even make explicit the tokenization of the characters into meaningful values, so there isn't even any "self" to which any description could be assigned:

850456720060812

The information here has been encoded in a position-sensitive way, and it turns out that there are three different information components that occupy fixed-length fields in the text stream. But we can't begin to describe the information here unless we have some mapping of positions to values. The possibility of description emerges when we separate the values with commas or some other delimiter character, which tells us what information components must be described:

850,4567,20060812
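
A minimal sketch in Python of the difference between the two encodings. The field names and widths in LAYOUT are assumptions made up for illustration; the point is that the fixed-width stream cannot be split without them, while the delimited stream can be split but still says nothing about what the pieces mean.

```python
# Hypothetical field layout: (name, start, end). Nothing in the stream
# itself supplies these positions; they have to come from somewhere else.
LAYOUT = [("field_a", 0, 3), ("field_b", 3, 7), ("field_c", 7, 15)]


def split_fixed_width(stream, layout):
    """Cut a position-sensitive stream into values using an external layout."""
    return [stream[start:end] for _name, start, end in layout]


def split_delimited(stream, delimiter=","):
    """Cut a delimited stream into values; no layout is needed."""
    return stream.split(delimiter)


delimited = "850,4567,20060812"
fixed = delimited.replace(",", "")  # the same values packed with no delimiters

print(split_fixed_width(fixed, LAYOUT))
print(split_delimited(delimited))
```

Both calls yield the same three values, but only the second could do so without outside knowledge, and neither tells us what any value means.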

But commas as delimiters provide no clues about what the components mean, do not enable any association of one component to another, and do not enable one component to be contained within another. A text encoding syntax that uses multiple delimiters like EDIFACT is a step closer to self-description, because it can implicitly represent structural or semantic hierarchy among components. XML goes one step further with the syntactic mechanisms of paired text labels to distinguish the information components in a stream of text and quotes to associate one bit of information as an attribute of another. So an XML encoding of this text stream might be:

<xxx yyy="4567">850</xxx>
<zzz>20060812</zzz>

The <, >, and " characters distinguish the information being described from the "markup" that is part of its description. This syntax allows more flexibility in the encoding of the text stream (without positional encoding, we no longer have to assume that the values are of fixed length). But the information isn't described by these syntactic markup mechanisms, and that's all that XML per se is contributing so far.
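
To see what the syntax alone gives a program, here is a small sketch using Python's standard xml.etree.ElementTree module. The two elements are wrapped in a hypothetical <record> root (an assumption added here, because a well-formed XML document needs a single root element).

```python
import xml.etree.ElementTree as ET

# <record> is a hypothetical wrapper; the inner elements are the ones above.
doc = '<record><xxx yyy="4567">850</xxx><zzz>20060812</zzz></record>'
root = ET.fromstring(doc)

# Everything the parser can recover: label, attributes, and text content.
parsed = [(child.tag, dict(child.attrib), child.text) for child in root]

for tag, attributes, text in parsed:
    print(tag, attributes, text)
```

The parser recovers which characters are labels, which value is attached to which element as an attribute, and where each value starts and ends. It recovers nothing about what xxx, yyy, or zzz stand for.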

I suppose that it is the text labels inside of XML's syntactic delimiters that cause most people to think that XML is self-describing. But these tags aren't part of XML, so it isn't XML that is doing the work. And what do these "tags" really contribute anyway? Instead of xxx, yyy, and zzz, I might have encoded the text stream this way:


<TransactionType reference="4567">850</TransactionType>
<Date>20060812</Date>

Using text labels in a language we "understand" might give us a warm feeling that we are describing the text content, but the tags really don’t do that.

Choosing the terms used for tags or naming anything is often a difficult and contentious activity. Everyone naturally creates names that make sense to them, but even when describing exactly the same thing, chances are very good that two people will choose different names for it. And they will often use the same name or tag for different things.


"TransactionType" and "reference" and "Date" might suggest something about the meaning of the content, but "suggesting something" is not enough to make it self-describing. To someone familiar with the ANSI X12 EDI standard, a "TransactionType" with a value of 850 is a Purchase Order, but most people wouldn't have any idea that I used this interpretation to make up this example. Does "Date" mean the date of the purchase order or the date I wrote about it? What about a <Price> tag -- does this tag describe the retail, wholesale, discounted, or FOB Sydney price? Does it describe the price for a single item, a dozen, or a pallet-full? What's the currency? The tag by itself can't possibly distinguish between these different descriptions, so it doesn't make the information self-describing.

To be self-describing, the XML syntax and tags would have to simultaneously convey the specific information they mark up, all the semantic nuance needed to distinguish among synonyms or related concepts, and all the rules that govern relationships to other content – all without any additional information. If XML syntax and tags could magically do that by themselves we wouldn't need schemas or any documentation or other metadata. So as we said at the end of our geometry proofs, Q.E.D.
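
The dependence on outside metadata can be sketched in a few lines of Python. The code list below is a hypothetical stand-in for an agreement between data provider and consumer; its single entry follows the ANSI X12 reading of 850 used in the example above, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical external code list. It lives outside the XML document.
X12_TRANSACTION_TYPES = {"850": "Purchase Order"}

element = ET.fromstring(
    '<TransactionType reference="4567">850</TransactionType>'
)


def interpret(elem, code_list):
    """Return the agreed meaning of the element's value, or None without one."""
    return code_list.get(elem.text)


with_agreement = interpret(element, X12_TRANSACTION_TYPES)
without_agreement = interpret(element, {})

print(with_agreement)     # Purchase Order
print(without_agreement)  # None
```

The element is identical in both calls. Only the presence of the external code list changes whether 850 means anything.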

Postscript:

What do you suppose the "rating" and "weight" tags mean (from Google's recommendations to users of its "Google Base" service):

<g:rating>4</g:rating>
<g:weight>5</g:weight>

Make a guess. You will probably be wrong. The tags aren't "self-describing" enough.

-Bob Glushko







Comments:
Would it be fair to say "When viewing a single instance (record) encoded in XML, it is, on average, more self-describing than any other commonly used encoding mechanism"? Are there any other popular encoding methods that use (English) language elements as delimiters? Given both of these, I think it is fair to describe well-thought-out XML as "relatively more self descriptive."
 
Yes, it would be fair to say that XML is on average more self-describing than other text encoding syntaxes. But that's like saying that the average midget is taller than the average baby. Neither is tall enough to play basketball.

And most of the people who think that XML is "self-describing" don't know how to make "well-thought-out XML" anyway. Putting angle brackets around proprietary and ambiguous semantics just makes for proprietary and ambiguous XML. XML isn't magic.

bob glushko
 
Bob, interesting point, and I have to agree with you that XML doesn't live up to its promise. However, don't you think that it is possible within "known" problems to create self-describing information with XML?

For example, a known problem such as ordering dinner at a restaurant or a personal address book might be very capably handled by XML. There might be variations, but I think we could agree on a dozen tags that were obvious within the context. I might not be familiar with the <holdpickles/> tag, but intuitively I would understand it given my familiarity with the problem.

OTOH, say you are unfamiliar with the game of chess, so tags such as "Knight" and "Pawn" are not self-describing, as you argue (although an FYI link to Wikipedia or some other source could help create better context for those unfamiliar with the problem space).

I get what you're saying, that XML as "self-documenting" is an exaggeration, but would you agree that it has been possible to chip away at the larger problem of the complexity of managing information with XML, and that many information spaces are better off for having XML formatting?
 
agree...

Simply wrapping data in a tagged document conforming to the XML format standard does not equal "self-describing"... quite literally XML provides ease in locating the information you need for integration (given an agreed understanding of what the document represents).

Having undertaken a significant amount of XML integration work, I have found that the words in the tags are actually irrelevant to anyone who does not know the schema or has not agreed with the provider what the data wrapped by those tags represents.

Every integration solution I have worked on required data, schema and data semantic agreement. This usually emerged from a Data Quality review to ensure that the data was "clean", and that the timing and context of the data was appropriate for use.

Not that I'm saying XML is not very useful. Just that today we live in a world where agreement and standards are the key and that includes what tag labels actually mean to data provider and consumer!
 
HTML is self-describing. Case in point: what does <img src="example.jpg"/> mean, or <blockquote>?

Many XML-based languages are self-describing because the tags and attributes are named unambiguously. But XML itself is not.
 