Better XML validation than DTD


Executive summary: XML DTD sucks. XML Schema is probably what you should be using instead.

I recently had a discussion with a friend (and former coworker) who is one of the most experienced software engineers I know. One of the main reasons I respect him is that he is in favor of developer diligence, and best practices that support it - in other words, he's in favor of practices, tools, and technologies that make lazy developers do a little work now to keep from creating an unmanageable hairball later.

We were discussing a data import that I was working on using XML as the importable file format, and I voiced the opinion that the expressive power of DTD sucks as a means of defining what a valid XML document looks like, especially in the area of internal consistency. (For example, there is support for IDs and IDREFs, which you could use to link elements together as in a relational database's foreign key constraint, but an ID has to be a valid XML name, so it can't start with a digit, which is unlike most of the real-world IDs I've seen.) His response was that this wasn't a problem since modern relational databases already have robust consistency checking, and that it makes no sense to duplicate that consistency checking in an XML parser. I disagree in the general case, although in this specific case I see his point.
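To make the ID complaint concrete, here is a minimal sketch of the kind of foreign-key check XML Schema can express with `xs:key` and `xs:keyref`, using the validator that ships with the JDK (`javax.xml.validation`). The element names (`import`, `product`, `line`) and the class name `KeyRefDemo` are made up for illustration and are not from my actual import format. Note that the IDs are plain numbers, which a DTD `ID` attribute would not allow.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class KeyRefDemo {
    // Hypothetical schema: every <line product="..."> must match some <product id="...">.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='import'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='product' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='id' type='xs:positiveInteger' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "<xs:element name='line' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='product' type='xs:positiveInteger' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:sequence></xs:complexType>"
      // The "primary key": product ids must be present and unique.
      + "<xs:key name='productKey'>"
      + "<xs:selector xpath='product'/><xs:field xpath='@id'/>"
      + "</xs:key>"
      // The "foreign key": each line must point at an existing product id.
      + "<xs:keyref name='lineProduct' refer='productKey'>"
      + "<xs:selector xpath='line'/><xs:field xpath='@product'/>"
      + "</xs:keyref>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the document satisfies the schema, false if validation fails.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the sketch schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a validation error, e.g. a dangling reference
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String consistent =
            "<import><product id='1001'/><product id='1002'/><line product='1002'/></import>";
        String dangling =
            "<import><product id='1001'/><line product='9999'/></import>";
        System.out.println("consistent document valid? " + isValid(consistent));
        System.out.println("dangling reference valid?  " + isValid(dangling));
    }
}
```

The first document should pass and the second should be rejected, because `9999` does not match any `product` id. The database would catch the same mistake eventually, but here it is caught before the file gets anywhere near the database.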

To illustrate my point I'll use a similar point on which we had previously agreed. Java has compile-time checking for things like syntax, data types, and initialization, and it also has runtime checking for things like null pointers, data types, and array bounds. This helps make software written in Java safer and more defect-free than software written in other languages which rely on just runtime or just compile-time checking. Sure, you still write code with defects, but you get "busted" by the compiler or runtime sooner, which makes the overall cost of fixing the defect lower. (I'm not going to bother backing up the assertion that detecting defects earlier makes fixing them cheaper because if you've studied software engineering or done it you know this to be true.)
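A small sketch of the two layers, in case the distinction sounds abstract. The class name `LayeredChecks` and the values are invented for illustration. The commented-out lines are ones the compiler refuses to build; the two methods contain defects the compiler cannot see but the runtime catches at the exact point of failure.

```java
class LayeredChecks {
    // Runtime layer: the JVM checks array bounds on every access.
    static String readPastEnd() {
        int[] prices = new int[3];
        int index = 3; // deliberate defect: valid indexes are 0..2
        try {
            return "read " + prices[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            return "caught at runtime: ArrayIndexOutOfBoundsException";
        }
    }

    // Runtime layer: a cast the compiler cannot rule out is checked when it executes.
    static String badCast() {
        Object value = "1000 widgets";
        try {
            Integer quantity = (Integer) value; // compiles, but value is really a String
            return "cast ok: " + quantity;
        } catch (ClassCastException e) {
            return "caught at runtime: ClassCastException";
        }
    }

    public static void main(String[] args) {
        // Compile-time layer: javac rejects each of these if you uncomment it.
        // int count = "three";                      // incompatible types
        // int total; System.out.println(total);     // variable might not have been initialized
        System.out.println(readPastEnd());
        System.out.println(badCast());
    }
}
```

Neither defect silently corrupts memory or produces a wrong answer; each one fails loudly where it happens, which is the whole point of having the second layer.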

Using a syntax-aware editor (such as GNU Emacs) or a language-aware IDE (such as JBuilder) which can alert you to syntax errors or, even better, undefined variables *as you type* reduces the cost of a certain type of very common defect (such as the forgotten semicolon or parenthesis, or a plain old misspelling typo) by several orders of magnitude over finding it at runtime. Compare that to languages that rely on only one kind of checking. JavaScript is compiled in some cases but is lax about type checking; this makes it OK for developing little validation scripts but it makes it very hard to write large robust programs. C and C++ do all checking at compile time, which is good for finding bugs early, but since there is no checking done at runtime, programs written in C and C++ are susceptible to some really dramatic crashes and are a pain in the butt to debug. C and C++ programmers compensate by writing all kinds of Assert_ macros to test things, by building smart pointers and other custom runtime testing facilities, by using some really clever and complex debugging tools, by testing the bejeezus out of their programs, and by slowly and painstakingly squashing bugs in release after release. I don't like either approach (just compile time or just runtime checking) compared to having both.

Now apply this multilayered checking approach to XML and a relational database. Sure, an RDBMS will do consistency checking, but I think that the XML document format should be able to be specified in such a way as to define some consistency checking rules about the document on its own. As an example of how this would pay off, consider an XML-based message used in a supply chain management system in which asynchronous messages like "ship me 1000 widgets, here are the details" or "here's the new price list for your company's purchases of our products" are passed from one company to another. The systems are totally different but the XML messages and transport mechanism are the point of integration. Is it wise to wait until the recipient processes the message (perhaps hours or days later) and the RDBMS consistency rules reject it? How about when you're a developer for a company implementing a new integration product - should you have to just shoot messages at a known-working compatible product and see what happens to find out if it works? I say no. I say that publishing a specification for the XML message format that includes a reasonable amount of checking helps, because the sender of the message can be assured to a certain degree that the message won't be rejected at the other end. Also, XML message routers (they forward incoming messages to the appropriate internal server based on the content of the XML - pretty cool stuff) would be able to prevent stupid denial of service attacks (deliberate or due to incompetent programmers at the partner company) involving spamming the company's big important server with bogus messages meant to keep it busy with transactions that will run almost all the way and then fail.
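Here is a rough sketch of what that sender-side check could look like for the "ship me 1000 widgets" message, assuming the recipient publishes an XML Schema for it. The `order` format, its `sku`, `quantity`, and `needBy` elements, and the class name `OrderPrecheck` are all hypothetical; a real partner's schema would be loaded from wherever they publish it rather than from a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class OrderPrecheck {
    // Hypothetical message format, standing in for a schema published by the recipient.
    static final String ORDER_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='order'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='sku' type='xs:string'/>"
      + "<xs:element name='quantity' type='xs:positiveInteger'/>"
      + "<xs:element name='needBy' type='xs:date'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns the validation problems found in a well-formed message (empty list = OK to send).
    static List<String> problems(String xml) {
        final List<String> found = new ArrayList<>();
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(ORDER_XSD)))
                .newValidator();
            // Collect every validation error instead of stopping at the first one.
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { found.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Not well-formed XML (or a broken schema): nothing sensible to send.
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String good = "<order><sku>WIDGET-7</sku><quantity>1000</quantity>"
                    + "<needBy>2001-06-01</needBy></order>";
        String bad  = "<order><sku>WIDGET-7</sku><quantity>-5</quantity>"
                    + "<needBy>next Tuesday</needBy></order>";
        System.out.println("good message problems: " + problems(good));
        for (String p : problems(bad)) {
            System.out.println("bad message: " + p);
        }
    }
}
```

The sender finds out about the negative quantity and the unparseable date in milliseconds, on their own machine, instead of hours or days later when the recipient's database rejects the order. A message router could run the same check before forwarding anything to the big important server.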

One potential solution is to add code on the receiving end to do some pre-validation before trying to process the message. However, this just adds work to the recipient and doesn't do anything to prevent invalid messages from being transmitted. Another solution is to add code (possibly provided by the recipient) to the sender's system, but this can be difficult to accomplish due to intellectual property issues, incompatibility of technologies (what if the sender uses Perl on Unix but you ship a COM object?), and lack of trust (are all the senders willing to run a black-box piece of code on their system?). I think the best solution is to make the standard means of describing the document stronger, so everyone who uses a standards-compliant XML parser / validator can just validate their message against your document description and know if it's reasonable to expect that it can be processed. Of course there will be things you can only check with access to the recipient's database (checking credit or inventory for an order, or authenticating the message sender against the recipient's access list) but in general the faster you can reveal errors to developers, the better. You can always turn validation off in production (once you are confident that all your outgoing messages are valid) if performance becomes an issue.

I haven't used XML Schema in a real world application yet but it seems like this is the solution to the failings of the DTD, or at least it's trying to be. If you're developing with XML, I strongly recommend that you look into a more robust means of specifying what a valid XML document looks like such as XML Schema or one of its competitors, and migrate off of DTD as soon as possible - your system will be more robust and you'll get it working sooner as a result.

-Jamie 5/16/2001