[dsdl-discuss] Re: Reformulation of MNS for DSDL Part 4

From: Rick Jelliffe <ricko@topologi.com>
Date: Fri Apr 11 2003 - 06:29:30 UTC

From: "MURATA Makoto" <murata@hokkaido.email.ne.jp>

> First, I am not sure if Draconian validation (i.e., if invalid then halt)
> is really needed. Neither RELAX NG nor DTDs provide such mechanisms.

DTD validators certainly halt when a content model validity error is found, in
most (all?) implementations I have seen. In fact, one of the talking points on
Schematron when it came out was that it didn't just stop on the first error.
(And SGML implementations certainly halted on many kinds of errors: OmniMark
was great for attempting to restart and continue in the face of errors,
but there were some they could not recover from.)

To give an example, my company is just about to release a proxy validator:
it is an Apache servlet that validates incoming XML and, if valid, onsends it
to the real service or, if invalid, transforms the document and sends it to
URIs nominated in the configuration file.

For this application, we are not interested in diagnostics; all that counts is
finding "valid/invalid" and working as fast as possible. Because it is a proxy,
we want to try to get a response back as fast as possible and we don't want to
waste time continuing with validation when other things have failed.
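
To make the fail-fast idea concrete, here is a minimal Python sketch. It is
not the proxy's actual code: the validator functions and all names are
hypothetical, and the only result is valid/invalid.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List

# Hypothetical validator type: takes a parsed document, returns True if valid.
Validator = Callable[[ET.Element], bool]

def is_valid_fail_fast(doc: ET.Element, validators: Iterable[Validator]) -> bool:
    """Return False at the first failing validator; later ones never run."""
    for validate in validators:
        if not validate(doc):
            return False
    return True

# Usage: record which (made-up) checks actually ran.
ran: List[str] = []

def has_cols(doc: ET.Element) -> bool:
    ran.append("has_cols")
    return doc.get("cols") is not None

def has_rows(doc: ET.Element) -> bool:
    ran.append("has_rows")
    return doc.find("row") is not None

table = ET.fromstring('<table col="3"><row/></table>')
result = is_valid_fail_fast(table, [has_cols, has_rows])
print(result, ran)  # False ['has_cols']
```

The second check never runs, which is the saving the proxy cares about.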

> Second, if we impose Draconian validation, I believe that we have to ensure
> that any implementation stops at the same place. If we do not impose
> a particular order in which validation candidates are validated, users will be
> confused by entirely different behaviours of different processors.

The result is always the same at the top-level: valid, neutral or invalid.
The diagnostics and elements may differ between implementations, but that is part of the
bargain. Schemachine's <pass>es allowed validations to run in sequence,
which would give the same result across different implementations.

Users will be more confused by millions of redundant or spurious error
messages when an early error triggers later ones. Indeed, this was one
reason for phases in Schematron: it is quite possible for one problem to
cause many different messages. Xerces has this problem too: one error
can sometimes cause three or four error messages: the extra messages
feel like spam.

For Schematron running under DSDL, one of the reasons Schemachine
has passes is so that a Schematron schema's phases can be run in some
order, so that broad errors can be found first. For example, say we have
 <rule context="row">
        <assert test="count(entry) &lt;= ../@cols">
        In a table, the number of entries in a row must not exceed the
        number of cols
        </assert>
 </rule>
and the document fragment
 <table col="3">
        <row>
            ....
        </row>
            .. 99 more rows
  </table>
then Schematron will report 100 errors, one per row, because of the
spelling mistake of "col" rather than "cols".
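
The cascade can be reproduced with a small Python sketch that mimics the
assertion by hand. It is an illustration, not a Schematron implementation;
the function name is made up, and it assumes cols is numeric when present.

```python
import xml.etree.ElementTree as ET

def failed_row_assertions(table: ET.Element) -> int:
    """Count rows failing count(entry) <= ../@cols, XPath 1.0 style."""
    cols = table.get("cols")
    failures = 0
    for row in table.findall("row"):
        entries = len(row.findall("entry"))
        # In XPath 1.0, a number compared with an empty node-set is false,
        # so a missing cols attribute fails the assertion on every row.
        # (Assumes cols, when present, is numeric.)
        if cols is None or not entries <= float(cols):
            failures += 1
    return failures

rows = "<row><entry/><entry/><entry/></row>" * 100
misspelt = ET.fromstring('<table col="3">' + rows + "</table>")
correct = ET.fromstring('<table cols="3">' + rows + "</table>")
print(failed_row_assertions(misspelt))  # 100
print(failed_row_assertions(correct))   # 0
```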

Now we could get the user to give the assertion
   "not(../@cols) or count(entry) &lt;= ../@cols"
but if we have already had a grammar-based schema check that
cols is a required attribute, then it is unnecessary to check further.

I guess the nub of the problem is how to make modular schema
languages work together effectively: to prevent unnecessary processing,
detect errors early, and not generate duplicate or redundant error messages.

If "halt" seems too procedural, then perhaps an attribute like
severity="fatal" might be more acceptable? So if the DSDL implementation
supports (and is invoked with) "halt-on-severe-error" and a schema
has a fatal error, the rest of the validations in that part 4 module terminate.
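
As a rough Python sketch of how such a severity flag might behave (purely
illustrative: the Check class, the halt_on_fatal parameter and the severity
values are assumptions, not anything defined by DSDL or Schemachine):

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Check:
    name: str
    severity: str  # "fatal" or "error" (hypothetical values)
    run: Callable[[ET.Element], List[str]]

def validate(doc: ET.Element, checks: List[Check],
             halt_on_fatal: bool = True) -> List[str]:
    """Run checks in order; stop after a fatal check that reported anything."""
    messages: List[str] = []
    for check in checks:
        found = check.run(doc)
        messages.extend(found)
        if found and check.severity == "fatal" and halt_on_fatal:
            break
    return messages

def cols_required(table: ET.Element) -> List[str]:
    return [] if table.get("cols") is not None else ["table: cols is required"]

def row_widths(table: ET.Element) -> List[str]:
    cols = table.get("cols")
    return [f"row {i}: too many entries or no cols"
            for i, row in enumerate(table.findall("row"), 1)
            if cols is None or len(row.findall("entry")) > float(cols)]

checks = [Check("grammar", "fatal", cols_required),
          Check("rows", "error", row_widths)]
doc = ET.fromstring('<table col="3">' + "<row><entry/></row>" * 100 + "</table>")
print(len(validate(doc, checks)))                       # 1
print(len(validate(doc, checks, halt_on_fatal=False)))  # 101
```

With the flag on, the user sees the one message that matters; with it off,
the same root cause produces 101.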

If part 4 takes over functionality from the framework, it has to have
features to address these kinds of requirements.

>> Let us say George Bush takes over DOCBOOK and give it a new namespace.
>> However, it has the same structure as OASIS DOCBOOK documents.

> I now understand. Your observation about Schematron applies to RELAX NG
> since it also uses namespace URIs to specify versions of RELAX NG.

Yes, indeed it applies to all use of namespaces for documents.
Cool URLs don't change; however, cool companies do get bought
up and their names change. And cool standards get developed by
one body and then adopted or further developed by another.

> Right. Moreover, you certainly disallow the mixture of two DOCBOOK
> namespaces in a document.

I don't think so. That is an extra validation requirement that can be
handled by Schematron, for example. If someone cuts and pastes a fragment
from OASIS DOCBOOK into a George Bush DOCBOOK document,
and the editor is smart enough to move the namespace but not smart
enough to correct it, the user should not have to suffer.

At most we could make it a DSDL constraint that a document cannot contain
elements from namespaces A and B if namespace A is an alias of namespace
B. Or a feature, or command-line option. But I don't see why it is necessarily
wrong for a document to have elements from aliased namespaces: a document
constructed from fragments of different documents is the example.
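
If such a check were offered as an option, it might look something like the
following Python sketch. The alias table and the example.org namespace URIs
are invented for illustration; nothing here is part of DSDL.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

# Hypothetical alias table: alias namespace URI -> canonical namespace URI.
ALIASES: Dict[str, str] = {
    "http://example.org/bush-docbook": "http://example.org/oasis-docbook",
}

def namespace_of(tag: str) -> str:
    """Extract the namespace URI from an ElementTree '{uri}local' tag."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def mixed_aliases(doc: ET.Element) -> Set[str]:
    """Return canonical namespaces that appear under more than one URI."""
    seen: Dict[str, Set[str]] = {}
    for el in doc.iter():
        ns = namespace_of(el.tag)
        seen.setdefault(ALIASES.get(ns, ns), set()).add(ns)
    return {canon for canon, uris in seen.items() if len(uris) > 1}

doc = ET.fromstring(
    '<book xmlns="http://example.org/bush-docbook">'
    '<para xmlns="http://example.org/oasis-docbook"/></book>')
print(mixed_aliases(doc))  # {'http://example.org/oasis-docbook'}
```

Whether a non-empty result is an error, a warning, or nothing at all would
then be up to the option the user chose.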

Cheers
Rick Jelliffe

Received on Fri Apr 11 08:25:33 2003
