[dsdl-discuss] Re: Use cases for DSDL Part 6 Path based integrity constraints (discussion draft)

From: Rick Jelliffe <rjelliffe@allette.com.au>
Date: Tue May 27 2008 - 06:15:49 UTC

MURATA Makoto wrote:
> Rick,
>
>
>> One of the long-running issues with Part 6 has been the lack of an
>> adequate set of use cases.
>>
>
> I also tried to enumerate design choices in my mail
> "[dsdl-discuss] Re: FW: [office-comment] Ambiguity problems caused by a:defaultValue"
>
Yes, on 12/05/2008 at 5:02 PM you wrote:
---------------------

1) It should be simple. Something similar to key/keyref would be good
   enough.
2) It should be streamable.
3) Checking of identity constraints shall be separated from RELAX NG
   validation.
4) Given descriptions in this language, it should be easy for
    application programmers to create programs for maintaining
    hashtables.
5) It would be nice if descriptions in this language can be embedded
    within RELAX NG schemas.

We can spend a lot of time if we forget 1). Researchers do that, but
DSDLers shouldn't.

 ------------------------

and you also had some other specific use cases.

I agree with all of these. However, I think we now have many real-world
examples of integrity constraints (e.g. link checking of all kinds)
which do not have document scope, and that is a great opportunity for us
to re-think what we are doing. So your 1) has to be read in the context
of a larger scope.
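
As a rough illustration of points 2) and 4) in the list above, here is a
minimal Python sketch of a streaming pass that maintains one hashtable
per declared key. The KEYS table and the element and attribute names in
it are hypothetical, invented for this example; they are not proposed
Part 6 syntax.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical declarations: key name -> (element name, attribute name).
KEYS = {
    "part-id": ("part", "id"),
    "name-en": ("part", "name-en"),
}

def build_key_tables(source):
    """Stream the document once, filling one hashtable (set) per key."""
    tables = {name: set() for name in KEYS}
    duplicates = []
    # Attributes are already available on the "start" event.
    for _event, elem in ET.iterparse(source, events=("start",)):
        for name, (tag, attr) in KEYS.items():
            if elem.tag == tag and attr in elem.attrib:
                value = elem.attrib[attr]
                if value in tables[name]:
                    duplicates.append((name, value))
                tables[name].add(value)
    return tables, duplicates

doc = (b'<parts><part id="p1" name-en="bolt"/>'
       b'<part id="p2" name-en="nut"/>'
       b'<part id="p1" name-en="washer"/></parts>')
tables, duplicates = build_key_tables(io.BytesIO(doc))
print(duplicates)  # [('part-id', 'p1')]
```

Because each declared key gets its own table, more than one set of keys
per element falls out of the same loop.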

We should also consider building on our existing standards. For example,
could NVDL be upgraded to wrap several different documents with a single
artificial root in a way that lets us use RELAX NG and Schematron to
constrain the documents that must be available, and which thereby
simplifies the paths (e.g. a little akin to XQuery's multi-document
assumption)?
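
To make the artificial-root idea concrete, here is a small Python sketch.
It is not NVDL and does not claim to show how NVDL would do it; the
wrapper element names ("documents", "document"), the "uri" attribute and
the sample documents are all assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

def wrap_documents(docs):
    """Return one tree whose artificial root holds each document as a child."""
    root = ET.Element("documents")  # hypothetical wrapper element name
    for uri, text in docs.items():
        holder = ET.SubElement(root, "document", {"uri": uri})
        holder.append(ET.fromstring(text))
    return root

# Hypothetical sample documents.
docs = {
    "chapter.xml": '<chapter><xref target="fig1"/><xref target="fig9"/></chapter>',
    "figures.xml": '<figures><figure id="fig1"/><figure id="fig2"/></figures>',
}
root = wrap_documents(docs)

# With a single root, one path expression reaches across both documents.
ids = {f.get("id") for f in root.findall("./document/figures/figure")}
broken = [x.get("target") for x in root.iter("xref") if x.get("target") not in ids]
print(broken)  # ['fig9']
```

Once everything hangs off one root, a schema for the wrapper could also
state which documents must be present.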

>> DSDL Part 6 - Path-based integrity constraints should be developed to
>> meet the following requirements:
>>
>> 1) It should allow declaration and checking of all the constraints of
>> SGML, XML 1.0 and XML 1.1 IDs:
>> * The lexical form of the ID (in particular, matching different
>> naming rules)
>> * There can only be one ID attribute per element
>>
>
> Is this really a good restriction? We might want
> to assign an English key attribute as well as a Japanese
> key attribute. Unlike DTDs, Part 6 should allow more than
> one set of keys.
>
I am just enumerating the constraints of XML IDs. Whether it is factored
into its own declarations, or generalized, or ultimately dumped, I think
it should at least be in the mix for the requirements.

As to its goodness: having an array of unique identifiers allows exact
matching, and is perhaps required to model a DBMS primary key.

However, I don't see that being able to support a superset of XML ID
means that it should be impossible to have alternate mutually exclusive
primary keys, such as English ones and Japanese ones. For example, using
the XSD KEY/KEYREF/UNIQUE mechanism plus CharRef might be useful.
>> * The data values of IDs must be document Unique
>>
>
> I agree with this restriction, although
> scoped keys have been used by W3C XML Schema.
>
I am not saying that all keys must be document-unique, just that we
should have a syntax that maps directly to supporting XML IDs, as well
as any support for more complex mechanisms.
>> 4) It should allow declaration and checking of all constraints of XML
>> Schemas KEY, UNIQUE and KEYREF mechanisms.
>>
>
> I am not sure if we need UNIQUE. The only point of UNIQUE is
> to make keys optional. Do we need this optionality?
>
I don't think UNIQUE makes keys optional: it declares that values may
not be duplicated within a scope. That is essential.
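
The difference between the two kinds of check can be sketched in a few
lines of Python. This assumes the selected values of one scope have
already been collected into a list, with None standing for a node that
has no value; the function names are hypothetical.

```python
def check_unique(values):
    """UNIQUE-style check: absent values (None) are allowed, duplicates are not."""
    seen, errors = set(), []
    for v in values:
        if v is None:
            continue
        if v in seen:
            errors.append(f"duplicate value: {v}")
        seen.add(v)
    return errors

def check_key(values):
    """KEY-style check: as UNIQUE, but every selected node must have a value."""
    missing = [f"missing value at position {i}"
               for i, v in enumerate(values) if v is None]
    return missing + check_unique(values)

scope = ["a", None, "b", "a"]  # values selected within one scope
print(check_unique(scope))  # ['duplicate value: a']
print(check_key(scope))     # ['missing value at position 1', 'duplicate value: a']
```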
>> 5) It should allow declaration and checking of simple links, in
>> particular HTML a/href, image/src, head/meta, ODF package-internal
>> links, and XML pi()[name()='xml-stylesheet']
>> * This should cope with URLs which contain fragment references
>>
>
> Some of your examples require foreign keys. I missed them in my
> previous mail.
>
> Now that OOXML heavily uses foreign keys, I think that
> it is probably a good idea to cover foreign keys.
>
Yes, that may be a good way to describe what we are doing. I prefer to
think in terms of links, but that is not important.
>> 6) It should allow checking of two stage links, in particular SGML and
>> XML SYSTEM identifiers through OASIS XML Catalogs, OOXML OPC links
>>
>
> This makes sense. I think that this is possible by using
> two sets of keys. The first set is for the first stage, while
> the second one, the second stage.
>
Yes. I don't see any need to go beyond one level of indirection; I'd
say convenient syntax would be more important than generality.
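
A two-stage check with exactly one level of indirection might look
roughly like the following Python sketch. The relationship ids, the
catalog table and the set of available resources are hypothetical data;
a real checker would read them from an XML Catalog or an OPC
relationships part.

```python
def check_two_stage(references, catalog, available):
    """Resolve each reference through the catalog, then check the target exists."""
    errors = []
    for ref in references:
        target = catalog.get(ref)        # stage 1: the indirection table
        if target is None:
            errors.append((ref, "no catalog entry"))
        elif target not in available:    # stage 2: the resolved resource
            errors.append((ref, f"target {target} not found"))
    return errors

# Hypothetical data for the illustration.
catalog = {"rId1": "media/image1.png", "rId2": "media/image2.png"}
available = {"media/image1.png"}
errors = check_two_stage(["rId1", "rId2", "rId3"], catalog, available)
print(errors)
# [('rId2', 'target media/image2.png not found'), ('rId3', 'no catalog entry')]
```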
>> 7) It should allow checking of external links and external markup, in
>> particular complex XPath link bases.
>> * There is no necessity to provide a declaration mechanism: the XPath
>> declarations are adequate
>>
>
> Yes, foreign keys.
>

>> 7) It should allow validation of reached documents or media files
>> accessed through links.
>> * The reachability of the resource
>> * That the resource is the format expected by anchor-side metadata
>> * That the resource is the format labelled using the target-side
>> metadata (e.g. in a MIME header)
>> * That metadata associated with the resource is correct (e.g. for
>> read-write)
>>
>
> This is interesting and very web-oriented. But are these features
> identity constraints?
>
Part 6 is Path-based integrity constraints. I think that link integrity
definitely includes the questions "Does this locate some resource?",
"Is the resource what it says it is?" and "Is the resource what I am
expecting?". For example, if we allow that integrity checking includes
checking (in effect) the XPath
    document('xxx.xml')//@id = current()/@refid
then certainly this presupposes the checks that xxx.xml exists, that it
says it is an XML file (e.g. in a MIME header), and that it is at least
well-formed XML.
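
The presupposed checks can be spelled out as a short Python sketch. It
covers existence, well-formedness and the id lookup only; the MIME-header
check is left out because the example uses local files. The function
name and the sample files are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def check_reference(path, refid):
    """Return None if refid matches an id in the document at path, else a reason."""
    if not os.path.exists(path):
        return "resource not found"
    try:
        tree = ET.parse(path)
    except ET.ParseError:
        return "resource is not well-formed XML"
    if not any(e.get("id") == refid for e in tree.iter()):
        return f"no element with id {refid}"
    return None

with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "xxx.xml")
    bad = os.path.join(d, "broken.xml")
    with open(good, "w", encoding="utf-8") as f:
        f.write('<doc><item id="a1"/></doc>')
    with open(bad, "w", encoding="utf-8") as f:
        f.write("<doc><item></doc>")  # deliberately not well-formed
    results = [
        check_reference(good, "a1"),
        check_reference(good, "zz"),
        check_reference(bad, "a1"),
        check_reference(os.path.join(d, "missing.xml"), "a1"),
    ]
print(results)
```

Only the first reference resolves; each of the other three fails at a
different one of the presupposed checks.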

The metadata checking is quite interesting I think. If we make
technologies assuming a webby world (and what are we doing if not that?)
and we recognize that documents are not constructed as single XML
documents anymore, but as websites or XML-in-ZIP files, then we should
admit that some information that previously would have been available
inside our big SGML document is not now available in markup but only as
web metadata.

I remember James making a comment about this around the time of the
WebSGML discussions at SC34, that one of the big differences between the
SGML/HyTime and the WWW/HTTP approach was that the SGML kind of approach
had the metadata for entities at the declaration for the object (e.g.
using SGML data attributes) while in the web approach, the metadata was
located at the resource: resources were more "self-describing". For
this reason, SGML data attributes were not appropriate for adoption into
XML.

SGML provided the ability to validate data attributes: you could say "At
this point, you can put in an image which can be notation pdf, jpeg,
tiff, or gif but nothing else". In SGML DTDs, the notation of entities
(targets of links) was a schema issue that was fair game for
validation. In XML we lost this capability, and I think we should
consider reconstructing it: doing so means, in effect, being able to
access MIME metadata over HTTP.
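
A notation-style check against target-side metadata could be as small as
the following Python sketch. It assumes the Content-Type value has
already been fetched by some other means; the ALLOWED set simply mirrors
the pdf/jpeg/tiff/gif example above, and the function names are
hypothetical.

```python
# Anchor-side declaration: the media types allowed at this point.
ALLOWED = {"application/pdf", "image/jpeg", "image/tiff", "image/gif"}

def media_type(content_type):
    """Strip parameters from a Content-Type value and normalize its case."""
    return content_type.split(";")[0].strip().lower()

def check_notation(content_type, allowed=ALLOWED):
    """True if the resource's reported media type is one of the allowed ones."""
    return media_type(content_type) in allowed

print(check_notation("image/gif"))                  # True
print(check_notation("Image/JPEG"))                 # True
print(check_notation("image/png"))                  # False
print(check_notation("text/html; charset=utf-8"))   # False
```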

>> 8) There is no requirement or expectation that this should be
>> implementable using XSLT 1 or 2. However, streamable implementation in
>> one or two passes should be favoured.
>>
>
> I have been thinking about the use of STX. I
> am pessimistic now. In my understanding, STX does
> not allow programmers to create a key set and check
> if a referenced key already exists in the key set.
>
It can in two passes. And if we have a declarative mechanism, there is
more scope for optimizations that perform checks in one pass only. But
I would not want to make STX support a requirement, just a
consideration: streamability != STX. However, it is an important
technology that we shouldn't dismiss in any way, IMHO.
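
The two-pass and one-pass strategies can be compared in a short Python
sketch. It is not STX; it assumes the document has already been reduced
to a list of ("key", value) and ("ref", value) events in document order,
and the function names are hypothetical.

```python
def check_two_pass(events):
    """Pass 1 collects every key; pass 2 checks each reference against them."""
    keys = {v for kind, v in events if kind == "key"}
    return [v for kind, v in events if kind == "ref" and v not in keys]

def check_one_pass(events):
    """Single pass: defer references whose key has not been seen yet."""
    keys, pending = set(), []
    for kind, v in events:
        if kind == "key":
            keys.add(v)
        elif v not in keys:
            pending.append(v)  # may be a forward reference
    # Only the deferred references need re-checking at end of document.
    return [v for v in pending if v not in keys]

events = [("ref", "b"), ("key", "a"), ("ref", "a"), ("key", "b"), ("ref", "c")]
print(check_two_pass(events))  # ['c']
print(check_one_pass(events))  # ['c']
```

The one-pass version trades a second read of the input for a list of
unresolved forward references held in memory.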

Cheers
Rick Jelliffe

Received on Tue May 27 08:14:44 2008