[dsdl-discuss] Re: Response to rest of Martin's comments

From: Rick Jelliffe <ricko@allette.com.au>
Date: Wed Apr 14 2004 - 09:34:50 UTC

(Rearranged)

I am sending out new drafts for Part 3 and Part 7 today!

Martin Bryan wrote:

>>I believe that ISO 10646 uses "collections" for its unit. I think
>>ISO 10646 is only formally relevant, what we are interested in
>>is Unicode's properties in general.
>>
>>
>
>Surely we are not going to allow users to use any Unicode property, only
>those used to name sets!
>
Yes and no. My proposal is to use *full* XML Schemas character classes.
However, these do not expose all the Unicode properties. (So the full
database is not needed: just the most useful ones.) For example,
referencing characters by name is not included.

I believe there is little difficulty with implementation. There are
libraries available in most (all) XML Schemas implementations such as
Xerces, libxslt and .NET, and other implementations such as ICU4C and
ICU4J. As well, the regular expression libraries in Perl and Java are
very close. Also, the major platforms such as Win32, .NET, Java and
Mac OS X all supply

Apart from practicality and implementability, whether the character
block names are useful is a sheer matter of chance. I think a character
repertoire validator that does not allow testing characters against the
basic Unicode properties would be too cumbersome to use. For example,
the use cases include
<sch:assert test="^\p{Lm}">
This document should not use any Latin combining characters
</sch:assert>
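
To illustrate what testing against a basic Unicode property involves, here
is a minimal Python sketch. It is only an illustration and not part of the
Part 7 draft: the function name chars_with_category is hypothetical, and it
looks up general categories with Python's standard unicodedata module
rather than with an XML Schemas regular expression library.

```python
import unicodedata

def chars_with_category(text, category):
    """Return the characters of text whose Unicode general category
    starts with the given prefix, e.g. "Lm" or just "M"."""
    return [ch for ch in text if unicodedata.category(ch).startswith(category)]

# An assertion such as "no modifier letters" holds when the list is empty.
sample = "caf\u00e9"
print(chars_with_category(sample, "Lm"))
```

A one-letter prefix such as "M" selects a whole major class (all marks),
in the same way that \p{M} does in XML Schemas character classes.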

Also, please note that XML Schemas character classes provide no
facility to define user-defined classes, consequently neither does my
proposal.

I think these misunderstandings (i.e., that only ISO 10646 block names
are allowed, and that there are some user-definable tests) are the root
of the other questions about where ISO 10646 is referenced.
I believe there is no more need to reference ISO 10646 for block names
than there is to reference the Unicode Consortium. XML already references
ISO 10646, XML Schemas references XML and the Unicode Consortium,
and the Unicode Consortium references ISO 10646.

I am, of course, happy to put in some kind of note to emphasise that
the blocks used in Part 7 are the XML Schemas blocks, which are the
Unicode blocks, which are the ISO 10646 blocks, for clarity or
conformance. But for versions, the specific version of ISO 10646 is
known by finding the latest version of XML Schemas, tracking which
version of Unicode it uses, then tracking from that version of Unicode
to the appropriate version of ISO 10646: so directly referencing a
particular version of ISO 10646 would be misleading.

>A) The origin of any names for character set blocks used, whether
>user-defined or from a standard, needs to be explicitly stated, and not
>implied as at present.
>
>
See above. There are no user-defined block names.

>So, re:
>
> > In fourth example, where does IsSmallFormVariants name come from?
>
>
>>That is a Unicode Block name, for the character range
>>U+FE50 to U+FE6D, see
>>http://www.unicode.org/Public/4.0-Update1/Blocks-4.0.1.txt
>>
>>
>
>Firstly let me note that the referenced property identifier is not specified
>at this URL, only the "Small Form Variant" name for the character set block.
>This name is clearly derived from Figure 3 of ISO 10646. One of the clear
>rules in JTC1 is that we *must* reference all ISO standards from which we
>derive anything, hence my request:
>
>
See above.

>
>
>>>Reference should be added to ISO 10646 as default source of character set
>>>naming conventions
>>>
>>>
>>ISO 10646 names blocks and characters, but it does not name properties
>>of characters.[1]
>>
>>
>
>We are not naming properties, only blocks (or is IsSmallFormVariant defined
>as a property somewhere in the Unicode spec you have not identified?)
>
>
See above.

>>These come from the Unicode Consortium, which provides
>>the semantic layer on top of ISO 10646.
>>
>>
>
>If we are using something additional from the Unicode spec this must be
>identified *as well* using a URL that points to the specific list of things
>being referenced (i.e. wherever the rules for deriving Is..... names come from)
>
>
See above.

>>I believe that ISO 10646 uses "collections" for its unit. I think
>>ISO 10646 is only formally relevant, what we are interested in
>>is Unicode's properties in general.
>>
>>
>
>Surely we are not going to allow users to use any Unicode property, only
>those used to name sets!
>
See above.

>>If ISO 10646 is not really appropriate, what about something
>>from the Unicode Consortium? Well, their spec on Regular Expressions
>>and character classes[3] does have a syntax, and indeed that syntax
>>is based on Perl's as is XML Schemas, but it is only a demonstrative
>>syntax not one intended to have normative properties. So Unicode
>>is also not directly useful as a normative reference for
>>defining the language used for assertion tests.
>>
>>
>
>The use of Unicode compliant REs should be up to the application, and there
>should be a way of clearly defining which set of rules are being applied for
>interpreting regular expressions so that people can transform them as
>required by their local processing environment.
>
>
That is impossible for interoperability. The schema/@language attribute
of the framework already allows implementers to use different REGEX
libraries, but the default schema language needs to be something fixed.

>>In the W3C WG on XML Schemas, we took that REGEX document into
>>consideration when deciding on the design of the W3C regular
>>expressions. (Please note that the version of the Unicode REGEX
>>would be one from 1999, not the most recent version!)
>>
>>Schematron's technical approach has been to provide a framework
>>by which existing common libraries can be readily used: the
>>XSLT libraries for default Part 3. Similarly, in the Part 7
>>draft, the default binding is to allow the most common, good-quality
>>library but also allow better, rarer, and more specialist alternative
>>implementations as users demand and implementers provide.
>>
>>
>
>Should we default to something different for Part 7 from that used for Part
>3, or should we insist on a single default mechanism?
>
>
If there is a better alternative library around, we could also offer that.

Or, we could also define a subset of XML Schemas character classes that
only offers characters and ISO 10646 blocks, which would be quite easy
for implementers.

However, since one of our goals is to support validation of mixed content,
as a selling point to publishers for example, which XML Schemas does not
support at all, I think it is better to provide the same kind of power
that XML Schemas character classes have, otherwise we are "pulling our
punches" and maybe "shooting ourselves in the foot".

>While I agree we should not define a fixed method I feel we need to provide
>a default mechanism so that users have a start point. Ideally I'd like
>something that was minimal that all applications could support.
>
>
In this particular case, the libraries for doing this are widely
available. So I suspect that it is easier to wrap the test string in
"[" and "]*" and run it through Xerces' routines rather than write a
little parser and implementation.
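
The wrapping step can be sketched roughly as follows. This is a minimal
illustration in Python, not the Xerces routines: Python's standard re
module has no \p{...} escapes, so the sketch first expands \p{IsBlock}
escapes into explicit ranges. The BLOCKS table (deliberately partial) and
all function names are assumptions made for the example.

```python
import re

# Hypothetical, partial block table: just two blocks, for illustration.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
}

def expand_blocks(test):
    """Rewrite each \\p{IsName} escape as an explicit code point range."""
    def to_range(match):
        low, high = BLOCKS[match.group(1)]
        return "\\u%04X-\\u%04X" % (low, high)
    return re.sub(r"\\p\{Is([A-Za-z0-9-]+)\}", to_range, test)

def in_repertoire(test, text):
    """Wrap the test string in "[" and "]*" and require a full match."""
    pattern = re.compile("[" + expand_blocks(test) + "]*")
    return pattern.fullmatch(text) is not None

print(in_repertoire(r"\p{IsBasicLatin}\p{IsLatin-1Supplement}", "caf\u00e9"))
```

A library that already understands XML Schemas character classes would
skip the expansion step and only do the wrapping.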

>>>Clause 4
>>>The value of the language attribute specified in para 2 should reflect the
>>>source of the definitions, and not reference the xpath spec. It should be
>>>either DSDL-charrep, DSDL-7, ISO19757-charrep or ISO19757-7.
>>>
>>>
>>If we are putting authority in, surely we should make them URLs.
>>Does ISO have any mechanism for persistent URLs? (In fact, I was
>>trying to avoid the complexity of long names and the URL/PURL/Public
>>identifier controversy by keeping the names as short as possible.)
>>
>>
>
>We have determined that we need such a thing for other parts. SC34 have a
>built in naming convention for assigning public identifiers to ISO standards
>that it behoves us to use, if only to prove we stick to our own rules.
>
>
Great! What are they? Is that the :: form?

>>> 2) The origin of the names used to refer to character sets (which should
>>>make reference to ISO 10646 names at least)
>>>
>>>
>>See above.
>>
>>
>
>We still need a sentence that explains the origin. If we reference something
>declared in another document it is not enough just to list that document in
>the normative references. The main text must clearly state which parts of
>the referenced specs are to be applied and how they are to be applied. At
>present Clause 4 fails to do this.
>
>
Hmm. I give the production number for character classes (and for XSLT
path expressions and XPath). Why don't these adequately scope the parts
of the referenced specifications? (Furthermore, the URL in the normative
reference refers to a particular division in the W3C XML Schemas
Datatypes recommendation.)

But I will try to figure something better out.

>>>3) How user-defined character sets can be defined and named.
>>>
>>>
>>Do you mean character repertoires? If so, then a Part 7 Schema
>>can have a system identifier (a URL) and can have an SGML
>>Formal Public Identifier too, like any SGML document.
>>
>>
>
>No, I mean the character repertoires as required by users, as defined locally
>or externally, either by reference to local IDs or by XPath references to
>sets in external documents.
>
>
>
>>Or do you mean that you want a way to test that a (Unicode) XML document
>>only contains characters from a particular character encoding
>>such as Windows CP1252?
>>
>>
>
>I want a way to say that IsMyFrenchCharacter refers to a list of permissible
>characters I have defined, not a predefined set in the ISO list.
>
>
I am sending separately a new version of my suggested Part 7 which gives
examples of how to use the Part 3 mechanisms to accomplish this kind of
thing using text substitution.
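
The general idea of text substitution can be sketched in Python as below.
This does not reproduce the Part 3 syntax; the $Name token convention, the
NAMED_SETS table and the function names are all hypothetical and only show
how a locally defined name can stand for a list of permitted characters.

```python
import re

# Hypothetical named repertoire: ASCII letters plus a few accented
# letters (a-grave, a-circumflex, c-cedilla, e-grave, e-acute, e-circumflex).
NAMED_SETS = {
    "MyFrenchCharacter": "a-zA-Z\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea",
}

def substitute(test, named_sets):
    """Replace each $Name token with the character-class text it names."""
    return re.sub(r"\$(\w+)", lambda m: named_sets[m.group(1)], test)

def allowed(test, text, named_sets=NAMED_SETS):
    """Substitute names, wrap in "[" and "]*", and require a full match."""
    pattern = re.compile("[" + substitute(test, named_sets) + "]*")
    return pattern.fullmatch(text) is not None

# The trailing space in the test string adds the space character to the set.
print(allowed("$MyFrenchCharacter ", "d\u00e9j\u00e0 vu"))
```

Because the substitution happens before the test string is interpreted,
the character class language itself needs no user-defined classes.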

>>Part 7
>>already inherits, from Schematron, the ability to have
>>both parallel assertions (i.e. within the same rule, or
>>in different patterns), assertions in mutually exclusive contexts
>>(i.e. in different rules in the same patterns). So there
>>is no extra power gained from, say, having a macro mechanism
>>to preprocess sch:assert/@test attributes to insert named
>>expressions: but actually we do have such a mechanism:
>>abstract patterns.
>>
>>
>
>But you need to give examples of how these rules apply, otherwise people
>will think they do not exist. If I don't see this mapping with my knowledge
>how can other users, especially those who have yet to hear of Part 3, be
>expected to know them?
>
>
Done.

>>More particularly, the abstract rule mechanism allows you
>>to define a complex expression inside a pattern, then
>>to reuse it in different rules in the same pattern. Similarly,
>>the abstract pattern mechanism allows you to define a more
>>complex set of rules. For example, you could have inside
>>a schema
>>
>><!-- include the Schematron definitions for an abstract
>> pattern that checks -->
>><sch:include src="http://www.eg.com/charset/cp1252.sch"/>
>>
>><sch:pattern name="Windows" is-a="WindowsDocument">
>><sch:param name="check" value="*" />
>></sch:pattern>
>>
>>where the referenced schema fragment would be something like
>>
>><sch:pattern name="WindowsDocument" abstract="true"
>>xmlns:sch="http://www.ascc.net/xml/schematron">
>><sch:rule context=" $check ">
>><sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}
>>&#x201A; &#x0192; &#x201E; &#x2026;
>>&#x2020; &#x2021; &#x2C6; &#x2030; &#x160;
>>&#x2039; &#x152; &#x2018; &#x2019; &#x201C;
>>&#x201D; &#x2022; &#x2013; &#x2014;
>>&#x2DC; &#x2122; &#x161; &#x203A; &#x153; &#x178;"/>
>></sch:rule>
>></sch:pattern>
>>
>>
>
>Then let's walk people through this example, or something equally complex,
>without presuming they know Schematron before they read Part 7.
>
Done.
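
For the CP1252 case in particular, the same kind of check can be
approximated without any regular expression. A minimal Python sketch,
assuming Python's built-in cp1252 codec is an acceptable definition of the
repertoire (its mapping may differ in detail from the character list quoted
above); the function name is hypothetical.

```python
def outside_cp1252(text):
    """Return the characters of text that have no CP1252 encoding."""
    bad = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

# An empty result means the text stays within the CP1252 repertoire.
print(outside_cp1252("caf\u00e9 \u2122"))
```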

>If people
>are only interested in character set validation then they should be able to
>read about that without having to read the other parts of the standard
>first.
>
>
I was asked to prepare a version of the character repertoire validator
as a query language binding of Part 3. That it requires Part 3 to
understand is the point, and it is the thing that allows it to be,
essentially, a two page spec.

>
>I wasn't thinking of giving it normative status, simply providing a
>non-normative listing so that people can interpret the examples correctly.
>
>
OK. I will try to have this for the next version.

But does that mean, by the same token, that Part 7 also requires an XSLT
path expression tutorial? And that Part 3 requires this tutorial as well?
And what about URL syntax? The reason to reference external specs is to
avoid having to reproduce them. Surely, the proper place for tutorial
material is not in the spec. The advantage of using W3C specs is that
there is a wealth of material already available. Where do we stop?

>>In other words, the concrete requirement is to decide whether
>>a particular library (i.e. W3C character classes) meets our
>>minimal use cases, not to define a set of maximal use cases
>>which nothing currently implements. It is not defining a new
>>language, but using the Schematron framework to bring in
>>an existing language, and then verifying that it meets
>>all of our "must-haves" and enough of our "should-haves" to
>>be useful.
>>
>>
>
>My aim is to have something that is flexible *but clearly understandable*.
>At the moment it is only flexible.
>
>
I think it may also depend on one's familiarity with XML Schemas
datatypes.

(By the way, thanks for all the comments, Martin: I hope my answers are
not too thick.)
Cheers
Rick

Received on Wed Apr 14 11:35:10 2004

This archive was generated by hypermail 2.1.8 : Fri Dec 03 2004 - 14:00:28 UTC