[dsdl-discuss] Re: Response to rest of Martin's comments

From: Martin Bryan <martin@is-thought.co.uk>
Date: Tue Apr 13 2004 - 13:21:58 UTC

Rick

I'll chop your responses around a bit to make my comments more coherent.

A) The origin of any names for character set blocks used, whether
user-defined or from a standard, needs to be explicitly stated, and not
implied as at present.

So, re:

> In fourth example, where does IsSmallFormVariants name come from?
>
> That is a Unicode Block name, for the character range
> U+FE50 to U+FE6F, see
> http://www.unicode.org/Public/4.0-Update1/Blocks-4.0.1.txt

Firstly, let me note that the referenced property identifier is not specified
at this URL, only the "Small Form Variants" name for the character set block.
This name is clearly derived from Figure 3 of ISO 10646. One of the clear
rules in JTC1 is that we *must* reference all ISO standards from which we
derive anything, hence my request:

> > Reference should be added to ISO 10646 as default source of character
> > set naming conventions
>
> ISO 10646 names blocks and characters, but it does not name properties
> of characters.[1]

We are not naming properties, only blocks (or is IsSmallFormVariants defined
as a property somewhere in the Unicode spec that you have not identified?).

> These come from the Unicode Consortium, which provides
> the semantic layer on top of ISO 10646.

If we are using something additional from the Unicode spec this must be
identified *as well*, using a URL that points to the specific list of things
being referenced (i.e. wherever the rules for deriving Is... names come from).
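
To make the point concrete, here is a minimal Python sketch of the kind of
derivation rule that would need to be cited. It assumes the Is... names are
formed by stripping white space from the block name and prefixing "Is",
which is the convention W3C XML Schema regular expressions appear to follow;
that rule is an assumption here, not something stated at the URL above, and
the function name is hypothetical.

```python
def block_escape_name(block_name: str) -> str:
    """Derive an Is... identifier from a block name.

    Assumed rule (not taken from the thread): remove all white space,
    keep hyphens and letter case, then prefix "Is".
    """
    return "Is" + "".join(block_name.split())


if __name__ == "__main__":
    for name in ("Small Form Variants", "Basic Latin", "Latin-1 Supplement"):
        print(name, "->", block_escape_name(name))
```

Whatever the real rule is, stating it in one sentence with a reference would
let a reader get from the block name in the cited file to the identifier
used in the example.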

> I believe that ISO 10646 uses "collections" for its unit. I think
> ISO 10646 is only formally relevant; what we are interested in
> is Unicode's properties in general.

Surely we are not going to allow users to use any Unicode property, only
those used to name sets!

> If ISO 10646 is not really appropriate, what about something
> from the Unicode Consortium? Well, their spec on Regular Expressions
> and character classes[3] does have a syntax, and indeed that syntax
> is based on Perl's, as is XML Schema's, but it is only a demonstrative
> syntax, not one intended to have normative properties. So Unicode
> is also not directly useful as a normative reference for
> defining the language used for assertion tests.

The use of Unicode-compliant REs should be up to the application, and there
should be a way of clearly defining which set of rules is being applied for
interpreting regular expressions so that people can transform them as
required by their local processing environment.
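
As an illustration of such a transformation, the following Python sketch
rewrites \p{Is...} block escapes into explicit code point ranges, since
Python's standard re module does not accept \p{...}. The block table is a
small hand-copied excerpt, the function name is hypothetical, and the sketch
only handles escapes that stand on their own, not ones nested inside a
bracketed character class.

```python
import re

# Partial, illustrative table: block escape name -> (first, last) code point.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "LatinExtended-A": (0x0100, 0x017F),
}

# Matches a literal \p{IsName} and captures Name.
_BLOCK_ESCAPE = re.compile(r"\\p\{Is([A-Za-z0-9-]+)\}")


def expand_block_escapes(pattern: str) -> str:
    """Replace each standalone \\p{IsName} with an explicit [first-last] class.

    Raises KeyError for a block name that is missing from BLOCKS.
    """
    def replace(match):
        first, last = BLOCKS[match.group(1)]
        return "[\\u%04X-\\u%04X]" % (first, last)

    return _BLOCK_ESCAPE.sub(replace, pattern)


if __name__ == "__main__":
    source = r"(\p{IsBasicLatin}|\p{IsLatin-1Supplement})+"
    local = expand_block_escapes(source)
    print(local)
    print(bool(re.fullmatch(local, "caf\u00e9")))
```

A transformation like this is only safe if the schema says which regular
expression dialect the original expression was written in.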

> In the W3C WG on XML Schemas, we took that REGEX document into
> consideration when deciding on the design of the W3C regular
> expressions. (Please note that the version of the Unicode REGEX
> would be one from 1999, not the most recent version!)
>
> Schematron's technical approach has been to provide a framework
> by which existing common libraries can be readily used: the
> XSLT libraries for default Part 3. Similarly, in the Part 7
> draft, the default binding is to allow the most common, good-quality
> library but also allow better, rarer, and more specialist alternative
> implementations as users demand and implementers provide.

Should we default to something different for Part 7 from that used for Part
3, or should we insist on a single default mechanism?

> Again, the big thing is to flee as fast as possible any inclination
> to extend or define our own syntax: we need to be infected with
> a reverse NIH syndrome that allows us to have small, excellent,
> readily implementable standards that are adequate for the use-cases.
> XML Schemas is rightly criticized for betraying XML's "good
> enough to declare victory" in favour of kitchen-sinkery, and ISO
> DSDL needs to take the design approach that better accords with
> XML's.

While I agree we should not define a fixed method, I feel we need to provide
a default mechanism so that users have a starting point. Ideally I'd like
something minimal that all applications could support.

> > Clause 4
> > The value of the language attribute specified in para 2 should reflect
> > the source of the definitions, and not reference the xpath spec. It
> > should be either DSDL-charrep, DSDL-7, ISO19757-charrep or ISO19757-7.
>
> If we are putting authority in, surely we should make them URLs.
> Does ISO have any mechanism for persistent URLs? (In fact, I was
> trying to avoid the complexity of long names and the URL/PURL/Public
> identifier controversy by keeping the names as short as possible.)

We have determined that we need such a thing for other parts. SC34 has a
built-in naming convention for assigning public identifiers to ISO standards
that it behoves us to use, if only to prove we stick to our own rules.

> I don't see it as a big issue, but part of the benefit of using
> xslt-charrep for the default is that it clearly suggests what
> is going on. Assuming that no users would buy the spec, looking
> at a raw schema and seeing things that look like XPaths and the
> string "xslt" should give anyone with half a clue a pretty good idea.

If anyone goes to the XSLT spec expecting to find out what xslt-charrep is,
they are going to be disappointed. If they go to DSDL Part 7 to find out
what DSDL7-charrep is, I trust they will not be disappointed.

> > This clause currently fails to explain:
> > 1) The role of the schema, title and pattern wrapper elements
>
> All these things belong in Part 3.

Good, then a) let's see them in a draft of Part 3 and b) clearly reference
the relevant clauses of Part 3 as the source for these parts of the spec. In
particular we need a normative reference to Part 3, and an identification of
the clause in Part 3 that covers them within Clause 4 of Part 7.

> > 2) The origin of the names used to refer to character sets (which
> > should make reference to ISO 10646 names at least)
>
> See above.

We still need a sentence that explains the origin. If we reference something
declared in another document it is not enough just to list that document in
the normative references. The main text must clearly state which parts of
the referenced specs are to be applied and how they are to be applied. At
present Clause 4 fails to do this.

> > 3) How user-defined character sets can be defined and named.
>
> Do you mean character repertoires? If so, then a Part 7 Schema
> can have a system identifier (a URL) and can have an SGML
> Formal Public Identifier too, like any SGML document.

No, I mean the character repertoires as required by users, as defined locally
or externally, either by reference to local IDs or by XPath references to
sets in external documents.

> Or do you mean that you want a way to test that a (Unicode) XML document
> only contains characters from a particular character encoding
> such as Windows CP1252?

I want a way to say that IsMyFrenchCharacter refers to a list of permissible
characters I have defined, not a predefined set in the ISO list.
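
A minimal Python sketch of what such a user-defined name would amount to.
The repertoire contents, the registry and the function name are all
illustrative assumptions; only the name IsMyFrenchCharacter comes from the
paragraph above.

```python
# Hypothetical user-defined repertoire: printable ASCII plus the accented
# letters and ligatures commonly used in French. The exact membership is
# illustrative only.
MY_FRENCH_EXTRAS = "àâæçéèêëîïôœùûüÿÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ"

REPERTOIRES = {
    "IsMyFrenchCharacter": frozenset(
        [chr(c) for c in range(0x20, 0x7F)] + list(MY_FRENCH_EXTRAS)
    ),
}


def outside_repertoire(name, text):
    """Return the characters of text that are not in the named repertoire."""
    allowed = REPERTOIRES[name]
    return [ch for ch in text if ch not in allowed]


if __name__ == "__main__":
    print(outside_repertoire("IsMyFrenchCharacter", "Où est le cœur ?"))
    print(outside_repertoire("IsMyFrenchCharacter", "Straße"))
```

The point is that the name resolves to a list the user controls, whether
that list is declared locally or fetched from an external document.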

> Yes, a schema can be made which constrains
> documents to any character encoding. Pleeeeese don't ask me to
> do this as part of the text of the spec: there are just too many
> character sets.

That's why I referenced Diedrick's text, which has a list of the initial ISO
ones we can copy without any difficulty. These can quickly be listed in a
non-normative appendix to give people a leg up.

> I think this will also be answered by the same
> mechanism as I will put in for Murata-san's request for school
> kanji: an <include> element. (Previously, I had been intending
> to put in a <library> element at the top-level only, as a more
> general mechanism more similar to XML Schema's include mechanism,
> but I think doing something more like RELAX NG include meets
> Murata-san's requirements and promotes a "family approach"
> for DSDL specs.)
>
> Or do you mean you want to be able to name repertoires and use
> those names inside sch:assert/@test expressions?

That was my main intention.

> Part 7
> already inherits, from Schematron, the ability to have
> both parallel assertions (i.e. within the same rule, or
> in different patterns) and assertions in mutually exclusive contexts
> (i.e. in different rules in the same pattern). So there
> is no extra power gained from, say, having a macro mechanism
> to preprocess sch:assert/@test attributes to insert named
> expressions: but actually we do have such a mechanism:
> abstract patterns.

But you need to give examples of how these rules apply, otherwise people
will think they do not exist. If I don't see this mapping with my knowledge,
how can other users, especially those who have yet to hear of Part 3, be
expected to know them?

> More particularly, the abstract rule mechanism allows you
> to define a complex expression inside a pattern, then
> to reuse it in different rules in the same pattern. Similarly,
> the abstract pattern mechanism allows you to define a more
> complex set of rules. For example, you could have inside
> a schema
>
> <!-- include the Schematron definitions for an abstract
> pattern that checks -->
> <sch:include src="http://www.eg.com/charset/cp1252.sch"/>
>
> <sch:pattern name="Windows" is-a="WindowsDocument">
> <sch:param name="check" value="*" />
> </sch:pattern>
>
> where the referenced schema fragment would be something like
>
> <sch:pattern name="WindowsDocument" abstract="true"
> xmlns:sch="http://www.ascc.net/xml/schematron">
> <sch:rule context=" $check ">
> <sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}
> &#x201A; &#x0192; &#x201E; &#x2026;
> &#x2020; &#x2021; &#x2C6; &#x2030; &#x160;
> &#x2039; &#x152; &#x2018; &#x2019; &#x201C;
> &#x201D; &#x2022; &#x2013; &#x2014;
> &#x2DC; &#x2122; &#x161; &#x203A; &#x153; &#x178;"/>
> </sch:rule>
> </sch:pattern>

Then let's walk people through this example, or something equally complex,
without presuming they know Schematron before they read Part 7. If people
are only interested in character set validation then they should be able to
read about that without having to read the other parts of the standard
first.
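
By way of a first step in such a walk-through, here is a rough Python sketch
of the two things the quoted example does: the concrete pattern supplies a
value for the $check parameter of the abstract pattern, and the resulting
rule asserts that the selected text stays within the Windows CP1252
repertoire. The function names are hypothetical, the substitution is plain
text replacement rather than anything taken from a Schematron
implementation, and the repertoire is whatever Python's own cp1252 codec
accepts, which may differ slightly from the list in the quoted fragment
(the codec includes the euro sign, for instance).

```python
import re

# Matches $name parameter references such as $check.
_PARAM = re.compile(r"\$([A-Za-z_][A-Za-z0-9_-]*)")


def instantiate(template, params):
    """Replace each $name in template with params[name] (KeyError if unknown)."""
    return _PARAM.sub(lambda m: params[m.group(1)], template)


def in_cp1252(text):
    """True if every character of text can be encoded by Python's cp1252 codec."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Step 1: the concrete pattern binds check="*", so the rule context
    # " $check " becomes " * ", i.e. the rule applies to every element.
    print(repr(instantiate(" $check ", {"check": "*"})))

    # Step 2: the assertion passes or fails on the text of each such element.
    for sample in ("na\u00efve \u2122", "\u6f22\u5b57"):
        print(repr(sample), in_cp1252(sample))
```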

> > There also needs to be another appendix listing all the existing
> > character set names from the Multiplane set as these provide default
> > names for character sets. This list can be taken from Diedrick's
> > 2003-11-17 text for Part 7. In general we need to add many of the
> > functions provided by Diedrick's draft for allowing users to define
> > their own character subsets before this can become a realistic CD
> > text, though it makes a good first WD.
>
> I think this would be completely the wrong approach, if it had any
> normative status! We don't want to define properties ourselves, just
> the framework and the references for the default values. Properties
> are ultimately the job of Unicode Consortium, and proximately
> the job of the W3C XML Schema WG (or other implementers) who decide
> which properties are interesting and how they should be named
> (which is their prerogative according to the Unicode guidelines).

I wasn't thinking of giving it normative status, simply providing a
non-normative listing so that people can interpret the examples correctly.

> In other words, the concrete requirement is to decide whether
> a particular library (i.e. W3C character classes) meets our
> minimal use cases, not to define a set of maximal use cases
> which nothing currently implements. It is not defining a new
> language, but using the Schematron framework to bring in
> an existing language, and then verifying that it meets
> all of our "must-haves" and enough of our "should-haves" to
> be useful.

My aim is to have something that is flexible *but clearly understandable*.
At the moment it is only flexible.

Martin

Received on Tue Apr 13 15:22:10 2004
