[dsdl-discuss] Response to rest of Martin's comments

From: Rick Jelliffe <ricko@allette.com.au>
Date: Tue Apr 13 2004 - 10:03:15 UTC

Martin Bryan wrote:
> My initial (unofficial comments) on Rick's draft are:

> Reference should be added to ISO 10646 as default source of character set
> naming conventions

ISO 10646 names blocks and characters, but it does not name properties
of characters.[1] These come from the Unicode Consortium, which provides
the semantic layer on top of ISO 10646.

I believe that ISO 10646 uses "collections" for its unit.[2] I think
ISO 10646 is only formally relevant; what we are interested in
is Unicode's properties in general.

If ISO 10646 is not really appropriate, what about something
from the Unicode Consortium? Well, their spec on Regular Expressions
and character classes[3] does have a syntax, and indeed that syntax
is based on Perl's, as is XML Schemas', but it is only a demonstrative
syntax, not one intended to have normative properties. So Unicode
is also not directly useful as a normative reference for
defining the language used for assertion tests.

In the W3C WG on XML Schemas, we took that REGEX document into
consideration when deciding on the design of the W3C regular
expressions. (Please note that the version of the Unicode REGEX
document we used would be one from 1999, not the most recent version!)

Schematron's technical approach has been to provide a framework
by which existing common libraries can be readily used: for
example, the XSLT libraries as the default for Part 3. Similarly,
in the Part 7 draft, the default binding is to the most common,
good-quality library, while also allowing better, rarer, and more
specialist alternative implementations as users demand and
implementers provide.

Again, the big thing is to flee as fast as possible any inclination
to extend or define our own syntax: we need to be infected with
a reverse NIH syndrome that allows us to have small, excellent,
readily implementable standards that are adequate for the use-cases.
XML Schemas is rightly criticized for betraying XML's "good
enough to declare victory" in favour of kitchen-sinkery, and ISO
DSDL needs to take the design approach that better accords with
XML's.

> Clause 4
> The value of the language attribute specified in para 2 should reflect the
> source of the definitions, and not reference the xpath spec. It should be
> either DSDL-charrep, DSDL-7, ISO19757-charrep or ISO19757-7.

If we are putting authority in, surely we should make them URLs.
Does ISO have any mechanism for persistent URLs? (In fact, I was
trying to avoid the complexity of long names and the URL/PURL/Public
identifier controversy by keeping the names as short as possible.)

I don't see it as a big issue, but part of the benefit of using
xslt-charrep for the default is that it clearly suggests what
is going on. Assuming that no users would buy the spec, looking
at a raw schema and seeing things that look like XPaths and the
string "xslt" should give anyone with half a clue a pretty good idea.

> This clause currently fails to explain:
> 1) The role of the schema, title and pattern wrapper elements

All these things belong in Part 3.

> 2) The origin of the names used to refer to character sets (which should
> make reference to ISO 10646 names at least)

See above.

> 3) How user-defined character sets can be defined and named.

Do you mean character repertoires? If so, then a Part 7 Schema
can have a system identifier (a URL) and can have an SGML
Formal Public Identifier too, like any SGML document.

Or do you mean that you want a way to test that a (Unicode) XML
document only contains characters from a particular character encoding
such as Windows CP1252? Yes, a schema can be made which constrains
documents to any character encoding. Pleeeeese don't ask me to
do this as part of the text of the spec: there are just too many
character sets. I think this will also be answered by the same
mechanism as I will put in for Murata-san's request for school
kanji: an <include> element. (Previously, I had been intending
to put in a <library> element at the top-level only, as a more
general mechanism more similar to XML Schema's include mechanism,
but I think doing something more like RELAX NG include meets
Murata-san's requirements and promotes a "family approach"
for DSDL specs.)
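
As an aside, the check being discussed (does a Unicode document stay
within the repertoire of one legacy encoding?) can be sketched outside
Schematron. The Python below is only an illustration and not part of
any DSDL draft: the function name characters_outside is hypothetical,
and it assumes that Python's built-in "cp1252" codec is an acceptable
stand-in for the Windows CP1252 repertoire.

```python
def characters_outside(text, encoding="cp1252"):
    """Return the distinct characters of `text`, in order of first
    appearance, that the named encoding cannot represent.

    Hypothetical helper for illustration only; it relies on the
    codec tables that ship with Python.
    """
    outside = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # The codec has no byte sequence for this character.
            if ch not in outside:
                outside.append(ch)
    return outside


if __name__ == "__main__":
    sample = "caf\u00e9 \u2014 \u20ac and \u6f22\u5b57"
    offenders = characters_outside(sample)
    print([f"U+{ord(ch):04X}" for ch in offenders])
```

An empty result means the text fits the repertoire; anything else is
the list of characters a schema of this kind would have to reject.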

Or do you mean you want to be able to name repertoires and use
those names inside sch:assert/@test expressions? Part 7
already inherits, from Schematron, the ability to have
both parallel assertions (i.e. within the same rule, or
in different patterns) and assertions in mutually exclusive
contexts (i.e. in different rules in the same pattern). So there
is no extra power gained from, say, having a macro mechanism
to preprocess sch:assert/@test attributes to insert named
expressions. But actually we do have such a mechanism:
abstract patterns.

More particularly, the abstract rule mechanism allows you
to define a complex expression inside a pattern, then
to reuse it in different rules in the same pattern. Similarly,
the abstract pattern mechanism allows you to define a more
complex set of rules. For example, you could have inside
a schema

<!-- include the Schematron definitions for an abstract
   pattern that checks -->
<sch:include src="http://www.eg.com/charset/cp1252.sch"/>

<sch:pattern name="Windows" is-a="WindowsDocument">
        <sch:param name="check" value="*" />
</sch:pattern>

where the referenced schema fragment would be something like

<sch:pattern name="WindowsDocument" abstract="true"
        xmlns:sch="http://www.ascc.net/xml/schematron">
        <sch:rule context=" $check ">
                <sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}
                &#x201A; &#x0192; &#x201E; &#x2026;
                &#x2020; &#x2021; &#x2C6; &#x2030; &#x160;
                &#x2039; &#x152; &#x2018; &#x2019; &#x201C;
                 &#x201D; &#x2022; &#x2013; &#x2014;
                &#x2DC; &#x2122; &#x161; &#x203A; &#x153; &#x178;"/>
        </sch:rule>
</sch:pattern>
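
The list of extra characters in a pattern like this need not be typed
by hand. As a sketch only, the Python below derives the code points
that CP1252 places in the byte range 0x80 to 0x9F and formats them as
numeric character references. It assumes Python's built-in "cp1252"
codec as the mapping table (other vendors' tables may differ slightly),
and the function name cp1252_extras is hypothetical.

```python
def cp1252_extras():
    """Return the code points that CP1252 assigns to bytes 0x80-0x9F.

    Bytes that Python's cp1252 codec leaves unassigned are skipped.
    Hypothetical helper for illustration only.
    """
    extras = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # unassigned position in the codec's table
        extras.append(ord(ch))
    return extras


def as_character_references(code_points):
    """Format code points as XML numeric character references."""
    return " ".join(f"&#x{cp:04X};" for cp in code_points)


if __name__ == "__main__":
    print(as_character_references(cp1252_extras()))
```

Together with \p{IsBasicLatin} and \p{IsLatin-1Supplement}, the output
gives the material for the assertion in the abstract pattern.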

> Annex B:
> See comments on Annex A re need to change second example to cover French
> rather than Dutch.
> Re first example, does @name() pick up the element name rather than just
> name()?

Mistake.

> In fourth example, where does IsSmallFormVariants name come from?

That is a Unicode Block name, for the character range
U+FE50 to U+FE6F, see
http://www.unicode.org/Public/4.0-Update1/Blocks-4.0.1.txt
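
To make the naming concrete: the W3C XML Schema names are the Unicode
block names with "Is" prefixed and the spaces removed. The Python
below is a rough sketch of that derivation, not a normative rule. The
three-line excerpt is transcribed by hand in the "start..end; name"
layout of Blocks.txt and should be checked against the file itself
(the block is allocated through U+FE6F); parse_blocks and in_block are
hypothetical names.

```python
# Hand-copied excerpt in the layout used by Blocks.txt (an assumption
# to verify against the real file before relying on it).
BLOCKS_EXCERPT = """\
# start..end; block name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
FE50..FE6F; Small Form Variants
"""


def parse_blocks(text):
    """Map XSD-style names such as 'IsBasicLatin' to (start, end)."""
    blocks = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.strip().split("..")
        xsd_name = "Is" + name.strip().replace(" ", "")
        blocks[xsd_name] = (int(start, 16), int(end, 16))
    return blocks


def in_block(ch, block_name, blocks):
    """True if the character's code point lies inside the named block."""
    start, end = blocks[block_name]
    return start <= ord(ch) <= end


if __name__ == "__main__":
    blocks = parse_blocks(BLOCKS_EXCERPT)
    print(sorted(blocks))
    print(in_block("\uFE50", "IsSmallFormVariants", blocks))
```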

I hope the Japanese body will provide something better for this
example. If they can provide character classes for school kanji too,
that would be great.

> There also needs to be another appendix listing all the existing character
> set names from the Multiplane set as these provide default names for
> character sets. This list can be taken from Diedrick's 2003-11-17 text for
> Part 7. In general we need to add many of the functions provided by
> Diedrick's draft for allowing users to define their own character subsets
> before this can become a realistic CD text, though it makes a good first WD.

I think this would be completely the wrong approach, if it had any
normative status! We don't want to define properties ourselves, just
the framework and the references for the default values. Properties
are ultimately the job of the Unicode Consortium, and proximately
the job of the W3C XML Schema WG (or other implementers) who decide
which properties are interesting and how they should be named
(which is their prerogative according to the Unicode guidelines).

In other words, the concrete requirement is to decide whether
a particular library (i.e. W3C character classes) meets our
minimal use cases, not to define a set of maximal use cases
which nothing currently implements. It is not defining a new
language, but using the Schematron framework to bring in
an existing language, and then verifying that it meets
all of our "must-haves" and enough of our "should-haves" to
be useful.

Cheers
Rick Jelliffe

[1] http://www.unicode.org/notes/tn4/everson-iuc21pap.pdf
see page 3 "Character behaviours and properties".. "is standardized
by (Unicode TC) but not formally taken into account by WG2."

[2] http://www.evertype.com/standards/iso10646/ucs-collections.html

[3] http://www.unicode.org/reports/tr18/

Received on Tue Apr 13 12:03:54 2004
