[dsdl-discuss] Three debatable features for Part 7

From: MURATA Makoto <murata@hokkaido.email.ne.jp>
Date: Tue Oct 05 2004 - 14:35:15 UTC

This mail suggests three features for Part 7. If we introduce all of
them, Part 7 will be harder to develop but more useful.
What do you think?

1) Kernel and Hull

Martin Duerst introduced kernels and hulls in W3C TR "A Notation
for Character Collections for the WWW" (2001).

http://www.w3.org/TR/charcol/#Kernels

        A kernel contains characters that are guaranteed to be in the
        collection; the collection may contain other characters. A
        hull gives an outer boundary so that characters which are not
        in the hull are guaranteed not to be in the collection; some
        characters in the hull may not actually be in the collection.

We can borrow this idea. Suppose that a schema references a
character repertoire description consisting of a kernel and a hull.
If some character in an instance document does not belong to the hull,
the document is invalid. On the other hand, if some character in an
instance document belongs to the hull but not to the kernel, a warning
message is issued.
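
As a rough illustration of the kernel/hull check described above, here is
a minimal Python sketch. The function name check_repertoire and the sample
kernel and hull (ASCII letters and Latin-1) are hypothetical and only
serve the example; Part 7 does not define them. The sketch assumes the
kernel is a subset of the hull.

```python
def check_repertoire(text, kernel, hull):
    """Return (errors, warnings) for a text checked against a kernel and hull.

    errors:   characters outside the hull (the document is invalid)
    warnings: characters inside the hull but outside the kernel
    """
    errors = sorted({ch for ch in text if ch not in hull})
    warnings = sorted({ch for ch in text if ch in hull and ch not in kernel})
    return errors, warnings


# Hypothetical repertoire: ASCII letters form the kernel, Latin-1 the hull.
kernel = {chr(c) for c in range(0x41, 0x5B)} | {chr(c) for c in range(0x61, 0x7B)}
hull = {chr(c) for c in range(0x00, 0x100)}

# U+00E9 and the space are in the hull only; U+03B1 is outside the hull.
errors, warnings = check_repertoire("caf\u00e9 \u03b1", kernel, hull)
print("invalid:", errors)
print("warning:", warnings)
```

Here the Greek letter makes the document invalid, while the accented
letter and the space only trigger warnings.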

Even if we do not borrow this idea, Schematron (as a host language) can
mimic it by using two Schematron rules (one for the kernel
and another for the hull). However, neither RELAX NG nor DTDs have this
option.

2) Canonical equivalents and compatibility equivalents

If we allow a character, it makes sense to allow canonically
equivalent characters as well. But it might be tedious to specify
all canonical equivalents.

If we apply character normalization before character-repertoire
validation, we can assume that canonical equivalents are normalized.

Note: W3C assumes that everything is normalized in advance.
http://www.w3.org/TR/charmod-norm/
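
A minimal Python sketch of normalizing before repertoire validation,
using the standard unicodedata module. The function name
validate_normalized and the choice of NFC as the default form are
assumptions made for this example, not something Part 7 prescribes.
Passing NFKC instead would also fold compatibility equivalents.

```python
import unicodedata


def validate_normalized(text, allowed, form="NFC"):
    """Normalize text, then return the characters not in the allowed set."""
    normalized = unicodedata.normalize(form, text)
    return [ch for ch in normalized if ch not in allowed]


# The repertoire lists only the precomposed U+00E9.
allowed = set("caf\u00e9")

# "e" followed by U+0301 is canonically equivalent to U+00E9.
decomposed = "cafe\u0301"

# Without normalization, both "e" and U+0301 are rejected.
raw_rejects = [ch for ch in decomposed if ch not in allowed]

# With NFC applied first, the decomposed spelling is accepted.
nfc_rejects = validate_normalized(decomposed, allowed)

# With NFKC, the compatibility ligature U+FB01 folds to "f" + "i".
nfkc_rejects = validate_normalized("\ufb01", set("fi"), form="NFKC")
```

With this approach the repertoire description needs to list only the
normalized form of each allowed character.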

3) Grapheme

Casual users do not always think of a single Unicode character as a
character. Rather, a grapheme (possibly a sequence of characters) is more
natural for them. For example, "e" followed by a combining accent is a
grapheme.

Ideally, we should provide two modes. The first mode handles each
Unicode character as a basic unit. The second mode handles each default
grapheme cluster, which is defined by Unicode Standard Annex #29
("Text Boundaries"), as a basic unit.

Note: Unicode Technical Standard #18 ("Unicode Regular Expressions")
introduces the second mode, but XML Schema Part 2 does not.

The second mode has the following advantages:

- We can allow a base character X while disallowing
  a combining character Y to follow X.

- We can allow a base character X only when
  a combining character Y follows X.

- We can allow a combining character Y only
  when it follows a base character X.

- We can allow some sequences of Hangul jamo (say
  X1+Y1+Z1 and X2+Y2+Z2) without allowing all
  combinations (e.g., X1+Y2+Z2).
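
The first two points can be sketched in Python as follows. This is a
deliberately simplified approximation, not the default grapheme cluster
algorithm of UAX #29: it only attaches characters with a non-zero
canonical combining class to the preceding character, so it does not
handle Hangul jamo sequences. The names simple_clusters and
invalid_clusters are hypothetical. The text is assumed not to be
normalized to NFC, since NFC would compose "e" + U+0301 into a single
character.

```python
import unicodedata


def simple_clusters(text):
    """Split text into a base character plus any following combining marks.

    Simplified stand-in for UAX #29 segmentation (assumption of this sketch).
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the combining mark to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def invalid_clusters(text, allowed_clusters):
    """Return the clusters of text that the repertoire does not allow."""
    return [c for c in simple_clusters(text) if c not in allowed_clusters]


# Allow base "a" alone, but not "a" followed by a combining acute accent.
only_bare_a = invalid_clusters("aa\u0301", {"a"})

# Allow "e" only when a combining acute accent follows it.
only_accented_e = invalid_clusters("ee\u0301", {"e\u0301"})
```

Because the unit of validation is the whole cluster, the repertoire can
accept or reject a base character depending on what follows it, which a
per-character check cannot express.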

-- 
MURATA Makoto <murata@hokkaido.email.ne.jp>
Received on Tue Oct 5 16:35:51 2004
