[dsdl-discuss] Re: DSDL - Datatypes and Character Sets

From: MURATA Makoto <murata@hokkaido.email.ne.jp>
Date: Tue Oct 05 2004 - 02:24:04 UTC

> The problem, as I see it, is when is a Character Repertoire a Datatype.
> They are both restricted sets of Unicode codepoints.

By definition, a datatype is a set of STRINGS. I think that a
character repertoire is a set of CODE POINTS.

> What I was thinking of was "Grapheme Clusters" see paras 2.2 and 3.2 of
> http://www.unicode.org/reports/tr18/ .

Thanks for this pointer. Now, I think that a character repertoire
should probably be a set of default grapheme clusters.
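
To illustrate why code points and grapheme clusters differ, here is a
rough Python sketch. It is only an approximation: it attaches combining
marks (as reported by unicodedata.combining) to the preceding character,
which is NOT the full default grapheme cluster algorithm (for example, it
ignores Hangul jamo sequences and CR LF). The name rough_clusters is made
up for this example.

```python
import unicodedata


def rough_clusters(text):
    """Split text into approximate grapheme clusters.

    Assumption: a cluster is one character followed by any code points
    whose canonical combining class is non-zero. This is a simplification,
    not the default grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: glue it onto the previous cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(sample))                  # number of code points
print(len(rough_clusters(sample)))  # number of approximate clusters
```

Under this approximation the sample has three code points but two
clusters, so a repertoire defined over clusters can treat "e + acute" as
one member.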

> 1. If we define Chinese as containing the Unicode blocks: CJK Unified
> Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility
> Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months,
> Small Form Variants, Bopomofo, Bopomofo Extended, this is clearly a
> Character Repertoire.

I suppose that Chinese users have the same requirements as Japanese
users. In other words, Unicode blocks are rarely useful.

> 2. If we define decimal as [0-9]{1,}\.[0-9]{1,10}, this is clearly a
> datatype.

Not to me.

> 3. But where between these 2 extremes is the dividing line, for
> restricted sets of Unicode codepoints, for which Character Repertoires
> are on one side and Datatypes are on the other? For example, is a string
> which can only contain Chinese a Datatype or a Character Repertoire? I
> believe that the decision should be left to the user and that,
> therefore, the <datatypes> and <characterRepertoires> document elements
> should contain, as far as possible, the same substructures. This would

I disagree. Character repertoires and datatypes are very different.

Character repertoire descriptions have to represent huge collections of
characters by enumerating lots and lots of code values. To ease schema
authoring and maintenance, we need a rich syntax of elements and attributes.
Meanwhile, datatypes, which represent sets of strings, do not have to
enumerate lots and lots of code values. Thus, the pattern facet of XML
Schema Part 2 is good enough for them.
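
The contrast can be sketched in Python. This is only an illustration, not
a proposed syntax: the range table, the function names, and the choice of
three blocks are assumptions made for the example. The pattern is the one
quoted earlier in this message, used here simply as an example of a
pattern facet.

```python
import re

# Hypothetical repertoire: inclusive code-point ranges for three of the
# Unicode blocks named earlier in this message. A realistic repertoire
# would need many more entries, hence the wish for a rich syntax.
CHINESE_RANGES = [
    (0x3100, 0x312F),  # Bopomofo
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_repertoire(ch, ranges):
    """True if the single character ch falls inside one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def uses_only(text, ranges):
    """True if every code point of text belongs to the repertoire."""
    return all(in_repertoire(ch, ranges) for ch in text)


# A datatype as a set of strings: one short pattern, no enumeration.
DECIMAL = re.compile(r"[0-9]{1,}\.[0-9]{1,10}")


def is_decimal(s):
    """True if the whole string s matches the pattern."""
    return DECIMAL.fullmatch(s) is not None


print(uses_only("\u4e2d\u6587", CHINESE_RANGES))  # two CJK ideographs
print(is_decimal("3.14"))
```

The repertoire check is a per-character membership test over a long
table, while the datatype check is a test on the string as a whole.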

> Another complication is that there are cases where there is an
> interaction between Character Repertoires and Datatypes, for example:
> a. In the UK we use Euro100,000.00 and in some other European
> countries they use Euro100.000,00
> b. UK date 13/02/2004, US date 02-13-2004
> Perhaps we need a way of forcing datatypes to meet requirements that
> could be implicit in a Character Repertoire.

Both are examples of datatypes. They do not require Part 7.
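
As a minimal sketch of that point, each of the four formats above can be
written as a pattern over strings. The patterns and names below are
assumptions made for illustration; they check only the lexical form (they
do not reject a month of 13, for instance) and are not taken from any
specification.

```python
import re

# Hypothetical lexical patterns for the formats quoted above.
UK_MONEY = re.compile(r"Euro[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")
CONTINENTAL_MONEY = re.compile(r"Euro[0-9]{1,3}(\.[0-9]{3})*,[0-9]{2}")
UK_DATE = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")  # dd/mm/yyyy
US_DATE = re.compile(r"[0-9]{2}-[0-9]{2}-[0-9]{4}")  # mm-dd-yyyy


def matches(pattern, s):
    """True if the whole string s matches the compiled pattern."""
    return pattern.fullmatch(s) is not None


print(matches(UK_MONEY, "Euro100,000.00"))
print(matches(CONTINENTAL_MONEY, "Euro100.000,00"))
print(matches(UK_DATE, "13/02/2004"))
print(matches(US_DATE, "02-13-2004"))
```

All four use the same digits and punctuation; what differs is the
arrangement of the string, which is a datatype matter.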

Cheers,

-- 
MURATA Makoto <murata@hokkaido.email.ne.jp>
Received on Tue Oct 5 04:24:24 2004