>
>
>>Herewith attached some thoughts on the Character Repertoires Requirements:
>>
>>
>
>I have a problem with your second item:
>
>2. The syntax shall enable character repertoires to be restricted
>by:
>
>a. Minimum number of characters
>
>b. Maximum number of characters
>
>c. Characters not permitted.
>
>d. Character combinations not permitted
>
>e. Character combinations required
>
>f. Characters not permitted if other nominated character sequences
>are present/not present.
>
>g. Characters required if other nominated character sequences are
>present/not present
>
>
>
>a and b are already part of datatype specifications, and are not really
>relevant within character repertoire descriptions
>
>
What I was thinking of was such things as the number of characters in
element/attribute names and/or in an entire document.
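As a rough illustration of that kind of limit, here is a minimal sketch in Python. The function name check_lengths and both limits are hypothetical, invented for this example rather than taken from any DSDL draft. It counts code points, and for namespaced elements ElementTree reports the tag as "{uri}local", so the count would include the namespace URI.

```python
import xml.etree.ElementTree as ET

def check_lengths(xml_text, max_name_len, max_doc_len):
    """Return a list of violations of the (hypothetical) length limits."""
    problems = []
    # len() on a Python str counts Unicode code points, not bytes.
    if len(xml_text) > max_doc_len:
        problems.append("document too long")
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Check the element name and each of its attribute names.
        for name in [elem.tag] + list(elem.attrib):
            if len(name) > max_name_len:
                problems.append("name too long: " + name)
    return problems
```

For instance, check_lengths('<doc><verylongname a="1"/></doc>', 5, 1000) reports only the element name verylongname.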
>d and e should be, but currently are not, part of a datatype definition - I
>can't see how we can stop people using any combinations of declared
>characters unless you put them in exclusions as strings <exclude>"ab"
>"cd"</exclude> in Jeni's proposal
>
>
What I was thinking of was "Grapheme Clusters"; see paras 2.2 and 3.2 of
http://www.unicode.org/reports/tr18/ .
We could either use <exclude>"ab" "cd"</exclude>, as in Jeni's proposal, or
regular expressions.
Example:
[a-z\q{aa}]   Match a-z, and aa (treated as a single character in Danish)
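Not every regular expression engine supports the \q{...} notation; Python's re module, for one, does not. A minimal sketch of the same idea, assuming we emulate it with an alternation that tries the cluster before the single letters, might look like this. The names UNIT and split_units are made up for the example.

```python
import re

# Hypothetical repertoire: the letters a-z plus the cluster "aa",
# which is to be counted as one character. "aa" is listed first so
# that it is preferred over two separate "a" matches.
UNIT = re.compile(r"aa|[a-z]")

def split_units(text):
    """Split text into repertoire units.

    Returns None if any part of the text falls outside the repertoire.
    """
    units = []
    pos = 0
    while pos < len(text):
        m = UNIT.match(text, pos)
        if m is None:
            return None
        units.append(m.group())
        pos = m.end()
    return units
```

With this, "baad" splits into three units (b, aa, d), so a minimum or maximum number of characters could be counted in units rather than in codepoints.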
>f and g are again part of a regular expression rather than a repertoire
>description, I think. I suspect we need to be able to utilize Schematron to
>test such rules so it's up to how Schematron can be used to test character
>repertoire roles.
>
Character Repertoires could be defined using regular expressions.
However, I only added f and g because I do not know enough about the needs of
all languages to exclude them.
The problem, as I see it, is deciding when a Character Repertoire is a Datatype.
Both are restricted sets of Unicode codepoints.
1. If we define Chinese as containing the Unicode blocks: CJK Unified
Ideographs, CJK Unified Ideographs Extension A, CJK Compatibility
Ideographs, CJK Compatibility Forms, Enclosed CJK Letters and Months,
Small Form Variants, Bopomofo, Bopomofo Extended, this is clearly a
Character Repertoire.
2. If we define decimal as [0-9]{1,}\.[0-9]{1,10}, this is clearly a
datatype.
3. But where between these 2 extremes is the dividing line for
restricted sets of Unicode codepoints, with Character Repertoires
on one side and Datatypes on the other? For example, is a string
which can only contain Chinese a Datatype or a Character Repertoire? I
believe that the decision should be left to the user and that,
therefore, the <datatypes> and <characterRepertoires> document elements
should contain, as far as possible, the same substructures. This would
have the additional advantages that DSDL would only have to define one
syntax, not 2, that users would only have to learn one syntax, not 2,
and that developers could develop one software package which would do both tasks.
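To show how one mechanism could serve both ends of that range, here is a sketch in Python that turns the block list of item 1 into a regular expression and checks it with the same function as the decimal pattern of item 2. The names CHINESE_BLOCKS, blocks_to_pattern and is_valid are hypothetical; the block ranges are those of the Unicode blocks named above.

```python
import re

# Code point ranges of the Unicode blocks listed in item 1.
CHINESE_BLOCKS = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
    (0xFE30, 0xFE4F),  # CJK Compatibility Forms
    (0x3200, 0x32FF),  # Enclosed CJK Letters and Months
    (0xFE50, 0xFE6F),  # Small Form Variants
    (0x3100, 0x312F),  # Bopomofo
    (0x31A0, 0x31BF),  # Bopomofo Extended
]

def blocks_to_pattern(blocks):
    """Build a regex accepting strings drawn only from the given blocks."""
    ranges = "".join("\\u%04X-\\u%04X" % (lo, hi) for lo, hi in blocks)
    return re.compile("[" + ranges + "]*")

# The repertoire of item 1 and the datatype of item 2, both as patterns.
CHINESE = blocks_to_pattern(CHINESE_BLOCKS)
DECIMAL = re.compile(r"[0-9]{1,}\.[0-9]{1,10}")

def is_valid(pattern, text):
    """One check serves both repertoires and datatypes."""
    return pattern.fullmatch(text) is not None
```

Here is_valid(CHINESE, ...) and is_valid(DECIMAL, ...) go through exactly the same code path, which is the point: the difference between the two is in the pattern, not in the machinery.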
Another complication is that there are cases where there is an
interaction between Character Repertoires and Datatypes, for example:
a. In the UK we use Euro100,000.00 and in some other European
countries they use Euro100.000,00
b. UK date 13/02/2004, US date 02-13-2004
Perhaps we need a way of forcing datatypes to meet requirements that
could be implicit in a Character Repertoire.
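A small sketch of example a. in Python may make the interaction clearer. The two patterns and the function parse_amount are assumptions made up for illustration; both conventions draw on the same characters (digits, "." and ","), so the repertoire alone cannot tell them apart and the datatype has to know which convention applies.

```python
import re

# Hypothetical lexical forms for the two conventions (currency prefix omitted).
UK_AMOUNT = re.compile(r"[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}")
CONTINENTAL_AMOUNT = re.compile(r"[0-9]{1,3}(\.[0-9]{3})*,[0-9]{2}")

def parse_amount(text, convention):
    """Return the amount in minor units (cents), or None if the text
    does not match the lexical form of the given convention."""
    pattern = UK_AMOUNT if convention == "uk" else CONTINENTAL_AMOUNT
    if pattern.fullmatch(text) is None:
        return None
    # Both forms have exactly two decimal places, so dropping the
    # separators leaves the value in minor units.
    return int(text.replace(".", "").replace(",", ""))
```

Both "100,000.00" under the UK convention and "100.000,00" under the continental one give the same value, while "100.000,00" is rejected under the UK convention.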
Peter
Received on Sat Oct 2 20:48:59 2004