MURATA Makoto wrote:
>2) Canonical equivalents and compatibility equivalents
>...
>If we apply character normalization before character-repertoire
>validation, we can assume that canonical equivalents are normalized.
>
>Note: W3C assumes that everything is normalized in advance.
>http://www.w3.org/TR/charmod-norm/
>
I think there are two separate requirements for Part 7:
1) To restrict the character repertoire for editorial reasons,
e.g. to avoid using obscure kanji that no-one knows.
2) To restrict the character repertoire for production-process reasons,
e.g. because the data will be saved as ISO 8859-1, or because there is no
font for Private Use Area (PUA) characters.
In the first case, the hulls and kernels approach seems good. But in the
second case, users probably want to say "ISO 8859-1" directly, and it
would be tedious for them to specify and double-check all those characters.
So Part 7 should have a way to define IANA character sets without
requiring an explicit hull-and-kernel specification somewhere.
I.e. so that an implementation could:
1) canonicalize the text
2) round-trip it through its platform's transcoders for that character set
3) canonicalize it again, just to be sure
4) check that the input is the same as the output.
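
The four steps above can be sketched in a few lines. This is only an
illustration under stated assumptions: "canonicalize" is taken to mean
Unicode NFC, the platform transcoders are Python's built-in codecs, and the
function name in_charset_repertoire is made up for this example.

```python
import unicodedata


def in_charset_repertoire(text: str, charset: str, form: str = "NFC") -> bool:
    """Return True if `text` survives a round trip through `charset`.

    Assumption: NFC is used as the canonical form; the proposal itself
    does not name a normalization form.
    """
    # Step 1: canonicalize the text.
    canonical = unicodedata.normalize(form, text)
    # Step 2: round-trip it through the platform's transcoder.
    try:
        round_tripped = canonical.encode(charset).decode(charset)
    except UnicodeEncodeError:
        # At least one character has no representation in the charset.
        return False
    # Step 3: canonicalize again, just to be sure.
    result = unicodedata.normalize(form, round_tripped)
    # Step 4: the canonicalized input must equal the output.
    return result == canonical


if __name__ == "__main__":
    print(in_charset_repertoire("cafe\u0301", "iso-8859-1"))  # True
    print(in_charset_repertoire("\u20ac", "iso-8859-1"))      # False
```

Because the text is canonicalized first, "e" followed by a combining acute
accent composes to U+00E9 and passes for ISO 8859-1, while the euro sign
fails because that character set has no code point for it.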
In other words, I think we need to accept that, for standard character
sets, it is impractical or inefficient to expect people to make up
explicit hull-and-kernel specifications, and certainly not for all known
character sets! Instead, Part 7 should make it trivial for implementers
to provide all IANA character sets, using the method above. These can be
composable with the hull-and-kernel definitions, just implemented as tests
in this parallel mechanism.
To do this, I think Part 7 needs:
1) A way to name repertoires of characters. (This is already a requirement)
2) A reserved naming convention for IANA sets, so that implementations can
build in the alternative implementation approach above. (At one stage Dan
Connolly of W3C made up URIs for the IANA names. I don't recall what they
were: something like http://www.iana.org/charset/US-ASCII )
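
To show how a reserved naming convention would let an implementation switch
to the transcoder-based test, here is a rough sketch. The URI prefix is only
the half-remembered form quoted above and should be read as hypothetical, as
should the function name; the lookup uses Python's codec registry as a
stand-in for "the platform's transcoders".

```python
import codecs
from typing import Optional

# Hypothetical reserved prefix, modelled on the example URI above.
# The real URIs, if any were ever fixed, may differ.
IANA_PREFIX = "http://www.iana.org/charset/"


def charset_from_repertoire_name(name: str) -> Optional[str]:
    """Map a reserved repertoire name to a platform codec name.

    Returns None when the name is not in the reserved namespace or the
    platform has no transcoder for it, in which case the implementation
    would need an explicit hull-and-kernel definition instead.
    """
    if not name.startswith(IANA_PREFIX):
        return None
    label = name[len(IANA_PREFIX):]
    if not label:
        return None
    try:
        # codecs.lookup resolves aliases, e.g. "US-ASCII" to "ascii".
        return codecs.lookup(label).name
    except LookupError:
        return None


if __name__ == "__main__":
    print(charset_from_repertoire_name(IANA_PREFIX + "US-ASCII"))  # ascii
    print(charset_from_repertoire_name("urn:example:my-kanji"))    # None
```

Any repertoire name outside the reserved prefix falls through to whatever
explicit definitions the schema supplies.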
Cheers
Rick Jelliffe
Received on Fri Nov 12 04:46:46 2004