Here is my stab at some ideas for datatyping.
From: "Martin Bryan" <mtbryan@sgml.u-net.com>
> I have, as is my custom, taken the word Primitive at its dictionary meaning:
> "simple, primary, radical, not derivative, from which another is derived"
> (not rude or uncivilised, which are alternative definitions!). I have,
> therefore, defined the basic components of times and dates as primitives
> rather than derived data, and have just defined two types of number, fixed
> point and floating point, as primitives from which others are derived.
> Hopefully this will start a raging argument over what are and are not
> primitive datatypes.
I think we need to build on XML Schema's mistakes, as part of our requirements,
and so it would be interesting to get a feel for where we think they have missed
the mark.
If they have not missed the mark, then we should not waste our time, but merely
adopt them!
Here is where I think XML Schemas goes wrong:
1) They introduce the nice distinction between lexical space and value space,
but they do not take advantage of it to allow localization. In publishing, the
data is often a given fact that cannot be tampered with. Using XML Schemas,
one cannot merely say
<USdate>12/31/02</USdate>
one has to go
<USdate value="20021231">12/31/02</USdate>
which then creates a maintenance issue of keeping the two in synch.
I believe that XML Schemas will never support localized datatypes,
because DBMS vendors regard this as a source of inefficiency,
and because such things belong to some other (vaporous) layer.
2) They do not address the relationship between "notation" and datatype.
Indeed, a Schema WG member wrote early on "whatever the answer is,
it is not notations", yet the Schema WG never determined what a notation is.
Since we have made the decision that our schema works XML<->XML,
we never have the case where there is a value existing without a lexical
form; so we never have a "type" that is not a "notation".
For DSDL, our datatypes should be available in DTDs as NOTATIONs.
And if our types can be extended (either in number or by adding new
lexical spaces) these extensions are NOTATIONs. We do not need to
invent new terms.
3) They do not cope with structured fields; or, at least, they provide
specific mechanisms for a couple of important ones (URIs and dates)
but not for anything else. In publishing it is fundamental that you only
mark up the data as far as you need to make sense of it
when reading: usually this results in documents with fields that
*could* be validated (in the DTD writer's opinion) by an automated
method in a pinch, but would not be unless there is some problem.
At the moment, such validation must use custom code (e.g. OmniMark
or Perl).
Furthermore, I think there is a human tendency to prefer leaf structures
to be in a different, terse notation when they have some grouped semantic
(the parts do not make sense except together) and when the whole
can be fitted on a line. So I don't buy the argument that all structure
is in angle brackets. That is one kind of structure that clearly does not
apply when we get to datatypes.
4) They do not cope with units properly. This is not only an issue
where there are multiple lexical forms (e.g. 16", 86cm, 82 pica),
but also in things like formulae where we might not know what the
value space is, without being told.
5) Things like space handling and lists are clearly at some different
level than "lexical space" or "value space".
So, too complex, too globalized, too biased, too hazy, too weak. The
usual complaints :-)
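As a rough illustration of complaints 1) and 4), here is a minimal Python sketch of the lexical space / value space split being used for localization: the document keeps the lexical form it was given (12/31/02, 16", 86cm) and the value is computed from it, so no duplicate value attribute has to be kept in synch. The function names, the fixed century for two-digit years, and the choice of millimetres as the value space are all assumptions made for the example; nothing here is defined by XML Schemas or DSDL. Picas are left out because their size depends on which pica is meant.

```python
import re
from datetime import date


def us_date_value(lexical, century=2000):
    """Map a US-style lexical form (MM/DD/YY) to a date value.

    The fixed century is an assumption; a real datatype would need a rule.
    """
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{2})", lexical.strip())
    if m is None:
        raise ValueError(f"not a US-style date: {lexical!r}")
    month, day, year = (int(g) for g in m.groups())
    return date(century + year, month, day)


# Several lexical forms, one value space (millimetres, chosen arbitrarily).
MM_PER_UNIT = {'"': 25.4, "in": 25.4, "cm": 10.0, "mm": 1.0}


def length_value_mm(lexical):
    """Map a number-plus-unit lexical form to a length in millimetres."""
    m = re.fullmatch(r'(\d+(?:\.\d+)?)\s*("|in|cm|mm)', lexical.strip())
    if m is None:
        raise ValueError(f"not a recognised length: {lexical!r}")
    number, unit = m.groups()
    return float(number) * MM_PER_UNIT[unit]


# The element content stays "12/31/02"; the value is derived when needed.
print(us_date_value("12/31/02").strftime("%Y%m%d"))  # prints 20021231
```

The point of the sketch is only that the mapping lives with the datatype, not in the instance document.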
By now you should be thinking "this does not sound too primitive to
me?" and you are right. I think that the primitive datatypes should
avoid *anything* which relates to whitespace manipulation, token lists,
complex structures.
So the datatype mechanism should be in two parts:
1) Primitive datatype validation
A Validator. Validate an element in the dsdl: namespace whose
element name specifies the primitive type and whose content is in the
single canonical form, such as <dsdl:integer>15</dsdl:integer>.
2) Datatype canonicalization
A Transformer. Transform an item from a given notation into
an XML structure consisting of elements which the validator
can understand. (Complex constraints between these items
can be specified using a subsequent Schematron schema, or
supplied by some built-in notation validators which can validate
the canonical-form of some well-known types, such as date.)
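The two parts could be sketched roughly as follows in Python, using the standard xml.etree.ElementTree module. This is only one possible reading of the proposal: the namespace URI, the wrapper element names (date, year, month, day), the canonical-form patterns (for instance, integers without leading zeros), and the fixed century are all hypothetical choices made for the example.

```python
import re
import xml.etree.ElementTree as ET

DSDL_NS = "http://example.org/dsdl"  # hypothetical namespace URI
ET.register_namespace("dsdl", DSDL_NS)

# Part 1: primitive validation, one canonical lexical form per type.
# These patterns are assumptions, not taken from any specification.
CANONICAL = {
    "integer": re.compile(r"-?(0|[1-9][0-9]*)"),
    "boolean": re.compile(r"true|false"),
}


def validate_primitive(elem):
    """Return True if a dsdl:* element holds its type's canonical form."""
    prefix = "{" + DSDL_NS + "}"
    if not elem.tag.startswith(prefix):
        return False
    pattern = CANONICAL.get(elem.tag[len(prefix):])
    return pattern is not None and pattern.fullmatch(elem.text or "") is not None


# Part 2: canonicalization, from a notation to a structure of primitives.
def transform_us_date(lexical):
    """Turn MM/DD/YY into an element tree built from dsdl:integer leaves."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{2})", lexical.strip())
    if m is None:
        raise ValueError(f"not in the US date notation: {lexical!r}")
    month, day, year = (int(g) for g in m.groups())
    root = ET.Element("date")
    for name, value in (("year", 2000 + year), ("month", month), ("day", day)):
        part = ET.SubElement(root, name)
        leaf = ET.SubElement(part, "{%s}integer" % DSDL_NS)
        leaf.text = str(value)
    return root


structured = transform_us_date("12/31/02")
leaves = list(structured.iter("{%s}integer" % DSDL_NS))
print(all(validate_primitive(leaf) for leaf in leaves))
```

Constraints between the parts (month between 1 and 12, day valid for the month) are deliberately not checked here; in the scheme above they would belong to a later Schematron schema or a built-in notation validator.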
Under this specification, "date" is not a primitive type. It is a "structured
datatype". The primitive types should be very few, and geared around
the operations to be performed on them for comparison:
number
boolean
string
binary
The "structure types" can be more complex:
date
url
multipart name
path
FPI
value/unit pair
Furthermore, all types should allow fallback symbols, which
are tokens that can be used in their place to circumvent
normal primitive datatyping. For example
<dsdl:integer token="#undefined" />
This copes with the main use case for XML Schemas'
union types (which are a very nice feature, but one that does
not really fit in anywhere).
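A fallback symbol check might look something like this minimal Python sketch. Only the token attribute and the #undefined symbol come from the example above; the namespace URI, the set of permitted tokens, the integer pattern, and the rule that a token excludes element content are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

DSDL_NS = "http://example.org/dsdl"  # hypothetical namespace URI
INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")  # assumed canonical form
FALLBACK_TOKENS = {"#undefined"}  # assumed set of permitted symbols


def validate_integer(elem):
    """Accept either a canonical integer or a known fallback token."""
    token = elem.get("token")
    if token is not None:
        # The token stands in for the value, so no content is expected.
        return token in FALLBACK_TOKENS and not (elem.text or "").strip()
    return INTEGER.fullmatch(elem.text or "") is not None


doc = '<dsdl:integer xmlns:dsdl="%s" token="#undefined" />' % DSDL_NS
print(validate_integer(ET.fromstring(doc)))
```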
So what positive use cases should it have? I suggest that the following
languages should be what guides us, because they are publishing related
and widespread, or in the family.
1) DOCBOOK and CALS tables
2) XSLT and XSL-FO
3) SVG
4) RELAX NG, Schematron, Topic Maps
Cheers
Rick Jelliffe
Received on Mon May 27 10:33:58 2002