[dsdl-discuss] Re: Datatypes

From: Martin Bryan <mtbryan@sgml.u-net.com>
Date: Tue May 28 2002 - 07:48:16 UTC

Rick

Thanks for your comments. Some thoughts on them follow.

>I think we need to build on XML Schema's mistakes, as part of our
requirements, and so it would be interesting to get a feel for where we
think they have missed the mark.

The problem I have is that the complaints from different sources suggest
different problems, with most things being complained about at one stage or
another. What are the "enduring complaints"?

>If they have not missed the mark, then we should not waste our time, but
merely adopt them!

I don't think we want to adopt the exact wording of the W3C text, first
because of its obfuscatory nature and secondly because the way constraints
are specified is not in conformance with ISO policy. I will have to rewrite
the text to conform to ISO rules, and hope to be able to simplify the
constraint specification in the process.

>Here is where I think XML Schemas goes wrong:

>1) They introduce the nice distinction between lexical space and value
space, but they do not take advantage of it to allow localization. In
publishing, the data is often a given fact that cannot be tampered with.
Using XML Schemas, one cannot merely say
  <USdate>12/31/02</USdate>
one has to go
  <USdate value="20021231">12/31/02</USdate>
which then creates a maintenance issue of keeping the two in synch.

>I believe that XML Schemas will never support localized datatypes,
because DBMS vendors regard this as a source of inefficiency,
and because such things belong to some other (vaporous) layer.

This is a nice example of where the DSDL framework is needed before the
datatypes spec. The problem is that the "canonical value", which is what
should be stored, is transformed twice: once from the input value (12/31/02
from a Yank!) and once to the output value (31/12/2002 for a Brit!). You
need to know the context of both the sender and the receiver before you can
convert a date like 02/02/02.
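
To make the double transformation concrete, here is a minimal Python sketch.
The locale table and the function names are hypothetical, invented for
illustration only; nothing here is taken from a DSDL draft.

```python
from datetime import date, datetime

# Hypothetical per-context lexical formats (assumptions for illustration).
LEXICAL_FORMATS = {
    "en-US": "%m/%d/%y",   # the sender writes 12/31/02
    "en-GB": "%d/%m/%Y",   # the receiver expects 31/12/2002
}

def to_canonical(text, sender_context):
    """First transformation: sender's lexical form -> canonical value."""
    return datetime.strptime(text, LEXICAL_FORMATS[sender_context]).date()

def from_canonical(value, receiver_context):
    """Second transformation: canonical value -> receiver's lexical form."""
    return value.strftime(LEXICAL_FORMATS[receiver_context])

canonical = to_canonical("12/31/02", "en-US")
print(canonical.isoformat())                # 2002-12-31
print(from_canonical(canonical, "en-GB"))   # 31/12/2002
```

Neither function can be called without naming a context, which is the point:
the string alone does not carry enough information.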

Should our datatypes be "canonical datatypes" - i.e. types that can safely
be used for checking between transformations - rather than "raw datatypes" of
the type that can be used to control transformation?

>2) They do not address the relationship between "notation" and datatype.
Indeed, a Schema WG member wrote early on "whatever the answer is,
it is not notations", yet the Schema WG never determined what a notation is.

>Since we have made the decision that our schema works XML<->XML,
we never have the case where there is a value existing without a lexical
form; so we never have a "type" that is not a "notation".

I feel this is misleading. The XML type for data is simply "string" - it has
no lexical form at all. What DSDL datatypes are all about is the application
of a "notation" to strings that occur in particular contexts.

>For DSDL, our datatypes should be available in DTDs as NOTATIONs.

This I cannot agree with, if I read what you say correctly. Datatypes are
constraints on strings but they should not be declared as notations, which
were specifically designed to identify data that was not to be interpreted
as a string. The other major problem I have is that we need to associate
datatypes with attribute values, and particularly with lists of permitted
values for both elements and attributes. I don't see how we can do this with
the SGML/XML Notation specification as it exists today.

>And if our types can be extended (both in number or by adding new
lexical spaces) these extensions are NOTATIONs. We do not need to
invent new terms.

But we would need to specify a completely new way of specifying notations,
which would confuse the hell out of the community.

>3) They do not cope with structured fields; or, at least, they provide
specific mechanism for a couple of important ones (URIs and dates)
but not for anything else.

This was one of the points I was trying to make by including the list of
things from other datatype standards, which treat structures like fractions,
arrays, tables, etc as basic datatypes. I am far from convinced myself that
this is a good thing, but want to start a debate on it here.

As you point out below, measurements are a key factor that needs to be
allowed for, as do ratios (either of the lbs per sq ft type or of the more
complicated $15 per cubic litre type). But this is a really thorny issue,
and one that no standardization body has yet tackled successfully :-(

>In publishing it is fundamental that you only markup the data to the
amount that you need to make sense of it when reading: usually this results
in documents with fields that *could* be validated (in the DTD writer's
opinion) by an automated method in a pinch, but would not be unless there is
some problem. At the moment, such validation must use custom code (e.g.
OmniMark, or Perl)

It's not just validation that is the problem: you may have a
transformation problem as well. You may need to automatically convert
figures in lbs per sq ft to some standardized measurement such as bar for it
to be usable on output. Yet the source format has to be retained as that is
needed for legal validity of the data.
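
A rough Python sketch of keeping the source string alongside a derived value
follows. It assumes "lbs per sq ft" means pound-force per square foot (about
47.880259 Pa), and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumption: pound-force per square foot, rounded to 47.880259 Pa.
PA_PER_PSF = Decimal("47.880259")
PA_PER_BAR = Decimal("100000")   # 1 bar is defined as 100000 Pa

@dataclass(frozen=True)
class Pressure:
    source_text: str   # retained verbatim for legal validity
    bar: Decimal       # derived value, usable on output

def pressure_from_psf(source_text, psf_value):
    """Derive a value in bar without discarding the original string."""
    bar = Decimal(psf_value) * PA_PER_PSF / PA_PER_BAR
    return Pressure(source_text, bar)

reading = pressure_from_psf("2000 lbs per sq ft", "2000")
print(reading.source_text)   # 2000 lbs per sq ft
```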

>Furthermore, I think there is a human tendency to like leaf structures
to be in a different terse notation, when they have some grouped semantic
that means the parts do not make sense except together than where it
can all be fitted on a line. So I don't buy the argument that all structure
is in angle brackets. That is one kind of structure that clearly does not
apply when we get to datatypes.

While I agree that all the structure should not need to be in angle
brackets, I think there is a need to distinguish between the primitives of a
terse format, e.g. day/month/year, and the terse structure of the input,
31/12/02. DSDL needs to define both a) the primitives and b) the rules that
identify valid structures that can be constructed from those primitives
within a particular document type (e.g. 20040302 is validatable as it is
unambiguous, whereas 02/03/04 is ambiguous as it has three possible
interpretations if no clear rules are specified). We should recommend that
data storage should always be in an unambiguous format, even if input is
allowed in an ambiguous format, provided the rules are clearly stated in the
DTD.
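
The three readings of 02/03/04 can be enumerated mechanically. This is only a
sketch: the three candidate orderings and their labels are my assumption about
which rules a document type might permit.

```python
from datetime import datetime

# Candidate field orders for a terse xx/xx/xx date (illustrative labels).
CANDIDATE_ORDERS = {
    "day/month/year": "%d/%m/%y",
    "month/day/year": "%m/%d/%y",
    "year/month/day": "%y/%m/%d",
}

def interpretations(text):
    """Return every valid reading of text, keyed by field order, as YYYYMMDD."""
    found = {}
    for label, fmt in CANDIDATE_ORDERS.items():
        try:
            found[label] = datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            pass  # this field order does not yield a real date
    return found

readings = interpretations("02/03/04")
for label, stored in readings.items():
    print(label, stored)
# day/month/year 20040302
# month/day/year 20040203
# year/month/day 20020304
```

Only when the DTD fixes one of these orders does the input map to a single
stored value.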

>4) They do not cope with units properly. This is not only an issue
where there are multiple lexical forms (e.g. 16", 86cm, 82 pica),
but also in things like formulae where we might not know what the
value space is, without being told.

Simple measurements are relatively easy to parse; it's the ratios and the
compound figures (13ft 3&half;") that are the killers. The question is
where we should draw the line. (Thoughts, anyone?)
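
As a sketch of the "easy" end of that line, the following Python accepts one
number followed by one unit and rejects anything compound. The unit table is
an assumption made up for this example, not a proposed list.

```python
import re
from decimal import Decimal

# Unit spellings are illustrative assumptions, not a proposed DSDL unit list.
UNIT_ALIASES = {'"': "in", "in": "in", "cm": "cm", "pica": "pica", "ft": "ft"}

# One number followed by one unit symbol, and nothing else.
SIMPLE_MEASUREMENT = re.compile(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+|")')

def parse_measurement(text):
    """Split a simple value/unit pair; raise ValueError for anything else."""
    match = SIMPLE_MEASUREMENT.fullmatch(text.strip())
    if match is None or match.group(2) not in UNIT_ALIASES:
        raise ValueError(f"not a simple value/unit pair: {text!r}")
    return Decimal(match.group(1)), UNIT_ALIASES[match.group(2)]

print(parse_measurement('16"'))      # (Decimal('16'), 'in')
print(parse_measurement("86cm"))     # (Decimal('86'), 'cm')
print(parse_measurement("82 pica"))  # (Decimal('82'), 'pica')
```

A compound figure such as 13ft 3" raises ValueError here; handling it needs a
grammar for sequences of pairs, which is where the line-drawing starts.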

>5) Things like space handling and lists are clearly at some different
level than "lexical space" or "value space".

Does this mean that you think datatypes should not have anything to say
about lists? The problem is that we need to know when a string represents a
single value and when it identifies a set of values. How do I do this
without having some form of tokenizer?
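
A small Python sketch of why the tokenizer matters: the same string passes or
fails depending on whether it is declared to be a list. Splitting on
whitespace is an assumption borrowed from XML's NMTOKENS-style attributes, and
the function names are hypothetical.

```python
import re
from typing import Callable, List

def tokenize(text: str) -> List[str]:
    # Assumption: list items are separated by runs of whitespace.
    return text.split()

def validate(text: str, is_valid_item: Callable[[str], bool], as_list: bool) -> bool:
    """Check text as a single value, or item by item when it is a list."""
    items = tokenize(text) if as_list else [text]
    return bool(items) and all(is_valid_item(item) for item in items)

def is_integer(item: str) -> bool:
    return re.fullmatch(r"[+-]?\d+", item) is not None

print(validate("10 20 30", is_integer, as_list=True))    # True
print(validate("10 20 30", is_integer, as_list=False))   # False
```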

>So, too complex, too globalized, too biased, too hazy, too weak. The
usual complaints :-)

Is this of my proposed set, or of the W3C schema datatypes?

>By now you should be thinking "this does not sound too primitive to
me?" and you are right. I think that the primitive datatypes should
avoid *anything* which relates to whitespace manipulation, token lists,
complex structures.

Where do you draw the line between "complex structures" and "token lists"?
Is "8 picas" a complex structure, or a token list? Is 31/12/02 a complex
structure or a primitive datatype? Is "3.5" a fixed point decimal or a
floating point mantissa?

>So the datatype mechanism should be in two parts:

> 1) Primitive datatype validation
   A Validator. Validate an element in the dsdl: namespace whose
   element name specifies the primitive type in the
   single canonical form. Such as <dsdl:integer>15</dsdl:integer>

I don't like the idea of having to put everything you want validated into an
element with a specific namespace. As I said above, datatypes are about the
validation of the strings that remain after parsing has taken place, not the
validation of the markup.

> 2) Datatype canonicalization
    A Transformer. Transform an item from a given notation into
   an XML structure consisting of elements which the validator
   can understand. (Complex constraints between these items
   can be specified using a subsequent Schematron schema, or
   supplied by some built-in notation validators which can validate
   the canonical-form of some well-known types, such as date.)

I'm not convinced that "transformation rules" should be part of the datatype
spec, though I do think datatype canonicalization is useful. I particularly
dislike the idea of transforming "into an XML structure consisting of
elements". I believe transformation rules should be part of the constraint
language or the general purpose transformation mechanisms. What I would
accept is that we should have a canonicalized form for the storage of some
non-quantity datatypes (such as dates and numbers) but that the other
datatypes should just be "pattern matching strings". (What do the rest of
you think: should the datatype specification concern itself with data
transformation both to and from a canonicalized form?)

>Under this specification, "date" is not a primitive type. It is a
"structured datatype".

Glad you agree with me on something :-)

> The primitive types should be very few, and geared around
the operations to be performed on them for comparison:
  number
  boolean
  string
  binary

Not sure that we can get away with "number" on its own. (What do others
think here?)

>The "structure types" can be more complex:
  date
  url
  multipart name
  path
  FPI
  value/unit pair

What about more complex structures such as ratios (generalized fractions
with units as well as numbers), lists and arrays? Where do we stop?

>Furthermore, all types should allow fallback symbols, which
are tokens that can be used in their place to circumvent
normal primitive datatyping. For example
  <dsdl:integer token="#undefined" />
This copes with the main use case for XML Schemas
union types (which is a very nice feature, but it does
not really fit in anywhere.)

>So what positive use cases should it have? I suggest that the following
languages should be what guides us, because they are publishing related
and widespread, or in the family.

1) DOCBOOK and CALS tables
2) XSLT and XSL-FO
3) SVG
4) RELAX NG, Schematron, Topic Maps

But none of these have datatypes? What is your point here, Rick?

Martin Bryan

Received on Tue May 28 03:54:17 2002
