From: "Martin Bryan" <mtbryan@sgml.u-net.com>
> Should our datatypes be "canonical datatypes" - i.e. types that can safely
> be used for checking between transformations, rather than "raw datatypes" of
> the type that can be used to control transformation.?
Yes, that is what I am suggesting: that the primitives can (only) be simple if
we come to grips with multiple lexical spaces. Indeed, coming to
grips with multiple lexical spaces may allow a minimally small set
of primitives, closer to the LISP/XSLT kind of set.
> The XML type for data is simply "string" - it has
> no lexical form at all. What DSDL datatypes are all about is the application
> of a "notation" to strings that occur in particular contexts.
A Unicode string is certainly a notation: there are code points that may be unavailable,
there are rules for combining character sequence orders, there are guidelines
for characters that are unsuited for markup (e.g. control characters)!
> >For DSDL, our datatypes should be available in DTDs as NOTATIONs.
>
> This I cannot agree with, if I read what you say correctly. Datatypes are
> constraints on strings but they should not be declared as notations, which
> were specifically designed to identify data that was not to be interpreted
> as a string. The other major problem I have is that we need to associate
> datatypes with attribute values, and particularly with lists of permitted
> values for both elements and attributes. I don't see how we can do this with
> the SGML/XML Notation specification as it exists today.
> But we would need to specify a completely new way of specifying notations,
> which would confuse the hell out of the community.
SGML is a notation. XML is a notation. The community has no idea what
a notation is at the moment, and we would do better to rescue this useful
term rather than overpopulate the world with more words for the same
thing.
> As you point out below, measurements are a key factor that needs to be
> allowed for, as do ratios (either of the lbs per sq ft type or of the more
> complicated $15 per cubic litre type). But this is a really thorny issue,
> and one that no standardization body has yet tackled successfully :-(
There are two issues. One is to convert datatypes to a canonical form:
the most that is practical for numbers is to multiply them (by some rational
number) and change their unit, and check their range. Anything more
complicated than that is not an issue of simple datatyping but of complex
datatyping and should be left to a subsequent validator (such as
Schematron or perhaps some built-in validators).
So if the user declares that there is a number which is the canonical form
<dsdl:number unit="dollars_per_cubic_litre">15</dsdl:number>
we should support
<dsdl:number unit="pounds_per_cubic_ft">26</dsdl:number>
as a simple multiplication, but we should not provide equations, as if this were
a spreadsheet.
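
The multiply, change unit, and range-check step described above can be sketched roughly as follows (Python). The table name FACTORS, the function to_canonical, and the factor 15/26 are assumptions made up for illustration; the factor is chosen only so that the 26 -> 15 example works out, and is not a real conversion rate.

```python
from fractions import Fraction

# Hypothetical conversion table: (source unit, canonical unit) -> rational factor.
# The factor is a placeholder chosen to match the 26 -> 15 example in the text.
FACTORS = {
    ("pounds_per_cubic_ft", "dollars_per_cubic_litre"): Fraction(15, 26),
}

def to_canonical(value, unit, canonical_unit, lo=None, hi=None):
    """Multiply by a rational factor, change the unit, then check the range."""
    if unit != canonical_unit:
        value = value * FACTORS[(unit, canonical_unit)]
    if lo is not None and value < lo:
        raise ValueError("value below permitted range")
    if hi is not None and value > hi:
        raise ValueError("value above permitted range")
    return value, canonical_unit
```

Anything beyond a single rational factor (equations, lookups that depend on other fields) is deliberately absent here and would fall to a later validator.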
> It's not just validation that is the problem, you may also have a
> transformation problem as well. You may need to automatically convert
> figures in lbs per sq ft to some standardized measurement such as bar for it
> to be usable on output. Yet the source format has to be retained as that is
> needed for legal validity of the data.
I don't think the primary purpose of DSDL is to transform text to be
usable for output. Of course, our transformations could be useful for that,
but that should be a small side-effect, not something we need to consider
in depth.
> While I agree that all the structure should not need to be in angle
> brackets, I think there is a need to distinguish between the primitives of a
> terse format, e.g. day/month/year and the terse structure of the input,
> 31/12/02. DSDL needs to define both a) the primitives and b) the rules that
> identify valid structures that can be constructed from those primitives
> within a particular document type (e.g. 20040302 is validatable as it is
> unambiguous, whereas 02/03/04 is ambiguous as it has three possible
> interpretations if no clear rules are specified.).
Well, I am not sure that 20040302 is any less ambiguous or more primitive than
02/03/04. The only way to have an unambiguous format is to label the data
with the format, validate the data, and make sure the format is itself unambiguous.
But on your point that we need primitives and the rules, I think that is what
I am proposing. You are using "identification" of fragments, and I am
proposing this is expressed in terms of "extraction and transformation",
but that is just the formalism which accords with the general DSDL
framework, and just achieves that "identification", so I don't think we
are really disagreeing that much.
> We should recommend that
> data storage should always be in an unambiguous format, even if input is
> allowed in an ambiguous format providing the rules are clearly stated in the
> DTD.
And I completely disagree that it is desirable that people should be encouraged
away from idiomatic data. Americans should use American dates;
Moslems should use the Islamic calendar, if they want to. The choice should
be because of the specific interoperability requirements of the system and
people they are involved with, not a technical fait accompli.
I do not think our model is
database->XML->validation
where the lexical form can be tweaked as part of generating the XML. In that
kind of case the data has been validated on input into the DBMS and we don't
really need to concern ourselves with it. Instead, I hope our basic model is
just the vanilla
XML_on_fileSystem->validation
> Does this mean that you think datatypes should not have anything to say
> about lists? The problem is that we need to know when a string represents a
> single value and when it identifies a set of values. How do I do this
> without having some form of tokenizer?
Yes, a primitive datatype is not a list. At most I would have fraction+unitName
as atoms. Yes, we do need some kind of tokenizer, which is that second
stage I mentioned that transforms into the primitive datatypes. Otherwise
you end up with "primitive" types that are not primitive at all (such as
XML Schemas) and a system that is difficult to extend.
As for the tokenizer details, I would just have the minimum needed to
cope with tokens separated by whitespace, by fixed delimiter, and
by Unicode property. So "15cm" can be tokenized by 15 being digits
and cm being letters. I don't see there is a need for regex-based
tokenizing: that is overkill as far as I can see.
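
A minimal sketch of those three tokenizing modes, in Python. The function names are hypothetical, and "Unicode property" is approximated here by the first letter of the Unicode general category (N for numbers, L for letters, P for punctuation), which is an assumption about how coarse the property test would be.

```python
import unicodedata

def split_whitespace(text):
    """Tokens separated by whitespace."""
    return text.split()

def split_delimiter(text, delimiter):
    """Tokens separated by a fixed delimiter such as '/'."""
    return text.split(delimiter)

def split_by_property(text):
    """Start a new token whenever the major Unicode category changes.

    Whitespace ends the current token and is itself dropped.
    """
    tokens, current, current_class = [], "", None
    for ch in text:
        cls = None if ch.isspace() else unicodedata.category(ch)[0]
        if cls != current_class and current:
            tokens.append(current)
            current = ""
        if cls is not None:
            current += ch
        current_class = cls
    if current:
        tokens.append(current)
    return tokens
```

With this, split_by_property("15cm") yields the two tokens 15 and cm without any regex being involved.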
Actually, the XML Schemas WG had repeated requests to provide
COBOL-style "pictures". That is, to be able to give
YYYY-DD-MM
as the template for data values. Supporting this kind of specification
is something that we might consider too.
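
One way such a picture might be interpreted, as a rough Python sketch: each run of one repeated letter becomes a fixed-width digit field, and every other character must match literally. The function name picture_to_regex and the digits-only reading of letter runs are assumptions; real COBOL pictures are richer than this, and a picture that repeats the same letter in two separate runs would need different group naming.

```python
import re

def picture_to_regex(picture):
    """Compile a picture such as YYYY-DD-MM into a regex.

    A run of one letter becomes a named group of that many digits;
    any other character is matched literally.
    """
    parts = []
    for match in re.finditer(r"([A-Za-z])\1*|.", picture):
        run = match.group(0)
        if run[0].isalpha():
            parts.append("(?P<%s>\\d{%d})" % (run[0], len(run)))
        else:
            parts.append(re.escape(run))
    return re.compile("".join(parts))

# Usage: extract the fields of a value written to the template.
fields = picture_to_regex("YYYY-DD-MM").fullmatch("2002-28-05").groupdict()
```

The extracted fields could then be handed on as separate numbers, in the same way as the tokenizer output.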
> >So, too complex, too globalized, too biased, too hazy, too weak. The
> >usual complaints :-)
>
> Is this of my proposed set, or of the W3C schema datatypes?
The W3C set!
> Where do you draw the line between "complex structures" and "token lists".
> Is "8 picas" a complex structure, or a token list? Is 31/12/02 a complex
> structure or a primitive datatype? Is "3.5" a fixed point decimal or a
> floating point mantissa?
My proposal is that for numbers we need to support only rational numbers (fractions)
(http://mathworld.wolfram.com/RationalNumber.html)
with a unit indicator, and build everything else on top of that. So
<dsdl:number unit="cm">86/85</dsdl:number>
e.g. with a lexical space conforming to the regex
-?\d+\s*(\/\s*\d+)?
So your examples are simple, complex and simple. The datatype
canonicalizer (a transformer) passes the following to the simple datatype
validator. In the case of the date, a complex datatype validator
for dates can run also.
<dsdl:number unit="picas">8</dsdl:number>
<dsdl:bag name="date">
<dsdl:number unit="day">31</dsdl:number>
<dsdl:number unit="month">12</dsdl:number>
<dsdl:number unit="year">02</dsdl:number>
</dsdl:bag>
<dsdl:number>35/10</dsdl:number>
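
As an illustration only, a canonicalizer producing the three results above might look like this in Python. The helper names are hypothetical, the day/month/year field order for the date is assumed rather than derived from any declaration, and the dsdl: prefix is used as a literal element name without a namespace binding.

```python
import xml.etree.ElementTree as ET

def number(value, unit=None):
    """Build a dsdl:number element, with an optional unit attribute."""
    elem = ET.Element("dsdl:number")
    if unit is not None:
        elem.set("unit", unit)
    elem.text = str(value)
    return elem

def canonical_measure(text):
    """'8 picas' becomes a single number with a unit."""
    value, unit = text.split()
    return number(value, unit)

def canonical_date(text):
    """'31/12/02' becomes a bag of three numbers (day/month/year order assumed)."""
    bag = ET.Element("dsdl:bag", {"name": "date"})
    for unit, value in zip(("day", "month", "year"), text.split("/")):
        bag.append(number(value, unit))
    return bag

def canonical_decimal(text):
    """'3.5' becomes the unreduced fraction 35/10 with no unit."""
    whole, _, frac = text.partition(".")
    return number("%s/%d" % (whole + frac, 10 ** len(frac)))
```

Serializing any of these with ET.tostring(elem, encoding="unicode") gives XML text of the shape shown above, though an implementation would more likely pass the structures along directly.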
> >So the datatype mechanism should be in two parts:
> >
> > 1) Primitive datatype validation
> > A Validator. Validate an element in the dsdl: namespace whose
> > element name specifies the primitive type in the
> > single canonical form. Such as <dsdl:integer>15</dsdl:integer>
>
> I don't like the idea of having to put everything you want validated into an
> element with a specific namespace. As I said above, datatypes are about the
> validation of the strings that remain after parsing has taken place, not the
> validation of the markup.
The transformation I am talking about is not in-place substitution. So
<x>18 cm</x>
does not become
<x><dsdl:number unit="cm">18</dsdl:number></x>
Instead,
<dsdl:number unit="cm">18</dsdl:number>
is sent off to a validator and validated against the values
expected for an <x> element.
And, of course, this is only the notional operation. An actual
efficient implementation would not be generating and reparsing it all
as XML text, but passing it along as a data structure.
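
A small Python sketch of that notional flow, using a data structure rather than XML text. The EXPECTED table and its range for <x> are invented for illustration; nothing in the source document is modified.

```python
from fractions import Fraction

# Hypothetical expectations for the content of <x>: a length in cm from 0 to 100.
EXPECTED = {"x": {"unit": "cm", "min": Fraction(0), "max": Fraction(100)}}

def canonicalize(text):
    """'18 cm' -> (Fraction(18), 'cm'); the original string is left untouched."""
    value, unit = text.split()
    return Fraction(value), unit

def validate(element_name, text):
    """Check the canonical value against what is expected for this element."""
    value, unit = canonicalize(text)
    rule = EXPECTED[element_name]
    return unit == rule["unit"] and rule["min"] <= value <= rule["max"]
```

Here validate("x", "18 cm") passes, while the document still contains only <x>18 cm</x>.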
> (What do the rest of
> you think: should the datatype specification concern itself with data
> transformation both to and from a canonicalized form?)
The purpose of the canonical form is to allow comparison, not storage.
> >Under this specification, "date" is not a primitive type. It is a
> >"structured datatype".
>
> Glad you agree with me on something :-)
Oh, I better change my mind quick!
>> The primitive types should be very few, and geared around
>> the operations to be performed on them for comparison:
>> number
>> boolean
>> string
>> binary
> Not sure that we can get away with "number" on its own. (What do others
> think here?)
Oh, this is a definite area where there can be lots of opinions and no right
answer, and I am not claiming expertise here, just flying the minimalist kite.
I tend to wish we could just support rational numbers (fractions)
(http://mathworld.wolfram.com/RationalNumber.html)
with a unit indicator, and build everything else on top of that. So
<dsdl:number unit="cm">86/85</dsdl:number>
e.g. with a lexical space conforming to the regex
-?\d+\s*(\/\s*\d+)?
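
To make the lexical space concrete, here is a hedged Python sketch that accepts exactly strings matching that regex and maps them to a rational value. The function name parse_number is an assumption, and the handling of a zero denominator (Python raises ZeroDivisionError) is a choice the regex itself does not settle.

```python
import re
from fractions import Fraction

# Same lexical space as the regex in the text, with capturing groups for
# the numerator and the optional denominator.
LEXICAL = re.compile(r"(-?\d+)\s*(?:/\s*(\d+))?")

def parse_number(text):
    """Map a string in the lexical space to its rational value."""
    m = LEXICAL.fullmatch(text)
    if m is None:
        raise ValueError("not in the lexical space: %r" % text)
    numerator, denominator = m.group(1), m.group(2) or "1"
    return Fraction(int(numerator), int(denominator))
```

Because Fraction normalizes its value, parse_number("35/10") and parse_number("7 / 2") compare equal, which is the kind of comparison a canonical form is meant to allow.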
Another question is whether a symbol is indeed just a string.
>>The "structure types" can be more complex:
>> date
>> url
>> multipart name
>> path
>> FPI
>> value/unit pair
>
> What about more complex structures such as ratios (generalized fractions
> with units as well as numbers), lists and arrays? Where do we stop?
We look at where people stop when they make documents. If people use
arrays of floating point numbers with units as values of attributes even
despite being unable to validate these, it is a pretty strong indication that
it is an idiom we should attempt to support. (Indeed, this was one thing
that Chris Lilley of W3C mentioned to me: that arrays and units were
important for SVG and it would be a really useful thing to support.)
This is why I think having use-cases based on typical publishing formats
is critical: XML Schemas only went to use-cases relating to publishing
(e.g. XHTML only) at Last Call, and only tacked on a couple of features
grudgingly.
I guess part of it comes down to whether we want to support validation
of the kind of values that people enter into forms, and the kind of text that
are found in marked-up documents for publishing uses. I have no requirement
that we support the former (you may, legitimately) but definitely have
a requirement that the latter is supported. XML Schemas is based on
having "other levels" provide the mediation to canonical types: forms
entry programs and middleware; that this is not workable for publishing
is why we are here.
> >So what positive use cases should it have? I suggest that the following
> >languages should be what guides us, because they are publishing related
> >and widespread, or in the family.
> >
> > 1) DOCBOOK and CALS tables
> > 2) XSLT and XSL-FO
> > 3) SVG
> > 4) RELAX NG, Schematron, Topic Maps
>
> But none of these have datatypes? What is your point here Rick?
Of course they cannot have formal datatype declarations at the moment,
there is no schema language that can cope with their requirements
currently!
We can look at them to see what datatypes are needed, rather than
basing it on theory, which may just lead us inexorably down either
the road of useless minimalism or of bloated maximalism, depending
on the particular vices of the personalities involved, on a humble balance
of probabilities.
Cheers
Rick Jelliffe
Received on Tue May 28 05:49:41 2002