Minimal DSDL Datatype Specification

This document proposes a minimal set of datatypes for validating the character data strings that form text nodes within XML elements, the contents of CDATA sections, and attribute values within XML documents.

[Issue M1: Should datatypes be applicable to CDATA sections, or should these require the specification of a notation to process their contents?]

Before datatype validation of a document's contents can take place, the document must have been processed according to the rules of XML to a) convert the input to UTF-8 or UTF-16 format, applying any relevant rules for combining character sequences (such as Unicode Normalization Form C), and b) parse the markup to identify the text nodes and attribute values for which datatype validation needs to be performed. As the datatype validation rules are also expressed in XML, they must be processed in the same way. The parsing of datatype declarations must apply the same rules for combining character sequences, etc., as are applied to the document instance.
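As a rough illustration of why both sides need the same treatment, the following Python sketch normalizes an instance string and a declared value to Normalization Form C before comparing them. The helper name normalize_for_validation is hypothetical, and NFC is used only because it is the example form named above.

```python
import unicodedata

def normalize_for_validation(text: str) -> str:
    """Apply Unicode Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)

# "e" followed by a combining acute accent (two code points)...
decomposed = "caf\u0065\u0301"
# ...and the same word using the single precomposed character.
precomposed = "caf\u00e9"

# The raw strings differ, but they compare equal once both are normalized.
raw_equal = decomposed == precomposed
normalized_equal = (
    normalize_for_validation(decomposed) == normalize_for_validation(precomposed)
)
```

If only the document instance were normalized, a value written in decomposed form in the datatype declarations would fail to match its precomposed equivalent in the instance.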

Each DSDL datatype definition is assigned an identifier that is unique within the datatype definitions file. The type of a particular element in a DTD or schema is determined by assigning a dsdl-5:type attribute to the element (normally as a default value within the DTD or schema). The value of this attribute must be the identifier of one of the datatypes defined in the datatype definitions file used to validate the document instance, which is associated with the DTD/schema by a dsdl-5:datatypeLibrary attribute.
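The lookup described above might be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the namespace URI, the element names, the datatype identifiers and the library file name are all hypothetical, and the attributes are written directly on the instance rather than supplied as DTD or schema defaults.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; this proposal does not fix one.
DSDL5_NS = "urn:example:dsdl-5"

# Hypothetical instance with the attributes given explicitly.
INSTANCE = f"""
<order xmlns:dsdl-5="{DSDL5_NS}"
       dsdl-5:datatypeLibrary="order-types.xml">
  <quantity dsdl-5:type="positiveCount">12</quantity>
  <shipped dsdl-5:type="flag">true</shipped>
</order>
""".strip()

def declared_types(xml_text):
    """Map each element name to the datatype identifier assigned to it."""
    root = ET.fromstring(xml_text)
    key = f"{{{DSDL5_NS}}}type"
    return {el.tag: el.get(key) for el in root.iter() if el.get(key) is not None}

# The library reference is read from the root element in this sketch.
library = ET.fromstring(INSTANCE).get(f"{{{DSDL5_NS}}}datatypeLibrary")
types = declared_types(INSTANCE)
```

A validator would then resolve each identifier in types against the definitions found in the referenced library.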

[Issue M2: How do we relate the Part 5 properties to the <data datatypeLibrary="u" type="ln"> construct already defined in Part 2? (Part 2 should not be constraining other parts. There should be a common way of identifying datatypes within datatype libraries within the DSDL namespace. The question arises whether there is a single DSDL namespace, or a set of them? If the latter then Part 5 needs its own one, and Part 2 needs to refer to that.)]

1. Primitive Datatypes

Note: Terminals in italic are defined in the W3C XML 1.0 specification.

dsdl-5:datatype ::= dsdl-5:stringLiteral | dsdl-5:numberLiteral | dsdl-5:booleanLiteral | 
                  dsdl-5:calendarLiteral | dsdl-5:periodLiteral | dsdl-5:quanityLiteral |
                  dsdl-5:currency | dsdl-5:ratioLiteral | dsdl-5:resourceIdentifier |
                  dsdl-5:listLiteral

dsdl-5:stringLiteral ::= CharData

Note: Data that is not character data should be identified as having a specific notation. This includes data coded as hexadecimal or base64 binary data, or any other data whose meaning needs an interpretation other than as basic character data.

dsdl-5:numberLiteral ::= dsdl-5:integerLiteral | dsdl-5:decimalLiteral | dsdl-5:exponentLiteral

dsdl-5:integerLiteral ::= ['+'|'-']? [0-9]+

[Issue M3: Should leading zeros be allowed for integers?]

dsdl-5:decimalLiteral ::= ['+'|'-']? [0-9]+ ('.' [0-9]+)?

[Issue M4: Should the decimal point and at least one following number be compulsory for decimals?]

dsdl-5:exponentLiteral ::= ['+'|'-']? [0-9]+ ('.' [0-9]+)? ('e'|'E') ['+'|'-']? [0-9]+ 

[Issue M5: Should the literal pattern prevent numbers such as 0.0e0 from being defined?]
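The three number productions can be approximated with regular expressions, as in the Python sketch below. It assumes that ['+'|'-']? is meant as an optional single sign character, and the function name classify_number is hypothetical. Because the fraction is optional in dsdl-5:decimalLiteral as currently written, every integer also matches the decimal pattern, so the sketch reports the first production that matches.

```python
import re

SIGN = r"[+-]?"  # assumed reading of ['+'|'-']?
INTEGER = re.compile(SIGN + r"[0-9]+")
DECIMAL = re.compile(SIGN + r"[0-9]+(?:\.[0-9]+)?")
EXPONENT = re.compile(SIGN + r"[0-9]+(?:\.[0-9]+)?[eE][+-]?[0-9]+")

def classify_number(text):
    """Return the name of the first production the text fully matches, or None."""
    candidates = (("integer", INTEGER), ("decimal", DECIMAL), ("exponent", EXPONENT))
    for name, pattern in candidates:
        if pattern.fullmatch(text):
            return name
    return None

examples = {s: classify_number(s) for s in ["+42", "007", "3.14", "1.5e-3", "0.0e0", "1.", ".5"]}
```

Under these patterns leading zeros are accepted (Issue M3), 0.0e0 is accepted (Issue M5), and a trailing or leading decimal point on its own is rejected.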

dsdl-5:booleanLiteral ::= ('true' | '1') | ('false' | '0')

[Issue M6: Should T and F on their own be recognized as valid boolean literals? What about definitions for use with languages other than English?]
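A minimal Python sketch of the boolean production as it currently stands is a simple table lookup. The function name parse_boolean is hypothetical. Matching is case-sensitive and accepts only the four listed literals, so T and F are rejected.

```python
# The four literals allowed by the production, mapped to their values.
BOOLEAN_LITERALS = {"true": True, "1": True, "false": False, "0": False}

def parse_boolean(text):
    """Return the value of a dsdl-5:booleanLiteral, or raise ValueError."""
    try:
        return BOOLEAN_LITERALS[text]
    except KeyError:
        raise ValueError(f"not a dsdl-5:booleanLiteral: {text!r}") from None
```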

dsdl-5:calendarLiteral ::= CharData

Note: A compulsory dsdl-5:dtf attribute is used to constrain the data type format (dtf) of an element or attribute whose content conforms to the dsdl-5:calendarLiteral datatype. The following codes can be used within a dsdl-5:dtf pattern to identify the subcomponents of entered data:

CC     Century (Must immediately precede a YY string. May be preceded by a hyphen to indicate dates before the start of a calendar)
YY     Year (00-99)
M      Month (1-12)
MM     Month (01-12)
MMM    Month (as text string in language identified by compulsory xml:lang attribute for element)
D      Date (1-31)
DD     Date (01-31)
h      Hour (0-23)
hh     Hour (00-23)
mm     Minutes (00-59)
ss.sss Second and, optionally, fraction of a second (00-59.999) or
ss,sss Second and, optionally, fraction of a second (00-59,999)
Z      Coordinated Universal Time (UTC) to apply
[+|-]  Timezone offset to apply (followed by hh:mm, whose values must be in range 00:15 to 14:00)

For example, an ISO 8601 extended date/time could be defined as dsdl-5:dtf="CCYY-MM-DDThh:mm:ss[+|-]hh:mm".
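One way a processor might use a dsdl-5:dtf value is to translate it into a regular expression, as in the Python sketch below. Several points are assumptions rather than part of this proposal: the name dtf_to_regex is hypothetical, a bare ss is taken to mean seconds without a fraction, MMM accepts any run of letters instead of checking month names for the xml:lang in force, any character that is not part of a code is treated as a literal, and the 00:15 to 14:00 range of the timezone offset is not enforced.

```python
import re

# dtf code -> regular expression fragment (a sketch, not normative).
DTF_CODES = {
    "ss.sss": r"[0-5][0-9](?:\.[0-9]{1,3})?",
    "ss,sss": r"[0-5][0-9](?:,[0-9]{1,3})?",
    "[+|-]": r"[+-]",
    "MMM": r"[^\W\d_]+",  # any run of letters; month names are not checked
    "CC": r"[0-9]{2}",
    "YY": r"[0-9]{2}",
    "MM": r"(?:0[1-9]|1[0-2])",
    "DD": r"(?:0[1-9]|[12][0-9]|3[01])",
    "hh": r"(?:[01][0-9]|2[0-3])",
    "mm": r"[0-5][0-9]",
    "ss": r"[0-5][0-9]",  # assumption: seconds without a fraction
    "M": r"(?:1[0-2]|[1-9])",
    "D": r"(?:3[01]|[12][0-9]|[1-9])",
    "h": r"(?:2[0-3]|1[0-9]|[0-9])",
    "Z": r"Z",
}

def dtf_to_regex(dtf):
    """Translate a dtf pattern into a compiled regular expression."""
    # Try longer codes first so that "MMM" wins over "MM" and "M".
    codes = sorted(DTF_CODES, key=len, reverse=True)
    parts = []
    i = 0
    while i < len(dtf):
        for code in codes:
            if dtf.startswith(code, i):
                parts.append(DTF_CODES[code])
                i += len(code)
                break
        else:
            # Not a known code: treat the character as a literal.
            parts.append(re.escape(dtf[i]))
            i += 1
    return re.compile("".join(parts))

iso_pattern = dtf_to_regex("CCYY-MM-DDThh:mm:ss[+|-]hh:mm")
is_valid = iso_pattern.fullmatch("2002-05-23T14:30:00+01:00") is not None
```

The sketch checks only the shape of each field. It does not check that a day exists in the given month, and a pattern that needs a literal M, D, h or Z would require an escaping convention that this proposal does not yet define.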

[Issue M7: Should we allow the ISO 8601 Day in Year (2002-365) and Day in Week in Year (2002W156) formats to be defined as well?]

[Issue M8: Should we allow the fractional part of hours and minutes to be entered? Should we allow for commas as delimiters for fractional parts as well as periods, so that all ISO 8601 formats are allowed for? Do we need to allow for leap seconds that can add 60 as a valid number for seconds on specified dates in certain years?]

[Issue M9: Do we need to be able to recognize strings such as 23rd May 2002 as valid dates? Do other languages use qualified day numbers?]

dsdl-5:periodLiteral ::= '-'? 'P' [0-9]+ ('.' [0-9]+)? 'Y'
                                ([0-9]+ ('.' [0-9]+)? 'M'
                                 ([0-9]+ ('.' [0-9]+)? 'D'
                                  ('T' [0-9]+ ('.' [0-9]+)? 'H'
                                       ([0-9]+ ('.' [0-9]+)? 'M'
                                        ([0-9]+ ('.' [0-9]+)? 'S'
)?)?)?)?)?
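Read literally, the production above makes the year component compulsory and allows each later component only when every earlier one is present. The Python sketch below encodes that literal reading as a nested regular expression. The names NUM, PERIOD and is_period are hypothetical.

```python
import re

# A component value: digits with an optional fractional part.
NUM = r"[0-9]+(?:\.[0-9]+)?"

# Nested optional groups mirror the nesting of the production.
PERIOD = re.compile(
    rf"-?P{NUM}Y(?:{NUM}M(?:{NUM}D(?:T{NUM}H(?:{NUM}M(?:{NUM}S)?)?)?)?)?"
)

def is_period(text):
    """Return True if the text fully matches the periodLiteral production."""
    return PERIOD.fullmatch(text) is not None
```

With this reading P1Y, -P2Y6M and P1Y2M3DT4H5M6.5S are accepted, while P3M and P1YT4H are rejected because they omit a component that the nesting requires.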

dsdl-5:quanityLiteral ::= [0-9]+ ('.' [0-9]+)? S? dsdl-5:quantifier

dsdl-5:quantifier ::= [^0-9.] CharData
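The quantity and quantifier productions could be checked along the following lines in Python. This is a sketch under assumptions: S is taken to be XML white space, CharData is approximated as any run of characters other than < and & (the exclusion of ]]> is not modelled), and the name parse_quantity is hypothetical.

```python
import re

QUANTITY = re.compile(
    r"([0-9]+(?:\.[0-9]+)?)"  # numeric value
    r"[ \t\r\n]*"             # S? (optional XML white space)
    r"([^0-9.][^<&]*)"        # quantifier: a non-digit, non-period, then CharData
)

def parse_quantity(text):
    """Return (value, quantifier) if the text is a quantity literal, else None."""
    match = QUANTITY.fullmatch(text)
    return match.groups() if match else None
```

For example, parse_quantity("12.5 kg") separates the value from the unit, while a bare number such as "12.5" is rejected because the quantifier is compulsory.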

dsdl-5:currency ::= (dsdl-5:ISO4217currency, dsdl-5:decimal) | (dsdl-5:decimal, dsdl-5:currencyIndicator)

dsdl-5:ISO4217currency ::= [A-Za-z&dollar;&pound;&yen;&euro;&cent;]

[Issue M9: What other currency indicators does ISO 4217 recognize?]

dsdl-5:ratioLiteral ::= [dsdl-5:decimal '/' dsdl-5:decimal S dsdl-5:quantifier?]

dsdl-5:resourceIdentifier ::= [^&{}|^\`"<> ]

Note: The resource identifier must be a valid IETF resource identifier, as defined in IETF RFC 2396 and any documents that extend or replace it.

dsdl-5:listLiteral ::= dsdl-5:datatype (S dsdl-5:datatype)+
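A list literal needs at least two items separated by white space. The Python sketch below splits on XML white space and enforces that minimum. It assumes that individual items contain no white space of their own, which the grammar does not guarantee since dsdl-5:stringLiteral is CharData. The name split_list is hypothetical.

```python
import re

XML_WHITESPACE = " \t\r\n"

def split_list(text):
    """Split a list literal into its items, requiring at least two."""
    items = re.split(r"[ \t\r\n]+", text.strip(XML_WHITESPACE))
    if len(items) < 2:
        raise ValueError("a dsdl-5:listLiteral needs at least two items")
    return items
```

Each returned item would then be validated against its own datatype, or against a single shared datatype if Issue M10 is resolved that way.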

[Issue M10: Should all items in a list be confined to a single datatype?]

2. Constraining Properties

The following properties can be used to constrain datatypes that are derived from dsdl-5:stringLiterals:

and either

or

The following properties can be used to constrain datatypes that are derived from dsdl-5:numberLiterals:

The following additional properties can be used to constrain dsdl-5:decimals:

The following additional properties can be used to constrain dsdl-5:integers:

[Issue M11: Do we need any additional constraints for dates/times/durations?]

3. Defining Customized Datatypes

Where users require application-specific datatypes ....