<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/css" href="is.css" ?>

<!-- In order to validate, replace all 
"this draft standard" with <this/> and replace all 
"This draft standard" with <This/>. 
-->

<document  >

<head> 
<organization>ISO/IEC</organization>
<document-type>International Standard</document-type>
<stage>enquiry</stage>
<secretariat>ANSI</secretariat>
<tc-number>1</tc-number>
<tc-name>Information Technology</tc-name>
<sc-number>34</sc-number>
<sc-name>Document Description and Processing Languages</sc-name>
<serial-number>320</serial-number>
<wg-number>1</wg-number>
<document-number>19757</document-number>
<part-number>7</part-number>
<document-language>E</document-language>
<title>
  <main>Document Schema Definition Languages (DSDL)</main>
  <complementary>Character repertoire validation</complementary>
</title>
<date>2004-04-01</date>
</head>

<foreword>

<part-list>
<part><number>0</number><title>Overview</title></part>
<part><number>1</number><title>Interoperability framework</title></part>
<part><number>2</number><title>Grammar-based validation &#x2014; RELAX NG</title></part>
<part><number>3</number><title>Rule-based validation &#x2014; Schematron</title></part>
<part><number>4</number><title>Selection of validation candidates</title></part>
<part><number>5</number><title>Datatypes</title></part>
<part><number>6</number><title>Path-based integrity constraints</title></part>
<part><number>7</number><title>Character repertoire validation</title></part>
<part><number>8</number><title>Declarative document manipulation</title></part>
<part><number>9</number><title>Datatype- and namespace-aware DTDs</title></part>
</part-list>

</foreword>

<introduction>

<p>The structure of <this>this committee-draft standard</this> is as follows. 
<xref to="binding"/>
specifies the schema language as a particular query language binding of Part 3 &#x2014; Schematron. 
 <xref to="conformance"/>
describes conformance requirements for implementations of character repertoire validators.
Finally, non-normative annexes provide motivating use-cases and examples.
</p>
</introduction>

<scope>

<p><This>This committee-draft standard</This> specifies a schema language for declaring and validating the allowed character repertoire
in data content, attributes and markup in XML documents. The language is specified as a particular 
query language binding of Schematron.</p>

<p>The schema language uses XPaths to declare contexts for rules, and the character classes of 
a subset of a common regular expression language
 to declare repertoires. Text nodes that are children of the nodes that match
 the contexts shall conform to the repertoire.</p>

<p><This>This committee-draft standard</This> establishes requirements for schemas and specifies
when an XML document matches the patterns specified by the
schema.</p>

</scope>

<normative-references>

<p>The following referenced documents are indispensable for the
application of <this>this committee-draft standard</this>. For dated references, only the edition
cited applies. For undated references, the latest edition of the
referenced document (including any amendments) applies.</p>

<note><p>Each of the following documents has a unique identifier that
is used to cite the document in the text.  The unique identifier
consists of the part of the reference up to the first
comma.</p></note>

<note><p>The definitions of Part 1, Part 2 and Part 3
also apply to <this>this committee-draft standard</this>.</p></note>

 
<referenced-document id="xpath-rec">
<abbrev>XPath</abbrev>
<title>XML Path Language (XPath) Version 1.0 </title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/1999/REC-xpath-19991116</url>
</referenced-document>

<referenced-document id="xslt-rec">
<abbrev>XSLT</abbrev>
<title>XSL Transformations (XSLT) Version 1.0</title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/1999/REC-xslt-19991116</url>
</referenced-document>



<referenced-document id="xsd-rec">
<abbrev>XSD</abbrev>
<title>XML Schema Part 2: Datatypes </title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#nt-charGroup</url>
</referenced-document>

 

</normative-references>

<terms-and-definitions>

<term-and-definition>

<term>character class</term>

<definition>A grouping of characters, especially into named groups according to some property of the
characters.</definition>

</term-and-definition>

<term-and-definition>
<term>character repertoire</term>

<definition>The characters which may validly be used in some text node.</definition>

</term-and-definition>
 
<term-and-definition>
<term>regular expression</term>
<definition>An artificial language or group of dialects for expressing patterns in sequences of characters.</definition>
</term-and-definition>
 
</terms-and-definitions>

<clause id="binding"> 
<title>Schematron Query Language Binding</title>
 
 
<p>A  schema conforming to <this>this committee-draft standard</this> is a correct Schematron schema.
The value of the
<code>language</code> attribute, in any mix of upper and lower case letters, 
shall be <code>xslt-charrep</code>.</p>

<p>The schema language uses XPaths to declare contexts for rules, and the character classes 
of XSD regular expressions
 to declare repertoires. Text nodes that are children of the nodes that match
 the contexts shall conform to the repertoire.</p>
 

<note><p>The following other query language names are
reserved and recommended where appropriate, but not
defined by <this>this committee-draft standard</this>:

<code>stx-charrep</code>,
<code>xslt1.1-charrep</code>,
<code>exslt-charrep</code>,
<code>xslt2-charrep</code>,
<code>xpath-charrep</code>,
<code>xpath2-charrep</code>,
<code>xquery-charrep</code>.</p>

</note>

<p>The following binding shall be used:</p>
<ul> 
<li><p>The rule context is interpreted according to production
1 of XSLT <xref to="xslt-rec"/>, as returning a boolean value. If the node matched is not a text or name node, then each
of its child text nodes is used.</p></li>
<li><p>The assertion test is interpreted according to production
13 of XML Schema Part 2: Datatypes <xref to="xsd-rec"/>.
Literal whitespace should be removed from the query.</p></li>
<li><p>The name query is interpreted according to
production 14 of XPath <xref to="xpath-rec"/>, as returning a string value.
</p></li>
<li><p>The value-of query is interpreted according to production
14 of XPath, as returning a string value.</p></li>
<li><p>The let element shall not be used. </p></li>
<li><p>Abstract patterns shall not be used.</p></li>
</ul>
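<p>As an informal, non-normative illustration of the binding above, the following rule is a minimal sketch. 
The element name <code>title</code> and the choice of repertoire are assumptions made only for this example. 
The context is an XSLT pattern, the assertion test is a character group written without enclosing square brackets, 
and the empty <code>name</code> element is assumed to report the name of the matched node.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:rule context="title">
	<sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}">
		The text in <sch:name/> should use only Basic Latin and 
		Latin-1 Supplement characters.
	</sch:assert>
</sch:rule>
]]></pre>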
 
 <note><p>Parallel constraints may be expressed using
 separate assertions.
 </p></note>
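 <p>For instance, parallel constraints might be written as in the following sketch, where two assertions 
 in one rule constrain the same text nodes independently. The element name <code>para</code> and both 
 repertoires are hypothetical. The first assertion allows the Basic Latin and Greek blocks, and the second 
 excludes the Greek small letter mu from that repertoire.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:rule context="para">
	<sch:assert test="\p{IsBasicLatin}\p{IsGreek}">
		Paragraph text should use only Basic Latin and Greek characters.
	</sch:assert>
	<sch:assert test="^&#x3BC;">
		Paragraph text should not use the Greek small letter mu.
	</sch:assert>
</sch:rule>
]]></pre>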
 
 <p>The XSLT <code>key</code> element should not be used.</p>
 
 <p>The order in which nodes are validated is not specified by 
 <this>this committee-draft standard</this>.</p>
 
</clause>
<clause id="conformance">
<title>Conformance</title>

<p>A conforming  implementation shall be able to determine for
any XML document whether it is a correct schema.</p>
<ul>
<li><p>A correct schema conforms to the constraints
of the normative RELAX NG schema of Part 3.</p></li>
<li><p>A correct schema has a language binding attribute
whose value ends with the string
<code>-charrep</code>.</p></li>
<li><p>A correct schema with a language binding attribute with value
<code>xslt-charrep</code> conforms to the language binding in
<this>this committee-draft standard</this>.</p></li>
<li><p>A correct schema's attributes conform to the grammars
specified by the query language binding in use.</p></li>
</ul>
</clause>

<annex normative = "false" >
<title>Use Cases</title>


<p>Motivating use-cases for the schema language include:</p>
<ul>
<li><p>restricting the generic identifiers of elements to ASCII characters;</p></li>
<li><p>ensuring that a Dutch document contains characters only used in typical
Dutch documents; the constraint applies to mixed content and element content;</p></li>
<li><p>checking that a document does not use any Latin combining characters;</p></li>
<li><p>declaring that data content in a Japanese document shall not contain 
<i>half-width katakana</i> characters;</p></li>
<li><p>verifying, in a school textbook, that the data content of Japanese <i>ruby</i> annotations does not
contain Han ideographs and that other data content of elements
contains only the restricted <i>joho kanji</i> repertoire; </p></li>
<li><p>verifying that an attribute value giving a person's name  in a Chinese document
only uses approved characters;</p></li>
<li><p>providing information to alert publishing staff if the data content of a document
contains characters outside the Unicode Basic Multilingual Plane, 
surrogate characters, or Private Use Area characters; and</p></li>
<li><p>verifying that the data content in a scientific document uses 
the Unicode micro sign (U+00B5), not the Greek small letter mu (U+03BC).</p></li>
</ul>

<p>Motivating use-cases for the schema language do not include:</p>
<ul>
<li><p>constraints on parts of a string, such as that an attribute should start
with a certain character;</p></li>
<li ><p>semantic constraints requiring analysis of the particular string, such as
that an attribute may contain letters or numbers but not both;</p></li>
<li><p>repertoire constraints between different strings in the document, 
such as that an element can only use the character repertoire as used in
some other part of the document; and</p></li>
<li><p>constraints involving arithmetic operations, such as that the sum of
all code values in the string should not exceed 300.</p></li>
</ul>


<p>In addition, certain kinds of constraints are out of scope for <this>this committee-draft standard</this>:</p>
<ul>
<li><p>the maximum and
minimum length of strings;</p></li>
<li ><p>the character encoding (character set) used for an entity;</p></li>
<li ><p>the use of standard entities, numeric character references or literal characters; </p></li>
<li><p>that the characters of a Thai document are ordered correctly; and </p></li>
<li ><p>that the initial character of a portion of a string marked up with an entity or included
by some macro mechanism is not a combining character.</p></li>
</ul>
</annex>
<annex normative="false">
<title>Example Schemas for Use-Cases</title>
<p>The schema fragments in this annex should be placed in the following wrapper.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Examples of Use Cases</sch:title>
	<sch:pattern name="Example">
  ...
  </sch:pattern>
</sch:schema>
]]></pre>

<pre xml:space="preserve" ><![CDATA[	
		<sch:rule context="*/@name()"> 
			<sch:assert test="\p{IsBasicLatin}">
				Generic identifiers of elements should use only the ASCII repertoire.
			</sch:assert>
		</sch:rule> 
]]></pre>


<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*[/*[@xml:lang='nl']]">
			<sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}
								&#x132;&#x133;\p{IsGeneralPunctuation}\p{IsCurrencySymbols}">
				If this document is a Dutch document, it should have only characters 
				used in typical Dutch publishing.
			</sch:assert>
		</sch:rule> 
]]></pre>
 

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="^\p{IsCombiningDiacriticalMarks}">
				This document should not use any Latin combining characters.
			</sch:assert>
		</sch:rule> 
]]></pre>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="* | @*">
			<sch:assert test="^&#xFF65;-&#xFF9F;">
				Elements and attributes should not contain half-width katakana characters.
			</sch:assert>
		</sch:rule> 
]]></pre>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="rb">
			<sch:assert test="^\p{IsCJKUnifiedIdeographs}">
				Ruby annotations should not contain Han ideographs. 
			</sch:assert> 
		</sch:rule> 
]]></pre> 

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="&#x01;-&#xFFEF;">This document should not 
					contain characters outside the Unicode Basic Multilingual Plane.
			</sch:assert>
			<sch:assert test="^&#xD800;-&#xDFFF;">This document should not 
					contain surrogate characters.
			</sch:assert>
			<sch:assert test="^\p{Co}">This document should not 
					contain Private Use Area characters.
			</sch:assert>
		</sch:rule> 
]]></pre>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="^&#x3BC;">
				The micro symbol should be used, not the Greek small letter mu.
			</sch:assert>
		</sch:rule> 
]]></pre>
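
<p>The use-case of approved characters in a person's name might be sketched as follows. 
The element and attribute names are hypothetical, and the CJK Unified Ideographs block merely 
stands in for whatever repertoire is actually approved.</p>
<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="person/@name">
			<sch:assert test="\p{IsCJKUnifiedIdeographs}">
				A person's name should use only approved characters.
			</sch:assert>
		</sch:rule> 
]]></pre>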

 
</annex>
</document>
