<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/css" href="is.css" ?>

<!-- In order to validate, replace all 
"this draft standard" with <this/> and replace all 
"This draft standard" with <This/>. 
	update RNC content model to allow para at start of terms and defs 
	-->
	<!--
   In this version:
    production fixes
    fix /@name() typo	
	  allow use in vocabs
	-->

<document  >

<head> 
<organization>ISO/IEC</organization>
<document-type>International Standard</document-type>
<stage>enquiry</stage>
<secretariat>ANSI</secretariat>
<tc-number>1</tc-number>
<tc-name>Information Technology</tc-name>
<sc-number>34</sc-number>
<sc-name>Document Description and Processing Languages</sc-name>
<serial-number>320</serial-number>
<wg-number>1</wg-number>
<document-number>19757</document-number>
<part-number>7</part-number>
<document-language>E</document-language>
<title>
  <main>Document Schema Definition Languages (DSDL)</main>
  <complementary>Character repertoire validation</complementary>
</title>
<date>2004-04-14</date>
</head>
<!-- ==========================================================-->

<foreword>

<part-list>
<part><number>1</number><title>Interoperability framework</title></part>
<part><number>2</number><title>Grammar-based validation &#x2014; RELAX NG</title></part>
<part><number>3</number><title>Rule-based validation &#x2014; Schematron</title></part>
<part><number>4</number><title>Selection of validation candidates</title></part>
<part><number>5</number><title>Datatypes</title></part>
<part><number>6</number><title>Path-based integrity constraints</title></part>
<part><number>7</number><title>Character repertoire validation</title></part>
<part><number>8</number><title>Declarative document manipulation</title></part>
<part><number>9</number><title>Datatype- and namespace-aware DTDs</title></part>
<part><number>10</number><title>Fill in</title></part>
</part-list>

</foreword>
<!-- ==========================================================-->

<introduction>

<p>The structure of <this>this committee-draft standard</this> is as follows. </p>
<ul>
<li><p><xref to="syntax"/>
specifies the schema language as a  Part 3 (Schematron) schema language.
</p>
</li>
<li><p ><xref to="binding"/>
specifies the schema language as a particular query language binding of ISO Schematron.
</p></li>
<li ><p ><xref to="conformance"/>
describes conformance requirements for implementations of character repertoire validators.
</p>
</li><li ><p>Non-normative annexes provide motivating use-cases and examples.
</p>
</li></ul>


</introduction>
<!-- ==========================================================-->

<scope>

<p><This>This committee-draft standard</This> specifies a schema language for declaring and validating the allowed character repertoire
in data content, attribute values, identifiers and other markup in XML documents. The language is specified as a particular 
query language binding of Part 3 (Schematron).</p>

<p>The schema language uses <xref to="xslt-rec"/> path expressions to specify contexts for assertions
and <xref to="xsd-rec"/> character classes to specify repertoires.</p>

<p><This>This committee-draft standard</This> establishes requirements for schemas and specifies
when an XML document matches the patterns specified by the
schema.</p>

</scope>
<!-- ==========================================================-->

<normative-references>
 

<p>Each of the following documents has a unique identifier that
is used to cite the document in the text.  The unique identifier
consists of the part of the reference up to the first
comma.</p>

<referenced-document id="xpath-rec">
<abbrev>XPath</abbrev>
<title>XML Path Language (XPath) Version 1.0 </title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/1999/REC-xpath-19991116</url>
</referenced-document>

<referenced-document id="xslt-rec">
<abbrev>XSLT</abbrev>
<title>XSL Transformations (XSLT) Version 1.0</title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/1999/REC-xslt-19991116</url>
</referenced-document>



<referenced-document id="xsd-rec">
<abbrev>XSD</abbrev>
<title>XML Schema Part 2: Datatypes </title>
<field>W3C Recommendation</field>
<url>http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#nt-charGroup</url>
</referenced-document>

 

</normative-references>
<!-- ==========================================================-->

<terms-and-definitions>


<p>The definitions of Part 1, Part 2 and Part 3
also apply to <this>this committee-draft standard</this>.</p>

 

<term-and-definition>

<term>character class</term>

<definition>A grouping of characters, especially into named groups according to some property of the
characters.</definition>

</term-and-definition>

<term-and-definition>
<term>character repertoire</term>

<definition>The characters which may validly be used in some text node.</definition>

</term-and-definition>
 
<term-and-definition>
<term>regular expression</term>
<definition>An artificial language or group of dialects for expressing patterns in sequences of characters.</definition>
</term-and-definition>
 
</terms-and-definitions>

<!-- ==========================================================-->

<clause id="syntax"> 
<title>Syntax</title>
 
<p>A  schema conforming to <this>this committee-draft standard</this> shall be a correct 
Schematron schema according to Part 3 (Schematron).</p>
  
<p>If the value of the <code>language</code> attribute of the Schematron <code>schema</code> element
is  <code>xslt-charrep</code>, in any mix of upper and lower case letters, then the
default query language binding of <this>this committee-draft standard</this> shall be used.</p>

 <note><p><This>This committee-draft standard</This> reserves the following query language names 
without further definition. Implementations which use 
different query language bindings shall use 
one of these names if appropriate:

<code>stx-charrep</code>,
<code>xslt-charrep</code>,
<code>xslt1.1-charrep</code>,
<code>exslt-charrep</code>,
<code>xslt2-charrep</code>,
<code>xpath-charrep</code>,
<code>xpath2-charrep</code>,
<code>xquery-charrep</code>.</p>

</note>

 <p>The order in which nodes and characters are validated is not specified by 
 <this>this committee-draft standard</this>.</p>
 
</clause>
<!-- ==========================================================-->

<clause id="binding"> 
<title>Default Query Language Binding</title>
 
<p>The value of the <code>language</code> attribute of the Schematron <code>schema</code> element
shall be  <code>xslt-charrep</code>, in any mix of upper and lower case letters.</p>


<p>If the value of the <code>language</code> attribute of the Schematron <code>schema</code> element
is <code>xslt-charrep</code>, in any mix of upper and lower case letters, then
the following binding shall be used:</p>
<ul> 
<li><p>The rule context shall be interpreted according to production
1 of <xref to="xslt-rec"/>, as returning any kind of <xref to="xpath-rec"/> node
when applied using the semantics of Part 3 (Schematron).</p></li>
<li><p>The assertion test shall be interpreted according to production
13 of <xref to="xsd-rec"/>, returning a boolean.
Each string shall be tested on a character-by-character basis.
Newlines and tabs shall be removed from the query
before it is used, so that long queries can be broken across lines; consequently these characters must
be entered using the escaped forms
<code>\n</code>, <code>\t</code> and <code>\r</code>, or the character classes.
It shall not be an error if the same character or class is entered more than once.
</p></li>
<li><p>The name query shall be interpreted according to
production 14 of <xref to="xpath-rec"/>, as returning a string value.
</p></li>
<li><p>The value-of query shall be interpreted according to production
14 of <xref to="xpath-rec"/>, as returning a string value.</p></li>
<li><p>The let element shall not be used. </p></li>
<li><p>Abstract patterns shall not be used.</p></li>
</ul>
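<p>As an informal, non-normative sketch of the assertion-test item above, the test below is broken across
two lines for readability. Under this binding the line break is removed before the query is used, so the
effective query is a single character group that admits hiragana, katakana and, through the escaped forms,
newline, tab and carriage return. The element name <code>reading</code> is hypothetical. The continuation line
is deliberately not indented, because this sketch assumes that space characters, unlike newlines and tabs,
would remain part of the query.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:rule context="reading">
	<sch:assert test="\p{IsHiragana}\p{IsKatakana}
\n\t\r">
		A reading should contain only kana, newlines, tabs and carriage returns.
	</sch:assert>
</sch:rule>
]]></pre>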
<note><p>The nodes allowed as subjects for assertions in 
<this>this committee-draft standard</this> could be different from 
the nodes allowed by the default query language binding of
Part 3 (Schematron).</p></note>

 <p>The <xref to="xpath-rec"/> data model shall be used. 
For the purpose of <this>this committee-draft standard</this>
an <xref to="xpath-rec"/>  string is an <xref to="xsd-rec"/> string.
For each node type, the strings to be tested are as follows:</p>
<ul>
<li><p>For an element node: the string-values of each text node of that element node,
but not the data content of any descendants.</p></li>

<li><p>For an attribute node: the string-value of the attribute, as would be accessible
through the <xref to="xpath-rec"/> <code>string()</code> function for that subject.</p></li>

<li><p>For a comment node: the string-value of the comment, as would be accessible
through the <xref to="xpath-rec"/> <code>string()</code> function for that subject.</p></li>

<li><p>For a processing instruction node: the string-value of the processing instruction, as would be accessible
through the <xref to="xpath-rec"/> <code>string()</code> function for that subject.</p></li>

<li><p>For name, target and namespace nodes: the name as a string.</p></li>

<li><p>For any other nodes, including the document root node: error.</p></li>
</ul>
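<p>As a non-normative illustration of the list above, consider the following instance fragment,
whose element and attribute names are hypothetical. A rule whose context matches only the
<code>para</code> element tests the strings <code>Een </code> and <code> test</code>, but not
<code>kleine</code>, which is data content of a descendant. A rule whose context matches the
<code>lang</code> attribute tests the string <code>nl</code>, and a rule whose context matches
the comment tests the string <code> opmerking </code>.</p>
<pre xml:space="preserve" ><![CDATA[
<para lang="nl">Een <em>kleine</em> test<!-- opmerking --></para>
]]></pre>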
 
 <p>The <xref to="xslt-rec"/> <code>key</code> element shall not be used.</p>
  
</clause>

<!-- ==========================================================-->

<clause id="conformance">
<title>Conformance</title>

<clause id="simple-conformance">
<title>Simple Conformance</title>
<p>A simple-conformance implementation has the same requirements
as a simple-conformance implementation of Part 3 (Schematron).
</p>
</clause>

<clause id="full-conformance">
<title>Full Conformance</title>

<p>A full-conformance implementation has the same constraints
as a full-conformance implementation of Part 3 (Schematron)
and the following additional requirements.
</p>

<ul>
<li><p>The schema has a language binding attribute
whose value ends with the string
<code>-charrep</code>.</p></li>
<li><p>A schema whose language binding attribute has the value
<code>xslt-charrep</code> conforms to the language binding in
<this>this committee-draft standard</this>.</p></li>
</ul>
</clause>
</clause>
<!-- ==========================================================-->

<annex normative = "false" >
<title>Use Cases</title>

<p>Motivating use-cases for <this>this committee-draft standard</this> include:</p>
<ol>
<li><p>ensuring that a Dutch document contains only characters used in typical
Dutch documents; the constraint applies to mixed content and element content;</p></li>
<li><p>checking that a document does not use any Latin combining characters;</p></li>
<li><p>declaring that data content in a Japanese document shall not contain 
<i>half-width katakana</i> characters;</p></li>
<li><p>providing information to alert publishing staff if the data content of a document
contains characters outside the Unicode Basic Multilingual Plane, 
surrogate characters, or Private Use Area characters;</p></li>
<li><p>verifying that the data content in a scientific document uses 
the Unicode character for the micro symbol, not the Greek small letter mu;</p></li>
<li><p>restricting the generic identifiers, attribute values and contents of elements to the CP1252 character set;</p></li>
<li><p>verifying, for a school textbook, that the data content of Japanese <i>ruby</i> annotations does not
contain Han ideographs and that other data content of elements
contains only the restricted repertoire used for schools; because the repertoire is
large, it must be declared in an external library and referenced; and </p></li>
<li><p>verifying that an attribute value giving a person's name  in a Chinese document
only uses approved characters.</p></li>

</ol>

<p>Motivating use-cases for the schema language do not include:</p>
<ul>
<li><p>constraints on parts of a string, such as that an attribute should start
with a certain character;</p></li>
<li ><p>semantic constraints requiring analysis of the particular string, such as
that an attribute may contain letters or numbers but not both;</p></li>
<li><p>repertoire constraints between different strings in the document, 
such as that an element can only use the character repertoire as used in
some other part of the document; and</p></li>
<li><p>constraints involving arithmetic operations, such as that the sum of
all code values in the string should not exceed 300.</p></li>
</ul>


<p>As well, certain kinds of constraints are out-of-scope for <this>this committee-draft standard</this>:</p>
<ul>
<li><p>the maximum and
minimum length of strings;</p></li>
<li ><p>the character encoding (character set) used for an entity;</p></li>
<li ><p>the use of standard entities, numeric character references or literal characters; </p></li>
<li><p>that the characters of a Thai document are ordered correctly; and </p></li>
<li ><p>that the initial character of a portion of a string marked up with an entity or included
by some macro mechanism is not a combining character.</p></li>
</ul>
</annex>
<!-- ==========================================================-->

<annex normative="false">
<title>Example Schemas for Use-Cases</title>

<clause id="simpleExamples">
<title>Simple Examples</title>


<p>Ensuring that a Dutch document contains only characters used in typical
Dutch documents; the constraint applies to mixed content and element content</p>
<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*[/*[@xml:lang='nl']]">
			<sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}
								&#x132;&#x133;\p{IsGeneralPunctuation}\p{IsCurrencySymbols}">
				If this document is a Dutch document, it should have only characters 
				used in typical Dutch publishing.
			</sch:assert>
		</sch:rule> 
]]></pre>
 

<p>Checking that a document does not use any Latin combining characters</p>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="^\p{IsCombiningDiacriticalMarks}">
				This document should not use any Latin combining characters.
			</sch:assert>
		</sch:rule> 
]]></pre>


<p>Declaring that data content in a Japanese document shall not contain 
<i>half-width katakana</i> characters</p>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="* | @*">
			<sch:assert test="^&#xFF65;-&#xFF9F;">
				Elements and attributes should not contain half-width katakana characters.
			</sch:assert>
		</sch:rule> 
]]></pre>


<p>Verifying, for a school textbook, that the data content of Japanese <i>ruby</i> annotations does not
contain Han ideographs</p>

<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="rt">
			<sch:assert test="^\p{IsCJKUnifiedIdeographs}">
				Ruby annotations should not contain Han ideographs. 
			</sch:assert> 
		</sch:rule> 
]]></pre> 


<p>Providing information to alert publishing staff if the data content of a document
contains characters outside the Unicode Basic Multilingual Plane, 
surrogate characters, or Private Use Area characters</p>
<pre><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="&#x9;-&#xFFFD;">This document should not 
					contain characters outside the Unicode Basic Multilingual Plane.
			</sch:assert>
			<sch:assert test="^&#xD800;-&#xDFFF;">This document should not 
					contain surrogate characters.
			</sch:assert>
			<sch:assert test="^\p{Co}">This document should not 
					contain Private Use Area characters.
			</sch:assert>
		</sch:rule> 
]]></pre>



<p>Verifying that the data content in a scientific document uses 
the Unicode character for micro symbol not the Greek small letter mu</p>
<pre xml:space="preserve" ><![CDATA[	 
		<sch:rule context="*">
			<sch:assert test="^&#x3BC;">
				The micro symbol should be used, not the Greek small letter mu.
			</sch:assert>
		</sch:rule> 
]]></pre>


<p>The preceding schema fragments should be placed in the following wrapper for them
to be correct schemas.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Examples of Use Cases</sch:title>
	<sch:pattern name="Example">
  ...
  </sch:pattern>
</sch:schema>
]]></pre>
</clause>

<clause id="complexExamples">
<title>Complex Examples</title>

<p>The following mechanisms allow schemas to be divided into separate sections, files or documents
to reduce redeclaration and for better management:</p>
<ul>
<li><p>Character class fragments can be invoked in tests using XML entity references;</p></li>
<li><p>Abstract rules; and</p></li >
<li><p>Abstract patterns.</p></li>
</ul>
<p>Examples of each of these approaches follow.</p>
<note><p>The following examples do not attempt to provide all possible schemas that express the constraints.</p></note>

<p>Restricting the generic identifiers, attribute values and contents of elements to characters in the CP 1252 character set
using an internal entity reference and a pattern with three rules.
</p>

<pre xml:space="preserve" ><![CDATA[	
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE sch:schema [
	<!ENTITY cp1252 
		"\p{IsBasicLatin}\p{IsLatin-1Supplement}
			&#x20AC; &#x201A; &#x0192; &#x201E;
			&#x2026; &#x2020; &#x2021; &#x02C6;
			&#x2030; &#x0160; &#x2039; &#x0152;
			&#x017D; &#x2018; &#x2019; &#x201C;
			&#x201D; &#x2022; &#x2013; &#x2014;
			&#x02DC; &#x2122; &#x0161; &#x203A;
			&#x0153; &#x017E; &#x0178;" >
]>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Character Repertoire Schema for CP1252</sch:title>
  
  <sch:pattern>
		<sch:rule context=" */name() "> 
			<sch:assert test="&cp1252;">
				Generic identifiers should be CP1252 repertoire.
			</sch:assert>
		</sch:rule> 
		<sch:rule context=" @* "> 
			<sch:assert test="&cp1252;">
				Attribute values should be CP1252 repertoire.
			</sch:assert>
		</sch:rule> 
		<sch:rule context=" * "> 
			<sch:assert test="&cp1252;">
				Data content should be CP1252 repertoire.
			</sch:assert>
		</sch:rule> 
	</sch:pattern>
</sch:schema>
	]]>
	</pre>

<p>Restricting the generic identifiers, attribute values and contents of elements to characters in the CP 1252 character set
using an abstract pattern. Three concrete instances of the pattern are then created: one for generic identifiers,
one for attribute values, and one for data content.
</p>

<pre xml:space="preserve" ><![CDATA[	
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Character Repertoire Schema for CP1252</sch:title>

	<sch:pattern abstract="true" name="cp1252-text">
		<sch:rule context=" $context "> 
			<sch:assert test=
				"\p{IsBasicLatin}\p{IsLatin-1Supplement}
				&#x20AC; &#x201A; &#x0192; &#x201E;
				&#x2026; &#x2020; &#x2021; &#x02C6;
				&#x2030; &#x0160; &#x2039; &#x0152;
				&#x017D; &#x2018; &#x2019; &#x201C;
				&#x201D; &#x2022; &#x2013; &#x2014;
				&#x02DC; &#x2122; &#x0161; &#x203A;
				&#x0153; &#x017E; &#x0178;">
				Text should be CP1252 repertoire.
			</sch:assert>
		</sch:rule> 
	</sch:pattern>	
	
	<sch:pattern is-a="cp1252-text" >
	  <sch:param name="context" value=" */name()  "/>
	</sch:pattern>
	
	<sch:pattern is-a="cp1252-text" >
	  <sch:param name="context" value=" @* "/>
	</sch:pattern>
	
	<sch:pattern is-a="cp1252-text" >
	  <sch:param name="context" value=" * "/>
	</sch:pattern>
	
</sch:schema>	
]]></pre>

<p>Restricting the generic identifiers, attribute values and contents of elements to characters in the CP 1252 character set
using an abstract rule in an external document.
First, an external XML document is created at URL <code>http://www.eg.com/cp1252.sch</code> containing
an abstract rule.

</p>

<pre xml:space="preserve" ><![CDATA[	
		<sch:rule abstract="true" name="cp1252" xmlns:sch="http://www.ascc.net/xml/schematron"> 
			<sch:assert test=
			"\p{IsBasicLatin}\p{IsLatin-1Supplement}
			&#x20AC; &#x201A; &#x0192; &#x201E;
			&#x2026; &#x2020; &#x2021; &#x02C6;
			&#x2030; &#x0160; &#x2039; &#x0152;
			&#x017D; &#x2018; &#x2019; &#x201C;
			&#x201D; &#x2022; &#x2013; &#x2014;
			&#x02DC; &#x2122; &#x0161; &#x203A;
			&#x0153; &#x017E; &#x0178;">
				Text should be CP1252 repertoire.
			</sch:assert>
		</sch:rule> 
]]>
</pre>

<p>Then this external abstract rule is included in the pattern.  In this example, the rules are combined into a single
rule.</p>

<pre xml:space="preserve" ><![CDATA[	

	<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Character Repertoire Schema for CP1252</sch:title>

  <sch:pattern>
    <sch:include src="http://www.eg.com/cp1252.sch" />
    
		<sch:rule context=" */name() | @* | *"> 
			<sch:extends rule="cp1252"/>
		</sch:rule> 
		
	</sch:pattern>
</sch:schema>

]]></pre>

<p>Other use-cases which may be solved by the same mechanisms are:</p>
<ul><li><p>verifying that other data content of elements should
contain only the restricted repertoire used for schools; because the repertoire is
large, it must be declared in an external library and referenced; and </p></li> 
<li><p>verifying that an attribute value giving a person's name  in a Chinese document
only uses approved characters.</p></li>
</ul>
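<p>As a sketch only, the last of these use-cases might be expressed as follows. The URL
<code>http://www.eg.com/approved-names.sch</code>, the rule name <code>approved-names</code> and the
<code>person/@name</code> context are hypothetical. The external document is assumed to contain an abstract
rule, in the same form as the CP1252 abstract rule above, whose assertion test lists the approved characters.</p>
<pre xml:space="preserve" ><![CDATA[
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" 	language="xslt-charrep" >
	<sch:title>Approved Characters for Personal Names</sch:title>

	<sch:pattern>
		<sch:include src="http://www.eg.com/approved-names.sch" />

		<sch:rule context=" person/@name "> 
			<sch:extends rule="approved-names"/>
		</sch:rule> 
	</sch:pattern>
</sch:schema>
]]></pre>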


</clause>

 
</annex>
</document>
