[dsdl-discuss] Re: A revised draft for Part 7

From: Keld Jørn Simonsen <keld@rap.rap.dk>
Date: Sat Nov 24 2007 - 10:19:07 UTC

On Wed, Nov 21, 2007 at 08:44:26AM +1100, Rick Jelliffe wrote:
> On Tue, 2007-11-20 at 09:52 +0100, Keld Jørn Simonsen wrote:
>
>
> > And anyway I don't see any harm in allowing such characters in markup,
> > as long as we don't try to parse them.
>
> A character in a wrong encoding is data corruption. There is no greater
> harm for documents.

Well, the control characters of a given charset are then in the right
encoding, so that should not be the problem here. Anyway, the C0 control
characters are the same for most charsets.
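
As a rough illustration of the C0 point (a sketch only; the list of
encodings below is an arbitrary sample picked for this example, not
something taken from the draft), Python's standard codecs can be used to
check that bytes 0x00-0x1F decode to the same code points in several
ASCII-based charsets, whereas an EBCDIC code page such as cp037 lays its
controls out differently:

```python
# Sketch: compare how bytes 0x00-0x1F decode under a few codecs.
C0_BYTES = bytes(range(0x20))

# Arbitrary sample of ASCII-based charsets, chosen for illustration only.
ASCII_BASED = ["ascii", "latin-1", "iso8859-15", "cp1252", "koi8-r"]


def c0_is_identity(encoding):
    """True if every byte 0x00-0x1F decodes to the code point of the same value."""
    decoded = C0_BYTES.decode(encoding)
    return [ord(ch) for ch in decoded] == list(range(0x20))


# cp037 (EBCDIC) is included as a counterexample to "all charsets".
for name in ASCII_BASED + ["cp037"]:
    print(name, c0_is_identity(name))
```

So "most" is the right word: the ASCII-derived charsets agree on C0, but
that is not universal.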

> > Anyway, what would you do if a control character beyond the whitespaces
> > shows up in the markup? Treat the whole markup as invalid?
>
> Yes. That is what XML does, and what the W3C I18n WG and the Unicode
> Consortium and the W3C XML WG do. It is based on an approach agreed on
> by members of SC34 WG1 (me, James, etc).

But as far as I understand, we are not talking about Unicode here. We
are talking about all the other charsets.

> > For C1, a number of coded character sets use this range, including many
> > Microsoft charsets, and ISO 10646.
>
> This is exactly why they must be barred. XML insists on true labelling
> of characters with their encoding. A document labelled ISO 8859-1 that
> has some of the MS 1252 characters is not a well-formed XML document.
> This is basics for XML, finalized 10 years ago, why are we even
> discussing it?

I thought XML was always Unicode?

If you label the encoding as cp1252, then the byte range that C1
occupies (0x80-0x9F) is used for normal characters, so C1 should not be
forbidden in general.
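
To make that concrete, here is a small sketch using Python's standard
codecs (the three sample bytes are picked for illustration only). The
same bytes in the 0x80-0x9F range decode to graphic characters when the
data is labelled cp1252, and to C1 control characters when it is
labelled ISO 8859-1:

```python
import unicodedata

# Three sample byte values from the 0x80-0x9F range.
raw = b"\x80\x93\x94"

as_cp1252 = raw.decode("cp1252")   # Windows-1252 assigns graphic characters here
as_latin1 = raw.decode("latin-1")  # ISO 8859-1 maps the same bytes to U+0080-U+009F


def categories(text):
    """Unicode general category of each character ('Cc' means control)."""
    return [unicodedata.category(ch) for ch in text]


print([hex(ord(ch)) for ch in as_cp1252], categories(as_cp1252))
print([hex(ord(ch)) for ch in as_latin1], categories(as_latin1))
```

Whether those bytes are "control characters" therefore depends entirely
on the declared encoding, which is the point about labelling.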

Best regards
Keld

Received on Sat Nov 24 11:19:10 2007
