opentag.com
\\ XML and Localization :: FAQ :: Encoding

Here you will find answers to some of the frequently asked questions about character representation in XML and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.

What is the default encoding of an XML document?

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the very beginning of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes        Encoding assumed
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8
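
The lookup in this table can be sketched in a few lines of Python. This is only an illustration: the names BOMS and sniff_encoding are hypothetical, and a real parser also looks at the encoding declaration. Note that the UTF-32 marks are tested first, because FF FE 00 00 also begins with FF FE.

```python
# Hypothetical helper: guess an XML document's encoding from its first bytes.
# Longer marks come first so UTF-32 (little-endian) is not mistaken for UTF-16.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32 (big-endian)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (little-endian)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]

def sniff_encoding(data: bytes) -> str:
    """Return the encoding implied by the BOM, or UTF-8 when there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "UTF-8"  # none of the above: UTF-8 is assumed
```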

Note that the encoding of an XML document is never iso-8859-1 by default.

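One of the most common mistakes when editing an XML document is to add non-ASCII characters (accented letters, for example) and forget to set the encoding declaration at the top of the document. If the editor saves the file in anything other than UTF-8 or UTF-16, the parser will then misread or reject those characters.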
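
The following Python sketch reproduces this mistake with the standard xml.etree.ElementTree module. The sample element and its content are made up for the illustration, and other parsers may report the problem differently.

```python
import xml.etree.ElementTree as ET

text = "<name>café</name>"
latin1_bytes = text.encode("iso-8859-1")  # é is stored as the single byte E9

# Without a declaration the parser assumes UTF-8. A lone E9 byte is not
# valid UTF-8, so the document is rejected.
try:
    ET.fromstring(latin1_bytes)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# With the encoding declaration in place, the same bytes parse correctly.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + latin1_bytes
name = ET.fromstring(declared).text

print(parsed_without_declaration, name)
```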

Which encodings do XML parsers support?

All XML processors must support at least UTF-8 and UTF-16.

Support for other encodings is encouraged but not required. However, many parsers support all the ISO 8859 encodings, as well as the most widely used Japanese, Chinese, and Korean ones.

How do I specify the encoding of a document?

Use the XML declaration for this. For example, to specify iso-8859-1 (Latin-1) encoding:

<?xml version="1.0" encoding="iso-8859-1" ?>

The values for the encoding declaration are the charset names defined by the IANA (Internet Assigned Numbers Authority). These are the same values as for the http-equiv charset meta-tag in HTML.

See http://www.iana.org/assignments/character-sets for the complete list of the IANA charset names.

If the file is provided through HTTP, you can also specify the encoding of a document by using the HTTP charset parameter in the Content-Type field. However, from a localization viewpoint, it is always better to keep the encoding declaration within the document itself, because the file will be moved around and processed in non-HTTP environments.
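
As an illustration of where the HTTP charset parameter lives, here is a small Python sketch that extracts it from a Content-Type value using the standard email.message module. The function name charset_from_content_type and the sample header values are assumptions made for the example.

```python
from email.message import Message

def charset_from_content_type(value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the name and returns None if absent.
    return msg.get_content_charset()

print(charset_from_content_type("application/xml; charset=ISO-8859-1"))
print(charset_from_content_type("application/xml"))
```

When the second call returns None, nothing is declared at the HTTP level, and the BOM and the encoding declaration inside the document are all a processor has to go on.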

How do I specify the encoding of a CSS file?

The encoding of a CSS file is determined according to the following rules, in order of priority:

  1. If the file uses HTTP: By the HTTP charset parameter in the Content-Type field.
  2. By the value of the @charset rule at the top of the CSS file.
  3. By the declaration mechanism of the referencing document, if one exists. For example in XHTML: the charset attribute of the <link> element.

For example, to specify iso-8859-1 (Latin-1) encoding:

@charset "iso-8859-1";

The values for the @charset rule are the charset names defined by the IANA (Internet Assigned Numbers Authority). These are the same values as for the XML encoding declaration.

See http://www.iana.org/assignments/character-sets for the complete list of the IANA charset names.
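
The three rules above can be sketched as a simple fallback chain in Python. The function css_encoding and its parameters are hypothetical, and the sketch assumes the style sheet is stored in an ASCII-compatible encoding without a BOM, so that the @charset rule can be read directly from the first bytes.

```python
import re

# The @charset rule must be the very first thing in the file.
CHARSET_RULE = re.compile(rb'^@charset "([^"]*)";')

def css_encoding(http_charset=None, css_bytes=b"", link_charset=None):
    """Apply the three rules in order and return the first charset found."""
    if http_charset:                      # 1. HTTP charset parameter
        return http_charset
    match = CHARSET_RULE.match(css_bytes)
    if match:                             # 2. @charset rule in the file
        return match.group(1).decode("ascii")
    return link_charset                   # 3. referencing document, or None
```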

What are UTF-8 and UTF-16?

UTF-8 and UTF-16 are two encoding schemes of Unicode. UTF stands for Unicode Transformation Format. There are various types of UTFs. They are simply different ways to map the Unicode code points to sequences of bytes.
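
To make the difference concrete, the Python sketch below prints the bytes produced for a few sample characters in three UTFs. The helper name encoded_forms and the choice of characters are only for illustration.

```python
def encoded_forms(ch: str) -> dict:
    """Return the hex bytes of one character in three UTF encoding schemes."""
    return {
        "UTF-8": ch.encode("utf-8").hex(" "),
        "UTF-16BE": ch.encode("utf-16-be").hex(" "),
        "UTF-32BE": ch.encode("utf-32-be").hex(" "),
    }

# One code point each from the 1-, 2-, 3- and 4-byte ranges of UTF-8.
for ch in ["A", "é", "€", "𝄞"]:
    print(f"U+{ord(ch):04X}", encoded_forms(ch))
```

The same code point takes between one and four bytes in UTF-8, two or four bytes in UTF-16, and always four bytes in UTF-32.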

If a document does not have any encoding declaration and no BOM, it is assumed its encoding is UTF-8.

For more detailed information on UTF see the UTF FAQ at the Unicode Web site.

What is the BOM (Byte-Order-Mark)?

The Byte-Order-Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. In XML, the BOM is mandatory for UTF-16 documents and optional for UTF-8; UTF-32 files normally carry one as well.

Note that the BOM has the same value as the Unicode character U+FEFF: the ZERO WIDTH NO-BREAK SPACE.

Big-endian byte order (a.k.a. network byte order) is used by processors such as Motorola or RISC (most significant byte is stored first). Little-endian byte order (most significant byte is stored last) is used by processors such as Intel or VAX. Both mechanisms have advantages and drawbacks.

BOM            Encoding
EF BB BF       UTF-8
FE FF          UTF-16 (big-endian)
FF FE          UTF-16 (little-endian)
00 00 FE FF    UTF-32 (big-endian)
FF FE 00 00    UTF-32 (little-endian)

Byte order is important only for encodings whose code units are larger than 8 bits (UTF-16, UCS-2, UTF-32, etc.). It has no bearing on UTF-8, but some tools (such as Notepad in Windows 2000) still place a BOM at the top of UTF-8 files. This is allowed.
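
A short Python sketch shows both points: the same character yields different bytes in each byte order, and the BOM is nothing more than U+FEFF written in the byte order of the file. The variable names are illustrative; the codecs constants and the "utf-16" and "utf-8-sig" codecs are part of the Python standard library.

```python
import codecs

# The same character, U+0041 ("A"), in both byte orders.
big_endian = "A".encode("utf-16-be")      # bytes 00 41
little_endian = "A".encode("utf-16-le")   # bytes 41 00

# The BOM is simply U+FEFF written in the byte order of the file.
bom_be = "\ufeff".encode("utf-16-be")     # FE FF
bom_le = "\ufeff".encode("utf-16-le")     # FF FE
bom_utf8 = "\ufeff".encode("utf-8")       # EF BB BF

# Python's generic "utf-16" decoder reads the BOM to choose the byte order.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")

# "utf-8-sig" accepts an optional UTF-8 BOM and strips it if present.
with_bom = (codecs.BOM_UTF8 + b"<a/>").decode("utf-8-sig")
without_bom = b"<a/>".decode("utf-8-sig")
```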

The terms big-endian and little-endian come from an article written by Danny Cohen in 1980, in which he compares the technical disagreement over how to order the bytes in a 16-bit word to the fierce and rather pointless dispute the Lilliputians of Swift's Gulliver's Travels have over how to break hard-boiled eggs. The Little-Endian partisans follow an imperial edict ordering that eggs be opened at their smaller end, while the more conservative Big-Endian supporters want to keep cracking their eggs at the bigger end. Both factions claim to interpret correctly the precept of the great prophet Lustrog, who wrote that "All true believers shall break their eggs at the convenient End".

For more detailed information on the BOM or UTF encodings see the UTF FAQ at the Unicode Web site.

How do I convert an XML document from one encoding to another?

Some XML editors allow you to specify the encoding of the document when saving it. Otherwise you can use one of the available encoding conversion tools (a.k.a. transcoders). Unfortunately, few of them handle numeric character references (NCRs) and character entity references. Rainbow, a Windows freeware tool, offers a function to convert the encoding of XML documents.

You can also use an XSL template that simply copies the whole XML document as-is, but specifies the desired output encoding in the <xsl:output/> element. For example, the following XSL template will output the same XML document as the input, but with its encoding changed to Shift_JIS:

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:output encoding="Shift_JIS"/>
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

You can use this template to convert to any encoding supported by the XSL processor you are using: simply replace the value of the encoding attribute in the <xsl:output/> element with the name of the desired output encoding. You may also have to specify an <xsl:preserve-space/> element if the whitespace in some of the elements of the input document needs to be preserved. See the documentation for <xsl:output/> and <xsl:preserve-space/> for more details on how to fine-tune the type of output.

Any character not supported by the output encoding will be output as an NCR. Just make sure the output encoding supports all the characters used in element and attribute names, since names cannot be written as NCRs.
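
To show what an NCR looks like, the Python sketch below encodes text to ASCII and lets the standard "xmlcharrefreplace" error handler replace every unsupported character with its decimal NCR. The function name to_ascii_with_ncrs is hypothetical, and the sketch covers character data only: it does not escape & or <.

```python
def to_ascii_with_ncrs(text: str) -> bytes:
    """Encode to ASCII, writing each non-ASCII character as &#NNNN;."""
    return text.encode("ascii", errors="xmlcharrefreplace")

print(to_ascii_with_ncrs("café"))   # é is U+00E9, decimal 233
print(to_ascii_with_ncrs("日本"))    # U+65E5 and U+672C
```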

The template can be used with a command-line XSL processor. For example, in Windows with MSXSL the syntax would be:

C:\>msxsl MyInput.xml ConvertEncoding.xsl > MyOutput.xml

Where MyInput.xml is the document to convert, ConvertEncoding.xsl is the template shown above, and MyOutput.xml is the resulting document.
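
If no XSL processor is at hand, a similar conversion can be sketched in Python with the standard xml.etree.ElementTree module. This is not XSLT but a plain parse-and-reserialize approach, the function name convert_encoding is hypothetical, and it comes with limitations: comments, processing instructions and the DOCTYPE of the input are not carried over by the default parser.

```python
import io
import xml.etree.ElementTree as ET

def convert_encoding(xml_bytes: bytes, target: str) -> bytes:
    """Re-serialize an XML document in the target encoding."""
    # The parser works out the input encoding from the BOM or declaration.
    root = ET.fromstring(xml_bytes)
    out = io.BytesIO()
    # Characters the target encoding cannot represent are written as NCRs.
    ET.ElementTree(root).write(out, encoding=target, xml_declaration=True)
    return out.getvalue()

source = '<?xml version="1.0" encoding="UTF-8"?><doc><p>日本語 €</p></doc>'
converted = convert_encoding(source.encode("utf-8"), "Shift_JIS")
```

Here the Japanese text is stored directly as Shift_JIS bytes, while the euro sign, which Shift_JIS does not support, comes out as an NCR.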