opentag.com a place for localization tools and technologies
XML and Localization :: FAQ :: Character Representation
You will find here the answers to some of the frequently asked questions about character representation in XML and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.
What characters can I use in an XML document?
The XML specification defines the list of allowed characters. Most Unicode characters are allowed in an XML document: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD], and [U+10000-U+10FFFF].
In XML, use normalized characters as described in the document Character Model for the World Wide Web (Working Draft). Note that the use of compatibility characters is not recommended. For more detailed information on the non-suitability of some Unicode characters in XML, see the Unicode Technical Report #20.
XML 1.0 has some limitations that prevent certain languages from using all of their characters in XML names (element or attribute names, for example). XML version 1.1 addresses these limitations.
What is an NCR?
NCR stands for Numeric Character Reference. It is the term often used to designate a character written in hexadecimal or decimal format in XML, for example &#x20AC; or &#8364; for the Euro sign. The hexadecimal form is preferred because it is easier to relate to the Unicode value.
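The allowed ranges and the two NCR forms described above can be sketched in a few lines of Python. This is only an illustration: the names `XML_CHAR_RANGES`, `is_xml_char`, and `to_ncr` are hypothetical helpers introduced here, not part of any library.

```python
# Code point ranges allowed in an XML 1.0 document, as listed in the text.
XML_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch is allowed in XML 1.0."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


def to_ncr(ch: str, hexadecimal: bool = True) -> str:
    """Write one character as a numeric character reference (NCR)."""
    if not is_xml_char(ch):
        raise ValueError(f"U+{ord(ch):04X} is not allowed in XML 1.0")
    cp = ord(ch)
    return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"


print(to_ncr("\u20ac"))         # hexadecimal form: &#x20AC;
print(to_ncr("\u20ac", False))  # decimal form: &#8364;
```

Checking the range before writing the reference matters because an NCR does not make a forbidden character legal: a reference to a code point outside these ranges is still an error in XML 1.0.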
NCRs (and character entity references) cannot be used in element and attribute names, in CDATA sections, in processing instructions, or in comments.
Conversion issue: You need NCRs only when a character is not included in the encoding you are using. However, there are some cases where using NCRs is advisable even if the encoding supports the character. That is the case, for example, with a few Japanese characters that present conversion problems when going from one encoding to another or from one encoding to the Unicode value. The problem and the characters are documented in the XML Japanese Profile.
Does 146 in
| Entity | Entity Reference | Description | Value |
|---|---|---|---|
| amp | &amp; | ampersand (&) | U+0026 |
| lt | &lt; | less-than (<) | U+003C |
| apos | &apos; | apostrophe (') | U+0027 |
| quot | &quot; | double quote (") | U+0022 |
| gt | &gt; | greater-than (>) | U+003E |
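As an illustration of the five entities in the table, the following sketch escapes a string with Python's standard `xml.sax.saxutils.escape` function. That function handles `&`, `<`, and `>` on its own; the two quote characters are supplied through its optional mapping argument. The wrapper name `escape_xml` is a hypothetical helper chosen for this example.

```python
from xml.sax.saxutils import escape

# escape() replaces &, < and > by default; the quote characters are
# added through its optional "entities" mapping.
QUOTES = {"'": "&apos;", '"': "&quot;"}


def escape_xml(text: str) -> str:
    """Replace the five special characters with their predefined entities."""
    return escape(text, QUOTES)


print(escape_xml('R&D <"beta">'))
```

The ampersand is replaced first, so the ampersands introduced by the other replacements are not escaped a second time.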
Most HTML parsers are forgiving and allow the use of character entity references such as &aacute; without a declaration. This is not permitted in XML, where such entities must be declared before they are used.
Since UTF-8 and UTF-16 support all valid characters, there is really no reason to use NCRs or character entity references with these encodings (except for escaping special characters, and in some special cases to avoid conversion problems). If you want to use some form of "escaped" notation for characters, you probably want to use NCRs rather than character entity references.
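The idea that NCRs are only needed for characters the target encoding cannot represent can be sketched as follows. The function name `escape_for_encoding` is hypothetical, and the sketch deliberately ignores the special characters `&` and `<`, which would need separate escaping.

```python
def escape_for_encoding(text: str, encoding: str) -> str:
    """Keep characters the encoding supports; write the others as hex NCRs."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)          # raises if the encoding lacks ch
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")
    return "".join(out)


sample = "Param\u00e8tres \u20ac"
print(escape_for_encoding(sample, "utf-8"))       # nothing to escape
print(escape_for_encoding(sample, "iso-8859-1"))  # only the Euro sign
print(escape_for_encoding(sample, "ascii"))       # both non-ASCII characters
```

Python's built-in `xmlcharrefreplace` error handler (`text.encode("ascii", "xmlcharrefreplace")`) does the same job but produces decimal references; the loop above is used here only to obtain the hexadecimal form.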
A character entity (or any entity for that matter) is defined outside the body of your document, often in a separate file so it can be reused with several documents.
The declaration statement is:
<!ENTITY entity_name "NCR">
For example a character entity for the Euro currency symbol could be declared as:
<!ENTITY euro "&#x20AC;">
And it would be used in a document as follows:
<p>The character &euro; is the symbol for the Euro.</p>
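To see a declared entity being resolved, and an undeclared one being rejected, here is a small sketch using Python's standard `xml.etree.ElementTree` parser. It assumes the declaration is placed in the internal DTD subset so that the whole example fits in one string; the helper name `parses` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity is declared in the internal DTD subset, then used in the body.
declared = (
    '<!DOCTYPE p [ <!ENTITY euro "&#x20AC;"> ]>'
    "<p>The character &euro; is the symbol for the Euro.</p>"
)
text = ET.fromstring(declared).text
print(text)


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An HTML-style entity used without any declaration is rejected.
print(parses("<p>&aacute;</p>"))  # False
```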
Entity declaration statements should be in the DTD or in a module included in the DTD. To include an entity declaration set in a DTD use for example:
<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
You can find entity declarations for many characters in the XHTML Entity Sets:
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-lat1.ent
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-special.ent
- http://www.w3.org/TR/2001/WD-xhtml1-20011004/DTD/xhtml-symbol.ent
Note: Always keep in mind that, from a localization viewpoint, using numeric character references (NCRs) is better than using character entity references.
Yes. Any Unicode character that is valid in an XML name can be used in element and attribute names. For example, here is a document that uses a Russian document type with Japanese content.
<?xml version="1.0" encoding="utf-8" ?>
<Собирание версия="1.2-3">
  <Объект id="12">
    <НомерОбъекта>45-3454-123</НомерОбъекта>
    <ВНаличии>123</ВНаличии>
    <Описание xml:lang="ja">第二発電機</Описание>
  </Объект>
  <Объект id="64">
    <НомерОбъекта>45-7894-456</НомерОбъекта>
    <ВНаличии>123</ВНаличии>
    <Описание xml:lang="ja">手動ウォーター・ポンプ</Описание>
  </Объект>
</Собирание>
As XML names cannot use NCRs, the encoding of a document must support all characters used in the element and attribute names it contains.
Note that in XML 1.0 the definition of XML names does not allow certain languages to make use of all their characters. XML 1.1 helps in solving this potential problem.
Warning: Note that many translation tools may not be able to deal very well with non-ASCII element and attribute names.
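A quick way to check how a given tool copes with such names is to parse a small sample. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened version of the Russian document, and also shows that an NCR inside a name is rejected. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" encoding="utf-8" ?>
<Собирание версия="1.2-3">
  <Объект id="12">
    <Описание xml:lang="ja">第二発電機</Описание>
  </Объект>
</Собирание>"""

# Parse from bytes so the encoding declaration is honored.
root = ET.fromstring(doc.encode("utf-8"))
print(root.tag, root.get("версия"))
description = next(root.iter("Описание")).text
print(description)


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# An NCR cannot stand in for a character of an element name.
print(is_well_formed(b"<a&#x431;/>"))  # False
```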
A CSS file can be in any encoding, so most of the time, specifying the right encoding will allow you to use any extended character directly in the CSS code. However, at times you may not be able to represent a given character in its raw form. You can then use the hexadecimal notation \HHH, where HHH is the Unicode value of the character.
To avoid confusion between the hexadecimal value and the text that follows (for example, is R\26D to be read as R + \26 + D or as R + \26D?), you may want to use a trailing space after the value. The first space after the hexadecimal notation is ignored, so "R\26 D" is displayed correctly as "R&D".
Example of CSS code with extended character in hexadecimal notation:
@import "main.css";
/* Translatable data */
arguments:before {
  content: "Param\E8tres\0A:\0A";
  font-weight: bold
}
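The escaping rule described above, including the trailing space, can be sketched as a small conversion routine. The function name `css_escape` and its `extra` parameter are hypothetical; the sketch escapes every non-ASCII character plus any characters listed in `extra`, and does not deal with quotes or backslashes already present in the text.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def css_escape(text: str, extra: str = "") -> str:
    """Write non-ASCII characters (and those in extra) as CSS \\HHH escapes."""
    out = []
    for i, ch in enumerate(text):
        if ord(ch) < 128 and ch not in extra:
            out.append(ch)
            continue
        escaped = f"\\{ord(ch):X}"
        nxt = text[i + 1 : i + 2]
        # A hex digit after the escape would be read as part of the value,
        # and a space would be swallowed, so add a terminating space.
        if nxt and (nxt in HEX_DIGITS or nxt.isspace()):
            escaped += " "
        out.append(escaped)
    return "".join(out)


print(css_escape("Param\u00e8tres"))   # Param\E8tres
print(css_escape("R&D", extra="&"))    # R\26 D
```

Adding the space only when the next character is a hexadecimal digit or whitespace keeps the output compact. Always adding it would also be valid.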
[This section needs to be updated]
URIs (Uniform Resource Identifiers) can contain non-ASCII characters. The way considered best to encode such URIs is to convert the characters to UTF-8 and then write each non-ASCII byte as %HH, where HH is the hexadecimal value of the given byte.
The following examples show both the raw and coded versions of URIs:
urn:TéléCom-Schémas:Facture:v2
urn:T%C3%A9l%C3%A9Com-Sch%C3%A9mas:Facture:v2
http://www.Работа.bg
http://www.%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%B0.bg
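The coded forms above can be reproduced with Python's standard `urllib.parse.quote`, which encodes text as UTF-8 by default and writes each byte outside its safe set as %HH. The wrapper name `escape_uri` is hypothetical, and treating only `:` and `/` as additional safe characters is an assumption that fits these two examples, not a general rule for every URI.

```python
from urllib.parse import quote


def escape_uri(uri: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters in uri."""
    # ASCII letters, digits and "_.-~" are never escaped by quote();
    # ":" and "/" are kept as well so the URI structure stays readable.
    return quote(uri, safe=":/")


print(escape_uri("urn:TéléCom-Schémas:Facture:v2"))
print(escape_uri("http://www.Работа.bg"))
```

A URI that already contains %HH sequences, a query string, or a fragment would need a more careful routine, since this sketch would escape the existing `%`, `?`, and `#` characters.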
More information on internationalized URIs and some code samples of conversion routines are available on the W3C Internationalization page discussing URIs and other identifiers.