opentag.com, a place for localization tools and technologies
XML and Localization :: FAQ :: Language Identification
You will find here the answers to some of the frequently asked questions about character representation in XML and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.
What is the
ISO 639-1 | ISO 639-2/T | ISO 639-2/B | xml:lang | Language |
---|---|---|---|---|
la | lat | lat | la | Latin |
de | deu | ger | de | German |
- | mni | mni | mni | Manipuri |
Note that the values for xml:lang are not case-sensitive, unlike other XML attribute values. For example, "en-pg" is the same as "EN-pg", "en-PG", "EN-PG", "En-Pg", or any other case combination.
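As a rough illustration of the case rule above, a tool that compares xml:lang values could normalize case before comparing. The Python sketch below uses a hypothetical helper, same_language_tag, which is not part of any XML library.

```python
def same_language_tag(a: str, b: str) -> bool:
    """Return True if two xml:lang values differ at most by letter case."""
    # Language codes use ASCII letters only, so lower() is enough here.
    return a.lower() == b.lower()


variants = ["en-pg", "EN-pg", "en-PG", "EN-PG", "En-Pg"]
all_same = all(same_language_tag("en-pg", v) for v in variants)
print(all_same)
```

Note that this only tests equality of whole values: "en" and "en-pg" are still different codes.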
You may also use IANA-defined or user-defined codes as documented in the RFC. However, this is not recommended at all, as the vast majority of the tools will not recognize such codes. The list of the IANA registered language codes is available at http://www.isi.edu/in-notes/iana/assignments/languages.
Note that the current language identification system is not sufficient for everyone. While it works fine for the majority of the "main" languages used in business, many languages are not covered. There is also no easy way to specify a language variant associated with something other than a country (for example, there is no standard code for Latin-American Spanish). See the Ethnologue Web site (from SIL International) for a more complete list of the 6,000+ languages used in the world.
lang or xml:lang?
In XHTML 1.0 you should use both. The lang attribute is useful if the document is viewed in a browser without XML support, while xml:lang is useful when the document is processed by XML tools. The values of both attributes obey the same rules. If both attributes are present but set to different languages, xml:lang prevails.
In XHTML 1.1 you use only xml:lang.
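A tool that reads both attributes needs a precedence rule. The following Python sketch, built on the standard library's xml.etree.ElementTree, shows one way to apply the rule above. The helper name declared_language and the sample paragraph (with deliberately conflicting values) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_language(element):
    """Return the language declared on an element, or None.

    xml:lang takes precedence over lang when both are present.
    """
    value = element.get(XML_LANG)
    if value is None:
        value = element.get("lang")
    return value


# Hypothetical sample: the two attributes disagree on purpose.
SAMPLE = (
    '<p xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="fr">'
    "Du texte en français.</p>"
)
paragraph = ET.fromstring(SAMPLE)
print(declared_language(paragraph))
```

The function only looks at the element itself; it does not walk up to parent elements to find an inherited language.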
Do use the language attribute: in some cases this is the only way for browsers to know how to render the characters correctly. For example, many Han characters (used in Chinese, Japanese, and Korean) share the same code points but require different fonts or rendering rules, so identifying the language of the content in the root element (e.g. <html>) can make a big difference in the display.
Example of an XHTML document with language attributes:
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head><title>Minimal XHTML file</title></head>
     <body>
      <p lang="fr" xml:lang="fr">Du texte en français.</p>
     </body>
    </html>
You can have different languages within the same document. If you have such documents, make sure to use the xml:lang attribute to identify the language of each part of the content.
However, from a localization viewpoint, such documents are not currently easy to localize and their use is not recommended if you do not have a clear and efficient process in place to handle their translation.
See an example of an XML document in 15 languages: with a simple style sheet, and without a style sheet. (You need a browser with XML and CSS support to display both documents correctly. Note also that some characters may be displayed as blocks if you do not have a font that covers them.)
No, the use of Unicode Language Tag characters in XML (or any other markup language) is strongly discouraged. Use xml:lang to specify the language of the content.
The Unicode Language Tag characters (Unicode values in the range U+E0000-U+E007F) are a set of characters reserved to identify languages in the coding of plain text for specific protocols. For more information see the Unicode Technical Report #20.
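To check that text destined for an XML document does not carry these characters, something like the following Python sketch could be used. The function names are hypothetical, and removing the characters (rather than rejecting the input) is an assumption of this example.

```python
# Bounds of the Unicode Tags block (U+E0000 to U+E007F).
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F


def has_language_tag_characters(text: str) -> bool:
    """Return True if the text contains any character from the Tags block."""
    return any(TAG_FIRST <= ord(ch) <= TAG_LAST for ch in text)


def strip_language_tag_characters(text: str) -> str:
    """Return the text with every Tags-block character removed."""
    return "".join(ch for ch in text if not TAG_FIRST <= ord(ch) <= TAG_LAST)


# U+E0001 (LANGUAGE TAG) followed by the tag forms of "f" and "r".
tagged = "\U000E0001\U000E0066\U000E0072Bonjour"
print(has_language_tag_characters(tagged))
print(strip_language_tag_characters(tagged))
```

Once the tag characters are gone, the language would be recorded in markup instead, for instance with xml:lang="fr" on the enclosing element.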
lang() function in XPath?
XPath offers a dedicated function to identify the language of a given node: lang(). The function returns true if the language of the node, as set by the xml:lang attribute on the node itself or on its nearest ancestor that has one, matches the language code you specified.
The matching compares the argument with the start of the attribute value, one subtag at a time: the value must either equal the argument or continue with a hyphen right after it. For example, lang('en') will match xml:lang="en", xml:lang="en-gb", and xml:lang="en-us", but will not match xml:lang="x-en-pidgin". In the same way, lang('en-pg') will match xml:lang="en-pg", but not xml:lang="en".
The matching is not case-sensitive. For example, lang('En-US') will match all case combinations, such as xml:lang="EN-us", xml:lang="en-us", xml:lang="eN-uS", and so forth.
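The matching rules described above can be approximated in a few lines of Python. This is a sketch rather than an XPath implementation: the function name lang_matches is hypothetical, and it takes the declared xml:lang value as a plain string instead of looking it up on a node. It assumes, following XPath 1.0, that a prefix only counts when it is followed by a hyphen.

```python
def lang_matches(requested: str, declared: str) -> bool:
    """Approximate the comparison done by the XPath lang() function.

    The declared value matches if, ignoring case, it equals the requested
    code or starts with the requested code followed by a hyphen.
    """
    requested = requested.lower()
    declared = declared.lower()
    return declared == requested or declared.startswith(requested + "-")


print(lang_matches("en", "en-gb"))
print(lang_matches("en", "x-en-pidgin"))
print(lang_matches("en-pg", "en"))
print(lang_matches("En-US", "eN-uS"))
```

Requiring the hyphen keeps a code such as "en" from matching an unrelated value that merely begins with the same letters.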
Example of usage: The following XML document contains <Text> elements in different languages.
    <?xml version="1.0" encoding="iso-8859-1" ?>
    <?xml-stylesheet type="text/xsl" href="Languages.xsl" ?>
    <MyData>
     <Msg id="100">
      <Text xml:lang="en">Message 100 in English.</Text>
     </Msg>
     <Msg id="200">
      <Text xml:lang="en-us">Message 200 in American English.</Text>
      <Text xml:lang="fr-CA">Message 200 en québécois.</Text>
     </Msg>
     <Msg id="300">
      <Text xml:lang="fr">Message 300 en français.</Text>
     </Msg>
     <Msg id="400">
      <Text xml:lang="EN-GB">Message 400 in British English.</Text>
     </Msg>
    </MyData>
The following XSL template will output only the content of the <Text> elements that match the value set for the OutLang parameter (here: 'en').
    <?xml version="1.0" ?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
     <xsl:param name="OutLang">en</xsl:param>
     <xsl:template match="Text">
      <xsl:if test="lang($OutLang)">
       <p><xsl:value-of select="."/> (<xsl:value-of select="@xml:lang"/>)</p>
      </xsl:if>
     </xsl:template>
    </xsl:stylesheet>
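For readers without an XSLT processor at hand, the same selection can be sketched in Python with the standard library's xml.etree.ElementTree. ElementTree has no lang() function, so the matching is done by hand here. The names select_text and lang_matches are hypothetical, the sample is a copy of the document above without its XML declaration and processing instruction, and, unlike the template, the sketch reports the inherited language when a <Text> element has no xml:lang of its own.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<MyData>
 <Msg id="100">
  <Text xml:lang="en">Message 100 in English.</Text>
 </Msg>
 <Msg id="200">
  <Text xml:lang="en-us">Message 200 in American English.</Text>
  <Text xml:lang="fr-CA">Message 200 en québécois.</Text>
 </Msg>
 <Msg id="300">
  <Text xml:lang="fr">Message 300 en français.</Text>
 </Msg>
 <Msg id="400">
  <Text xml:lang="EN-GB">Message 400 in British English.</Text>
 </Msg>
</MyData>"""


def lang_matches(requested, declared):
    """Case-insensitive match on the whole code or on a hyphen-delimited prefix."""
    requested, declared = requested.lower(), declared.lower()
    return declared == requested or declared.startswith(requested + "-")


def select_text(element, out_lang, inherited=None):
    """Yield (text, language) for each <Text> element whose language matches."""
    # An element without xml:lang inherits the language of its parent.
    language = element.get(XML_LANG, inherited)
    if element.tag == "Text" and language is not None:
        if lang_matches(out_lang, language):
            yield element.text, language
    for child in element:
        yield from select_text(child, out_lang, language)


root = ET.fromstring(SAMPLE)
selected = list(select_text(root, "en"))
for text, language in selected:
    print(f"{text} ({language})")
```

With "en" as the requested language, messages 100, 200 (American English), and 400 are selected, while the two French messages are skipped.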
Display the document (you need a browser that supports XML and XSL).
lang() selector in CSS?
CSS2 offers a pseudo-class selector specific to language (see the specification). It uses the following syntax:
    <Element>:lang(<LangCode>)
Where <Element> is the element selector, and <LangCode> is the language code as defined by RFC 1766 (superseded by RFC 3066 in January 2001).
For example:
    *:lang(zh) { font-family: SimSun; }
    *:lang(zh-tw) { font-family: MingLiU; }
    emph:lang(en) { font-weight: bold; }
    emph:lang(iu) { text-decoration: underline; }
How the language of a given piece of content is determined depends on the format with which the CSS style sheet is used, for instance xml:lang in XML.