opentag.com - XML FAQ: Language Identification

You will find here the answers to some of the frequently asked questions about language identification in XML and related technologies. If you find any mistakes or have suggestions for additional useful information, please send an email.

What is the xml:lang attribute?
Do I need to declare the xml:lang attribute?
What are the values for the xml:lang attribute?
In XHTML should I use lang or xml:lang?
What about multilingual documents?
Can I use Unicode Language Tags in XML?
How do I use the lang() function in XPath?
How do I use the lang() selector in CSS?

What is the `xml:lang` attribute?

The xml:lang attribute is a reserved XML attribute that specifies the language of an element's content. Its purpose is to let every XML document type use the same attribute for language identification.

The xml:lang attribute applies to all the attributes (regardless of their order) and the content of the element where it appears, and to all descendants of that element, unless it is overridden by another xml:lang. For example:

<?xml version="1.0" encoding="utf-8" ?>
<doc xml:lang="en">
 <list title="Titre en français" xml:lang="fr">
  <p>Texte en français.</p>
  <p xml:lang="fr-ca">Texte en québécois.</p>
  <p xml:lang="en">Second text in English.</p>
 </list>
 <p>Text in English.</p>
</doc>

Always use the xml:lang attribute to specify the language of content: many other XML-related technologies assume its use, for example the lang() function in XPath.
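
The inheritance rule can be sketched in Python. This is only an illustration under stated assumptions: the helper name effective_languages is hypothetical, and the sketch relies on the standard xml.etree.ElementTree module, which exposes xml:lang under the reserved XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<doc xml:lang="en">
 <list title="Titre en français" xml:lang="fr">
  <p>Texte en français.</p>
  <p xml:lang="fr-ca">Texte en québécois.</p>
  <p xml:lang="en">Second text in English.</p>
 </list>
 <p>Text in English.</p>
</doc>"""

def effective_languages(element, inherited=None):
    """Yield (tag, language) for an element and its descendants.

    An element without its own xml:lang takes the language of its
    nearest ancestor that has one (hypothetical helper).
    """
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)

for tag, language in effective_languages(ET.fromstring(SAMPLE)):
    print(tag, language)
```

The first <p> has no xml:lang of its own, so it is reported with the language of its parent <list>.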

Do I need to declare the `xml:lang` attribute?

Yes, if you have a DTD and validate your documents, you must declare xml:lang like any other attribute. For example, in a DTD you would declare it like this:

<!ATTLIST para xml:lang NMTOKEN #IMPLIED>

If you only use well-formed documents without validation, no declaration is needed: the xml prefix is reserved and bound by definition, so you can use xml:lang directly.

What are the values for the `xml:lang` attribute?

The values of xml:lang are language tags as defined in BCP 47. (The original XML specification mentions RFC 1766, but that RFC has been superseded several times since.)

The most commonly used form is as follows:

<ISOLangCode>[-<ISOCountryCode>]

Where:
<ISOLangCode> is an ISO language 2-letter code (ISO 639-1) or 3-letter code (ISO 639-2T).
<ISOCountryCode> is an ISO country code (ISO 3166).
The separator is the dash '-' (U+002D), not the underscore '_' (U+005F).
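
As a rough illustration of this simple form, the following Python sketch checks only the shape of a value. The pattern and the function name is_simple_language_tag are assumptions made for the example; the check does not verify that the codes actually exist in the ISO lists, and it rejects other valid BCP 47 tags.

```python
import re

# Assumed pattern: a 2- or 3-letter language code, optionally followed by
# a hyphen and a 2-letter country code. Longer BCP 47 tags are not covered.
SIMPLE_TAG = re.compile(r"[A-Za-z]{2,3}(?:-[A-Za-z]{2})?")

def is_simple_language_tag(value):
    """Return True if value has the simple language[-country] shape."""
    return SIMPLE_TAG.fullmatch(value) is not None

for candidate in ("fr", "fr-CA", "mni", "en_US"):
    print(candidate, is_simple_language_tag(candidate))
```

The last candidate is rejected because it uses an underscore instead of a dash.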

See http://www.loc.gov/standards/iso639-2/php/code_list.php for the official list of the ISO language codes, and see http://www.iso.org/iso/en/prods-services/iso3166ma/index.html for the official list of the ISO country codes.

Usage of the ISO language codes:

Use the 2-letter code when one exists.
If a 2-letter code does not exist for a given language, use the Terminology form of the 3-letter code (ISO 639-2T), not the Bibliography form (ISO 639-2B).

For example:

ISO 639-1	ISO 639-2T	ISO 639-2B	xml:lang	Language
la	lat	lat	la	Latin
de	deu	ger	de	German
-	mni	mni	mni	Manipuri

Note that the values for xml:lang are not case-sensitive, unlike the other XML attribute values. For example: "en-pg" is the same as "EN-pg", "en-PG", "EN-PG", "En-Pg", or any other case combination.
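
In code, this means two values should be compared without regard to case. A minimal Python sketch, where same_language_tag is a hypothetical helper name:

```python
def same_language_tag(first, second):
    """Compare two language tags without regard to case."""
    return first.lower() == second.lower()

variants = ["EN-pg", "en-PG", "EN-PG", "En-Pg"]
print(all(same_language_tag("en-pg", variant) for variant in variants))
```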

You may also use IANA-defined or user-defined codes as documented in the RFC. However, this is not recommended, as the vast majority of tools will not recognize such codes. The list of the IANA registered language codes is available at http://www.isi.edu/in-notes/iana/assignments/languages.

Note that the current language identification system is not sufficient for everyone. While it works well for most of the "main" languages used in business, many languages are not covered. The simple form above also has no easy way to specify a language variant associated with something other than a country: Latin-American Spanish, for example, has no ISO 3166 country code (BCP 47 covers this case with a UN M.49 region code, as in es-419). See the Ethnologue Web site (from SIL International) for a more complete list of the 6,000+ languages used in the world.

In XHTML should I use `lang` or `xml:lang`?

In XHTML 1.0 you should use both. The lang attribute helps when the document is viewed by a browser without XML support, while xml:lang helps when the document is processed by XML tools. The values of both attributes obey the same rules. If both attributes are present but set to different languages, xml:lang prevails.

In XHTML 1.1 you use only xml:lang.

Do use the language attributes: in some cases this is the only way for browsers to know how to render the characters correctly. For example, many Han characters (used in Chinese, Japanese, and Korean) are the same but require different fonts or rendering rules. Identifying the language of the content in the root element (e.g. <html>) can make a big difference in the display.

Example of an XHTML document with language attributes:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><title>Minimal XHTML file</title></head>
 <body>
  <p lang="fr" xml:lang="fr">Du texte en français.</p>
 </body>
</html>
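
Because XHTML 1.0 documents carry both attributes, it can be useful to check that they agree. The following Python sketch is one possible check, not a standard tool: the function name mismatched_language_attributes is hypothetical, and the sample deliberately omits xml:lang on the paragraph to show what gets reported.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def mismatched_language_attributes(root):
    """Return (local name, lang, xml:lang) for elements where the two disagree."""
    problems = []
    for element in root.iter():
        html_lang = element.get("lang")
        xml_lang = element.get(XML_LANG)
        if html_lang is None and xml_lang is None:
            continue  # No language attribute here: nothing to check.
        if (html_lang is None or xml_lang is None
                or html_lang.lower() != xml_lang.lower()):
            local_name = element.tag.rsplit("}", 1)[-1]
            problems.append((local_name, html_lang, xml_lang))
    return problems

# Sample with a deliberate mistake: the <p> element lacks xml:lang.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><title>Minimal XHTML file</title></head>
 <body>
  <p lang="fr">Du texte en français.</p>
 </body>
</html>"""

print(mismatched_language_attributes(ET.fromstring(SAMPLE)))
```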

What about multilingual documents?

You can have different languages within the same document. If you have such documents, make sure to use the xml:lang attribute to identify the language of each part of the content.

However, from a localization viewpoint, such documents are not currently easy to localize and their use is not recommended if you do not have a clear and efficient process in place to handle their translation.

See an example of an XML document in 15 languages: with a simple style sheet, and without a style sheet. (You need a browser with XML and CSS support to display both documents correctly. Note also that some characters may be represented as blocks if you do not have a font to display them.)

Can I use Unicode Language Tags in XML?

No, the use of Unicode Language Tag characters in XML (or any other markup language) is strongly discouraged. Use xml:lang to specify the language of content.

The Unicode Language Tag characters (Unicode values in the range U+E0000 to U+E007F) are a set of characters reserved to identify languages in the coding of plain text for specific protocols. For more information see the Unicode Technical Report #20.
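
If you need to check whether some text contains such characters before placing it in an XML document, a small Python sketch along these lines may help. The function name find_language_tag_characters is hypothetical, and the test is simply membership in the U+E0000 to U+E007F range.

```python
def find_language_tag_characters(text):
    """Return (index, code point) pairs for characters in U+E0000..U+E007F."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if 0xE0000 <= ord(char) <= 0xE007F
    ]

# Three tag characters followed by ordinary text.
sample = "\U000E0001\U000E0065\U000E006EHello"
print(find_language_tag_characters(sample))
```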

How do I use the `lang()` function in XPath?

XPath offers a dedicated function to test the language of the context node: lang(). The function returns true if the language of the context node, as given by the xml:lang attribute on the node itself or on its nearest ancestor that has one, matches the language code you specify.

The matching compares the start of the value: the value must either equal the argument or begin with the argument followed by a hyphen. For example: lang('en') will match xml:lang="en", xml:lang="en-gb", and xml:lang="en-us", but will not match xml:lang="x-en-pidgin". In the same way, lang('en-pg') will match xml:lang="en-pg", but not xml:lang="en".

The matching is not case-sensitive. For example: lang('En-US') will match all case combinations, such as xml:lang="EN-us", xml:lang="en-us", xml:lang="eN-uS", and so forth.
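
The matching rule can be written out in a few lines of Python. This is a sketch of the rule as described here, not the code of any XPath processor, and lang_matches is a hypothetical name.

```python
def lang_matches(language, requested):
    """Mimic the XPath lang() test for one language value.

    language is the xml:lang value in effect for the node (or None),
    requested is the argument that would be passed to lang().
    """
    if language is None:
        return False
    value = language.lower()
    wanted = requested.lower()
    # Either the same tag, or the same tag followed by "-" and a suffix.
    return value == wanted or value.startswith(wanted + "-")

for value in ("en", "en-gb", "x-en-pidgin", "EN-us"):
    print(value, lang_matches(value, "en"))
```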

Example of usage: the following XML document contains <Text> elements in different languages.

<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="Languages.xsl" ?>
<MyData>
 <Msg id="100">
  <Text xml:lang="en">Message 100 in English.</Text>
 </Msg>
 <Msg id="200">
  <Text xml:lang="en-us">Message 200 in American English.</Text>
  <Text xml:lang="fr-CA">Message 200 en québécois.</Text>
 </Msg>
 <Msg id="300">
  <Text xml:lang="fr">Message 300 en français.</Text>
 </Msg>
 <Msg id="400">
  <Text xml:lang="EN-GB">Message 400 in British English.</Text>
 </Msg>
</MyData>

The following XSL template will output only the content of the <Text> elements that match the value set for the OutLang parameter (here: 'en').

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:param name="OutLang">en</xsl:param>
 <xsl:template match="Text">
  <xsl:if test="lang($OutLang)">
   <p><xsl:value-of select="."/>
   (<xsl:value-of select="@xml:lang"/>)</p>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

Display the document (you need a browser that supports XML and XSL).
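
If no XSLT-capable browser is at hand, the selection made by the template can be approximated in Python. This sketch uses only the standard xml.etree.ElementTree module, which has no lang() function of its own, so the matching is re-implemented by hand; the names lang_matches and select_texts are hypothetical, and the XML declaration and style sheet instruction are left out of the sample.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<MyData>
 <Msg id="100">
  <Text xml:lang="en">Message 100 in English.</Text>
 </Msg>
 <Msg id="200">
  <Text xml:lang="en-us">Message 200 in American English.</Text>
  <Text xml:lang="fr-CA">Message 200 en québécois.</Text>
 </Msg>
 <Msg id="300">
  <Text xml:lang="fr">Message 300 en français.</Text>
 </Msg>
 <Msg id="400">
  <Text xml:lang="EN-GB">Message 400 in British English.</Text>
 </Msg>
</MyData>"""

def lang_matches(language, requested):
    """Case-insensitive match on the whole tag or on a hyphen-delimited prefix."""
    if language is None:
        return False
    value, wanted = language.lower(), requested.lower()
    return value == wanted or value.startswith(wanted + "-")

def select_texts(element, out_lang, inherited=None):
    """Yield 'text (xml:lang)' for each <Text> whose language matches out_lang."""
    language = element.get(XML_LANG, inherited)
    if element.tag == "Text" and lang_matches(language, out_lang):
        yield f"{element.text} ({element.get(XML_LANG, '')})"
    for child in element:
        yield from select_texts(child, out_lang, language)

for line in select_texts(ET.fromstring(SAMPLE), "en"):
    print(line)
```

With the value 'en', the messages 100, 200 (American English), and 400 are selected; the two French messages are skipped.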

How do I use the `lang()` selector in CSS?

CSS2 offers a pseudo-class selector specific to language (see the specifications).

It uses the following syntax:

<Element>:lang(<LangCode>)

Where <Element> is the element selector, and <LangCode> is the language code as defined by RFC 1766 (superseded by RFC 3066 in January 2001).

For example:

*:lang(zh)    { font-family: SimSun; }
*:lang(zh-tw) { font-family: MingLiU; }
emph:lang(en) { font-weight: bold; }
emph:lang(iu) { text-decoration: underline; }

How the language of a given content is determined depends on the format with which the CSS style sheet is used: for instance, xml:lang in XML.
