SRX and Java

From OkapiWiki

The SRX 2.0 standard is based on the ICU regular expression notation.

Many Java applications use Java's regular expressions to implement SRX because ICU4J (ICU for Java) does not support ICU regular expressions.

As of version 1.6, Java does not support some of the Unicode-enabled features described in ICU. For example, in Java "\w" means "[a-zA-Z_0-9]", not "[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]" as in ICU. Some ICU features can be replaced by an equivalent expression in Java, but others simply cannot be implemented in Java.
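The "\w" difference described above can be checked with a short Java sketch. The class and method names (WordClassDemo, isJavaWord, isIcuStyleWord) are illustrative only and are not part of Okapi or SRX. The sketch assumes the default java.util.regex behavior, with no extra flags set.

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // The ICU definition of \w quoted in the text, spelled out as a Java character class.
    static final String ICU_WORD = "[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";

    // True if the whole input consists of Java \w characters (ASCII-based by default).
    static boolean isJavaWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    // True if the whole input consists of characters from the ICU-style class.
    static boolean isIcuStyleWord(String s) {
        return Pattern.matches(ICU_WORD + "+", s);
    }

    public static void main(String[] args) {
        String ascii = "word";
        String accented = "\u00e9t\u00e9"; // the French word for "summer", with two accented letters
        String underscored = "a_b";

        System.out.println(isJavaWord(ascii));           // true
        System.out.println(isIcuStyleWord(ascii));       // true
        System.out.println(isJavaWord(accented));        // false: the accented letter is not ASCII
        System.out.println(isIcuStyleWord(accented));    // true: the accented letter is in Ll
        System.out.println(isJavaWord(underscored));     // true: Java \w includes the underscore
        System.out.println(isIcuStyleWord(underscored)); // false: the class above has no underscore
    }
}
```

The last two lines show that the two definitions differ in both directions: the Java class accepts the underscore, while the ICU-style class given above does not.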

The following table shows the ICU and Java differences. The yellow entries denote a case where the ICU expression needs to be mapped to a Java equivalent (sometimes a complex one), and the red entries indicate the cases where the ICU expression cannot be mapped in Java.

Note: Starting in M16, the Okapi implementation of SRX uses the ICU patterns, not the Java patterns. (You can test this, for example, in Ratel.)


| ICU Meta Character | Java Equivalent | ICU Description |
|---|---|---|
| \a | same | Match a BELL, \u0007. |
| \A | same | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. |
| \b, outside of a set | \b exists but does not have exactly the same behavior. | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. The option UREGEX_UWORD is assumed to be NOT set (default). |
| \b, within a set | \b is invalid within a set. Use \u0008 instead. | Match a BACKSPACE, \u0008. |
| \B | \B exists but does not have exactly the same behavior. | Match if the current position is not a word boundary. The option UREGEX_UWORD is assumed to be NOT set (default). |
| \cX | same | Match a control-X character. |
| \d | \d exists but is ASCII based. Use [\p{Nd}] instead. | Match any character with the Unicode General Category of Nd (Number, Decimal Digit). |
| \D | \D exists but is ASCII based. Use [^\p{Nd}] instead. | Match any character that is not a decimal digit. |
| \e | same | Match an ESCAPE, \u001B. |
| \E | same | Terminates a \Q ... \E quoted sequence. |
| \f | same | Match a FORM FEED, \u000C. |
| \G | same | Match if the current position is at the end of the previous match. |
| \n | same | Match a LINE FEED, \u000A. |
| \N{UNICODE CHARACTER NAME} | Does not exist. | Match the named character. |
| \p{UNICODE PROPERTY NAME} | same | Match any character with the specified Unicode Property. |
| \P{UNICODE PROPERTY NAME} | same | Match any character not having the specified Unicode Property. |
| \Q | same | Quotes all following characters until \E. |
| \r | same | Match a CARRIAGE RETURN, \u000D. |
| \s | \s exists but is ASCII based (it matches [ \t\n\x0B\f\r]). Use [\t\n\f\r\p{Z}] instead. | Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. |
| \S | \S exists but is ASCII based. Use [^\t\n\f\r\p{Z}] instead. | Match a non-white space character. |
| \t | same | Match a HORIZONTAL TABULATION, \u0009. |
| \uhhhh | same | Match the character with the hex value hhhh. |
| \Uhhhhhhhh | Does not exist. | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. |
| \w | \w exists but is ASCII based. Use [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}] instead. | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. |
| \W | \W exists but is ASCII based. Use [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}] instead. | Match a non-word character. |
| \x{hhhh} | Does not exist. Use \uhhhh instead. | Match the character with hex value hhhh. |
| \xhh | same | Match the character with two digit hex value hh. |
| \X | Does not exist. | Match a Grapheme Cluster. |
| \Z | same | Match if the current position is at the end of input, but before the final line terminator, if one exists. |
| \z | same | Match if the current position is at the end of input. |
| \0nnn | same | Match the character with octal value nnn. |
| \n | same | Back Reference. Match whatever the nth capturing group matched. n must be >= 1 and <= the total number of capture groups in the pattern. |
| [pattern] | same | Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern. |
| . | same | Match any character. |
| ^ | same | Match at the beginning of a line. |
| $ | same | Match at the end of a line. |
| \ | same | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ \| \ . / |
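The "Use ... instead" entries in the table can be applied mechanically. The following is a minimal sketch of such a rewrite, not the Okapi implementation: IcuToJavaSketch, toJava and replacementFor are hypothetical names. It only handles \d, \D, \s, \S, \w, \W and \x{hhhh} with up to four hex digits, and it leaves everything else untouched, including the entries the table marks as not existing in Java. Shorthands that appear inside a set, in particular a negated set, are replaced by a nested set and should be reviewed by hand.

```java
import java.util.regex.Pattern;

public class IcuToJavaSketch {

    // Returns the Java replacement listed in the table, or null if none applies.
    static String replacementFor(char shorthand) {
        switch (shorthand) {
            case 'd': return "[\\p{Nd}]";
            case 'D': return "[^\\p{Nd}]";
            case 's': return "[\\t\\n\\f\\r\\p{Z}]";
            case 'S': return "[^\\t\\n\\f\\r\\p{Z}]";
            case 'w': return "[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";
            case 'W': return "[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";
            default:  return null;
        }
    }

    // Scans the ICU pattern once and rewrites the escapes covered by this sketch.
    static String toJava(String icu) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < icu.length()) {
            char c = icu.charAt(i);
            if (c != '\\' || i + 1 >= icu.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = icu.charAt(i + 1);
            if (next == 'Q') {
                // Copy a \Q ... \E quoted run unchanged: its content is literal text.
                int end = icu.indexOf("\\E", i + 2);
                int stop = (end < 0) ? icu.length() : end + 2;
                out.append(icu, i, stop);
                i = stop;
                continue;
            }
            if (next == 'x' && i + 2 < icu.length() && icu.charAt(i + 2) == '{') {
                // Rewrite \x{h} .. \x{hhhh} as the four-digit backslash-u escape.
                int close = icu.indexOf('}', i + 3);
                String hex = (close < 0) ? "" : icu.substring(i + 3, close);
                if (hex.matches("[0-9A-Fa-f]{1,4}")) {
                    out.append("\\u").append("0000".substring(hex.length())).append(hex);
                    i = close + 1;
                    continue;
                }
            }
            // Any other escape, including an escaped backslash, is copied as a pair.
            String mapped = replacementFor(next);
            out.append(mapped != null ? mapped : "\\" + next);
            i += 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJava("\\d+\\s")); // [\p{Nd}]+[\t\n\f\r\p{Z}]

        String arabicIndicDigits = "\u0663\u0664"; // two characters of category Nd
        System.out.println(Pattern.matches("\\d+", arabicIndicDigits));         // false
        System.out.println(Pattern.matches(toJava("\\d+"), arabicIndicDigits)); // true
    }
}
```

Scanning the pattern character by character, rather than using a plain string replacement, keeps an escaped backslash followed by a letter (for example a literal backslash and then "d") from being mistaken for a shorthand.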