Moses Text Filter

From Okapi Framework

Overview

The Moses Text Filter is designed to process InlineText files used by the Moses MT system. The description of the format is provided in this document.

The following is an example of a Moses InlineText file. Everything inside angle brackets is an inline code or a line-break marker; the rest is translatable text.

Text in the first entry.
Text of the second entry<lb/>which spans<lb/>several lines
Third entry.
Fourth entry with <g id="1">bold words</g> and some code:<x id="2"/>

The example above has four lines, read as four different text units.

Inline codes are represented by <g id="N">...</g>, <x id="N"/>, <bx id="N"/> and <ex id="N"/>, where N is the identifier of the code.

Line-breaks are represented by <lb/>.
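
As an illustration of the notation above, here is a minimal Python sketch that splits one entry into text and inline-code parts. It is not the filter's implementation: the names `TAG`, `parse_entry` and `plain_text` are hypothetical, and the sketch assumes codes appear exactly in the forms listed above and does not handle any character escaping.

```python
import re

# Matches the inline codes (g, x, bx, ex) and the line-break marker (lb).
# Hypothetical pattern written for this sketch, not taken from the filter.
TAG = re.compile(r'<(/?)(g|x|bx|ex|lb)(?:\s+id="([^"]*)")?\s*(/?)>')


def parse_entry(line):
    """Split one InlineText line into ('text', str) and ('code', str) parts."""
    parts = []
    pos = 0
    for match in TAG.finditer(line):
        if match.start() > pos:
            parts.append(("text", line[pos:match.start()]))
        if match.group(2) == "lb":
            # A line-break marker becomes a real line-break in the text.
            parts.append(("text", "\n"))
        else:
            parts.append(("code", match.group(0)))
        pos = match.end()
    if pos < len(line):
        parts.append(("text", line[pos:]))
    return parts


def plain_text(parts):
    """Join only the text parts, dropping the inline codes."""
    return "".join(value for kind, value in parts if kind == "text")
```

Each line of the file would be passed to `parse_entry` separately, since every line is read as its own text unit.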

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has a BOM, it is used to determine the encoding.
  • Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
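
The two rules above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the set of BOMs checked here (UTF-8, UTF-16, UTF-32) is an assumption, since the page does not say which BOMs the filter recognizes.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 ones, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
# Assumption: this list of BOMs is illustrative, not the filter's own.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(raw, requested_default=None):
    """Return (encoding name, BOM length) for the raw bytes of a document.

    requested_default is accepted but deliberately ignored, mirroring the
    rule that UTF-8 is used whenever no BOM is present.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_input(raw, requested_default=None):
    """Decode the document bytes, skipping the BOM if there is one."""
    encoding, bom_length = detect_encoding(raw, requested_default)
    return raw[bom_length:].decode(encoding)
```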

Output Encoding

The output encoding of the file is always forced to UTF-8.

Line-Breaks

The output uses the same type of line-break as the original input.
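
The output behaviour described in the two sections above could look roughly like the following Python sketch. The function names are hypothetical, and two details are assumptions rather than documented behaviour: the line-break type is taken from the first line-break found in the input, and every entry is written with a trailing line-break.

```python
def detect_line_break(text):
    """Return the first line-break style found in text, or LF if none."""
    for index, char in enumerate(text):
        if char == "\r":
            # CR followed by LF is a Windows line-break, a lone CR is old Mac.
            return "\r\n" if text[index + 1:index + 2] == "\n" else "\r"
        if char == "\n":
            return "\n"
    return "\n"


def encode_output(entries, line_break):
    """Write one entry per line with the given line-break, always in UTF-8."""
    return "".join(entry + line_break for entry in entries).encode("utf-8")
```

`detect_line_break` is meant to run on the decoded input text, and its result is then passed to `encode_output` so that the output keeps the input's line-break type.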

Original Document and Moses InlineText File

You can use the Moses InlineText Extraction Step to create Moses files from various input file formats, including XLIFF.

It is important to understand that reading a document and reading its corresponding extracted Moses file may give you different text units. The reason is that each segment (or unsegmented text unit) of the original document is extracted as a single entry in the Moses file. When a text unit of the original document contains several segments, several entries are generated.

Because the Moses file has no means of marking that several entries belong to the same text unit, reading the Moses file gives you more text units than there are in the original document.

To know exactly which original text unit a Moses file entry corresponds to, you have to process both the original file and its corresponding Moses file. This is done, for example, in the Moses InlineText Leveraging Step, where the Moses file's entries are re-grouped into their original text units.
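
The relationship between text units and Moses entries can be illustrated with a short Python sketch. This is not how the Leveraging Step is implemented: `extract_entries` and `regroup` are hypothetical helpers, and the sketch assumes the number of segments in each original text unit is already known from reading the original document.

```python
def extract_entries(text_units):
    """Flatten text units (lists of segments) into one Moses entry per segment."""
    return [segment for unit in text_units for segment in unit]


def regroup(entries, segment_counts):
    """Rebuild the original text units from a flat list of Moses entries.

    segment_counts holds the number of segments of each original text unit,
    in document order (an unsegmented text unit counts as 1).
    """
    if sum(segment_counts) != len(entries):
        raise ValueError("entries do not match the original segmentation")
    units = []
    position = 0
    for count in segment_counts:
        units.append(entries[position:position + count])
        position += count
    return units
```

For example, an original document with three text units of 1, 2 and 1 segments produces four Moses entries, and only the segment counts from the original allow the second and third entries to be recognized as one text unit.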

Parameters

This filter has no parameters.

Limitations

None known.