Moses Text Filter

From OkapiWiki

Overview

The Moses Text Filter is designed to process InlineText files used by the Moses MT system. The description of the format is provided in this document.

The following is an example of a Moses InlineText file. The translatable text is the plain text of each line, while markup elements such as <g id="1">...</g> and <x id="2"/> are the inline codes.

Text in the first entry.
Text of the second entry<lb/>which spans<lb/>several lines
Third entry.
Fourth entry with <g id="1">bold words</g> and some code:<x id="2"/>

The example above has four lines, read as four different text units.

Inline codes are represented by <g id="N">, <x id="N"/>, <bx id="N"/> and <ex id="N"/>, where N is the identifier of the code.

Line-breaks are represented by <lb/>.
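
As an illustration of the format, the following Python sketch splits one entry into its plain text and its list of inline codes, turning each <lb/> into a line-break. The names MARKER and parse_entry are hypothetical and are not part of the Okapi API. The sketch also assumes that the markers are always well-formed, and it does not deal with any character escaping.

```python
import re

# Matches the markers described above: <g id="N">, </g>, <x id="N"/>,
# <bx id="N"/>, <ex id="N"/> and <lb/>.
# MARKER and parse_entry are illustrative names, not Okapi identifiers.
MARKER = re.compile(r'<(/?)(g|x|bx|ex|lb)(?:\s+id="(\d+)")?\s*(/?)>')


def parse_entry(line):
    """Return (plain_text, codes) for one Moses InlineText entry.

    codes is a list of (tag, id) tuples in document order; a closing
    </g> is reported as ("/g", None).
    """
    text, codes, pos = [], [], 0
    for m in MARKER.finditer(line):
        text.append(line[pos:m.start()])
        closing, name, code_id, _ = m.groups()
        if name == "lb":
            # A line-break marker becomes a real line-break in the text.
            text.append("\n")
        else:
            codes.append(("/" + name if closing else name, code_id))
        pos = m.end()
    text.append(line[pos:])
    return "".join(text), codes


text, codes = parse_entry(
    'Fourth entry with <g id="1">bold words</g> and some code:<x id="2"/>'
)
print(text)   # Fourth entry with bold words and some code:
print(codes)  # [('g', '1'), ('/g', None), ('x', '2')]
```

Applied to each line of the example above, this gives four results, one per text unit.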

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

Output Encoding

The output encoding of the file is always forced to UTF-8.

Line-Breaks

The output uses the same type of line-break as the original input.
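
The two output rules (UTF-8 encoding, line-breaks preserved from the input) can be sketched in Python as follows. This only illustrates the behavior and is not the filter's actual code; detect_line_break and encode_output are hypothetical names.

```python
def detect_line_break(raw_text):
    """Guess the line-break style of the input text.

    Checks for Windows (CR+LF) first, then old Mac (CR), and falls
    back to Unix (LF). Hypothetical helper, not an Okapi function.
    """
    if "\r\n" in raw_text:
        return "\r\n"
    if "\r" in raw_text:
        return "\r"
    return "\n"


def encode_output(entries, line_break):
    """Serialize entries with the given line-break, always as UTF-8."""
    return "".join(entry + line_break for entry in entries).encode("utf-8")


raw = "First entry.\r\nSecond entry.\r\n"
line_break = detect_line_break(raw)
data = encode_output(["First entry.", "Second entry."], line_break)
```

When writing such bytes to disk, open the file in binary mode so that the platform does not translate the line-breaks again.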

Original Document and Moses InlineText File

You can use the Moses InlineText Extraction Step to create Moses files from various input file formats, including XLIFF.

It is important to understand that reading a document and reading its corresponding extracted Moses file may give you different text units. The reason is that each segment (or unsegmented text unit) of the original document is extracted as a single entry in the Moses file. When a text unit of the original document contains several segments, several entries are generated.

Because the Moses file does not have any means to mark that several entries belong to the same text unit, when you read the Moses file you will get more text units than there are in the original document.

To know exactly to which original text unit a Moses file entry corresponds, you have to process both the original file and its corresponding Moses file. This is done, for example, in the Moses InlineText Leveraging Step where the Moses file's entries are re-grouped into their original text units.
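
The re-grouping idea can be sketched as follows, assuming you have already read the original document and know how many segments each of its text units contains (one for an unsegmented text unit). The function regroup_entries and its segment_counts argument are hypothetical and do not reflect how the Leveraging Step is actually implemented.

```python
def regroup_entries(entries, segment_counts):
    """Group a flat list of Moses entries back into text units.

    entries: the entries read from the Moses file, in order.
    segment_counts: for each original text unit, the number of entries
    it produced at extraction time (assumed to come from the original
    document).
    """
    if sum(segment_counts) != len(entries):
        raise ValueError("entry count does not match the original document")
    units, pos = [], 0
    for count in segment_counts:
        units.append(entries[pos:pos + count])
        pos += count
    return units


# Three original text units; the second one had two segments.
units = regroup_entries(["A.", "B one.", "B two.", "C."], [1, 2, 1])
print(units)  # [['A.'], ['B one.', 'B two.'], ['C.']]
```

The length check guards against pairing a Moses file with the wrong original document, or with one that was segmented differently.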

Parameters

This filter has no parameters.

Limitations

None known.
