Microsoft Batch Translation Step


Overview

This step annotates the text units of the input documents with Microsoft Translator candidates, creates a TM from them, or both.

Takes: Filter events. Sends: Filter events (possibly annotated) or raw document.

You must have a "Client ID" and a "Client Secret" from Microsoft to use this step. You get those by obtaining a Windows Live ID and then registering an application in your Live account. See the MSDN pages for more information.

You must also respect Microsoft's Terms of Service. If you intend to use the Microsoft Translator API for commercial or high-volume purposes, you need to sign a commercial license agreement and provide your AppID to the Microsoft Translator team. For more details, contact mtlic@microsoft.com.

Text units flagged as non-translatable are not sent for translation.

Note that using the Leveraging Step with the Microsoft Translator Connector produces MT results similar to this step. However, this step can process several text units at once and is therefore much faster.

MT output can be improved automatically in some cases. For example, extra or missing spaces around inline codes can be fixed with the Space Check Step.

Parameters

Client ID — The Client ID to use to connect to the MT server. See the MSDN pages for more information.

Client Secret — The secret corresponding to the Client ID.

Category — An optional category to use when working with trained engines. You can either enter the engine identifier directly (called 'category' in Microsoft Translator Hub), or use a keyword in the form @@@keyword@@@. If you specify a keyword, you must also specify a properties file in the Engine Mapping field.

The keyword can be a literal string or the ${domain} variable. When ${domain} is used, the variable is replaced by the first occurrence of the value for the ITS Domain annotation found on a text unit. Ideally this Domain annotation should be set on the first text unit of the first document processed. All batches of events translated before a domain annotation is found are translated with the empty category.

Note: As stated above, only the first occurrence of the Domain annotation has an effect on the selection of the engine.
Note: Because this step works on batches, segments that come before the first occurrence of the Domain annotation but are in the same batch are translated with the engine specified by the domain. For example, if you have 100 events, the Events buffer is set to 50, and the first occurrence of the Domain annotation is in the 60th event, then the first 50 events are translated with the empty category and all the other events, including events 51 to 59, are translated with the engine corresponding to the specified domain.
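The batching behavior described in the two notes above can be illustrated with a small simulation. This is only a sketch of the rule as stated on this page, not the step's actual code: the function name categories_per_event and the representation of events as a list of optional domain strings are assumptions made for the illustration.

```python
from typing import List, Optional

def categories_per_event(domains: List[Optional[str]], buffer_size: int) -> List[str]:
    """Return the category applied to each event, processing events in batches.

    domains[i] is the Domain annotation of event i, or None if it has none.
    (Hypothetical model: the real step works on filter events, not strings.)
    """
    result: List[str] = []
    active = ""  # empty category until a Domain annotation has been seen
    for start in range(0, len(domains), buffer_size):
        batch = domains[start:start + buffer_size]
        if active == "":
            # Only the first occurrence counts; it applies to the whole batch.
            active = next((d for d in batch if d), "")
        result.extend([active] * len(batch))
    return result

# 100 events, buffer of 50, first Domain annotation on the 60th event.
domains: List[Optional[str]] = [None] * 100
domains[59] = "travel"
used = categories_per_event(domains, buffer_size=50)
```

With these inputs, events 1 to 50 get the empty category and events 51 to 100 get "travel", which matches the example in the note.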

Engine Mapping — Enter the path of the properties file that contains the mapping between the category keywords and the Microsoft Translator Hub engine identifier. You can use the variables ${rootDir} and ${inputRootDir}, as well as any of the source or target locale variables (${srcLoc}, ${trgloc}, etc). Leave the path empty to not use a mapping. The properties file is a list of lines in the form:

<keyword>.<language>=<engineID>

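Where:

<keyword> is the category keyword, without the @@@ delimiters.
<language> is the code of the target language, as in the FR and DE entries of the example below.
<engineID> is the Microsoft Translator Hub engine identifier (the category).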

For example, if you have the following engine mapping file:

travel.FR=11111111-2222-3333-4444-e42f530c98b8_tra
client1.DE=11111111-2222-3333-4444-90dd26cc9dsd3_gen
client1.tech.DE=11111111-2222-3333-4444-90dd26cc9d48_tech
client2.DE=11111111-2222-3333-4444-90dd26cc9ds34_gen

To use the first engine (assuming you are translating into French), specify @@@travel@@@ in the Category. To use the third engine, specify @@@client1.tech@@@. There is also a fallback mechanism: if you specify @@@client2.law@@@, the step first looks for client2.law.DE and, if that entry is not found, looks for client2.DE. If no custom engine is found, the generic engine provided by Microsoft is used.
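The lookup and fallback described above can be sketched as follows. This is a hypothetical illustration, not the step's implementation: the function names are invented, the language code is assumed to be passed exactly as it appears in the file (FR, DE), and dropping one trailing segment at a time over several levels is an assumption that extends the single-level fallback documented here.

```python
import re
from typing import Dict

MAPPING_TEXT = """\
travel.FR=11111111-2222-3333-4444-e42f530c98b8_tra
client1.DE=11111111-2222-3333-4444-90dd26cc9dsd3_gen
client1.tech.DE=11111111-2222-3333-4444-90dd26cc9d48_tech
client2.DE=11111111-2222-3333-4444-90dd26cc9ds34_gen
"""

def parse_mapping(text: str) -> Dict[str, str]:
    """Parse <keyword>.<language>=<engineID> lines into a dictionary."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping

def resolve_category(category: str, language: str, mapping: Dict[str, str]) -> str:
    """Return the engine identifier to use, or "" for the generic engine."""
    match = re.fullmatch(r"@@@(.+)@@@", category)
    if match is None:
        return category  # not a keyword: treated as a direct engine identifier
    keyword = match.group(1)
    while keyword:
        engine = mapping.get(f"{keyword}.{language}")
        if engine is not None:
            return engine
        # Assumed fallback: drop the last dot-separated segment and retry.
        keyword = keyword.rpartition(".")[0]
    return ""  # no custom engine found

mapping = parse_mapping(MAPPING_TEXT)
engine_for_law = resolve_category("@@@client2.law@@@", "DE", mapping)
```

Here engine_for_law is the client2.DE engine, because there is no client2.law.DE entry in the mapping.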

Events buffer — Enter the number of events to buffer for a single query to the engine. The larger the buffer, the faster the processing. However, there are also limits on the volume of text you can process at once.

Maximum matches — Enter the maximum number of matches you want to allow per source text.

Threshold — Enter the score below which a match is not kept as a result. See the Microsoft Translator Connector to understand how scores are computed based on the match degree and rating values.
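As a rough illustration of how Maximum matches and Threshold interact, the sketch below filters a list of scored candidates. It is an assumption-laden model, not the step's code: matches are represented as hypothetical (score, text) tuples, and applying the maximum locally after sorting by score is an assumption (the limit may instead be passed to the service).

```python
from typing import List, Tuple

Match = Tuple[int, str]  # (score, translation), a hypothetical representation

def keep_matches(matches: List[Match], threshold: int, max_matches: int) -> List[Match]:
    """Drop matches scoring below the threshold, then keep the best ones."""
    kept = [m for m in matches if m[0] >= threshold]
    # Assumed ordering: highest score first, before applying the maximum.
    kept.sort(key=lambda m: m[0], reverse=True)
    return kept[:max_matches]

candidates = [(70, "a"), (95, "b"), (80, "c"), (99, "d")]
kept = keep_matches(candidates, threshold=80, max_matches=2)
```

With a threshold of 80 and a maximum of 2, the candidate scoring 70 is discarded by the threshold and the one scoring 80 by the maximum.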

Query only entries without existing candidate — Set this option to send to Microsoft Translator only the text for which there is currently no candidate. Existing candidates are annotations added by previous steps or coming from the original document.

Annotate the text units with the translations — Set this option to add to the text units annotations that hold the matches found. Those annotations may be used later by other steps. Existing annotations are preserved.

Generate a TMX document — Set this option to create a TMX output. Enter the full path of the TMX document to generate. If a document already exists at that path, it is overwritten. You can use the variables ${rootDir} and ${inputRootDir}, as well as any of the source or target locale variables (${srcLoc}, ${trgloc}, etc).

Send the TMX document to the next step — Set this option to have the generated TMX document as the only input raw document passed on to the next step of the pipeline. If this option is not set the filter events are passed on to the next step.

Mark the generated translation as machine translation results — Set this option to mark the generated TM entries as the result of machine translation. For example, when this option is set, the creationid attribute of the TMX <tu> element is set to "MT!".

Fill the target with the best translation candidate — Set this option to copy the translation with the best score (and a score above or equal to the Fill threshold) into the target (if it is empty). Only the matches returned by the Microsoft Translator engine are taken into account.

Fill threshold — If the score of the best match is below the provided value, no translation is copied into the target.
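The two fill options above amount to a simple rule: fill only an empty target, and only with a best match that reaches the Fill threshold. The sketch below illustrates that rule; the function name fill_target and the (score, text) tuples are hypothetical, and the real step works on text unit objects rather than strings.

```python
from typing import List, Tuple

def fill_target(target: str, matches: List[Tuple[int, str]], fill_threshold: int) -> str:
    """Return the target text after applying the fill rule described above."""
    if target:
        return target  # a non-empty target is never overwritten
    if not matches:
        return target  # nothing came back from the engine
    best_score, best_text = max(matches, key=lambda m: m[0])
    if best_score < fill_threshold:
        return target  # best match is below the Fill threshold
    return best_text

mt_matches = [(90, "Bonjour"), (95, "Salut")]
filled = fill_target("", mt_matches, fill_threshold=95)
```

In this example the best score equals the threshold, so the target is filled, consistent with the "above or equal to" wording of the option.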

Limitations
