Batch Translation Step

From Okapi Framework

Overview

This step creates a translation memory from the text units extracted from a raw document, using an external tool to provide the translation.

Takes: Raw document. Sends: Raw document

The step performs the following sequence of actions:

  • The step takes a raw document and extracts its content using its associated filter.
  • The result of the extraction is put in a temporary HTML document where each paragraph corresponds to a source segment (or a source text unit if the text units are not segmented). That temporary HTML document corresponds to the ${inputPath} and ${inputURI} variables.
  • The external tool is then invoked using a user-defined command-line. This allows the external tool to take the temporary HTML document, translate it, and create a translated output for it. The external tool must create the translated HTML and that output must correspond to the ${outputPath} and ${outputURI} variables.
  • Once the command-line is executed, the step takes the translated output and builds a set of aligned entries, each pairing the source of an extracted segment with the translation provided by the external tool.
  • The translated entries are placed in a Pensieve TM, a TMX document, or both.
  • The step sends the input raw document it processed (and left unmodified) to the next step, unless the option Send the TMX document to the next step is set.

Each input document can be processed using one or more temporary HTML files, which allows tools with size limitations to translate very large documents.

The text units extracted from the input document can be segmented using SRX rules if needed.
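
The contract between the step and the external tool described above can be illustrated with a minimal stand-in for the tool. This is only a sketch under stated assumptions, not the format Okapi guarantees: it assumes each paragraph of the temporary HTML document is a <p> element and that the file is UTF-8 encoded. The translate function is a hypothetical placeholder that only prefixes the text with a marker; a real tool would call a translation engine there.

```python
import re
import sys

# Assumption: each source segment is wrapped in a <p>...</p> element.
PARAGRAPH = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL | re.IGNORECASE)


def translate(text: str) -> str:
    """Hypothetical placeholder: marks the text instead of translating it."""
    return "[MT] " + text


def translate_html(html: str) -> str:
    """Translate the content of every paragraph, leaving the markup around it untouched."""
    return PARAGRAPH.sub(
        lambda m: m.group(1) + translate(m.group(2)) + m.group(3), html
    )


def main(input_path: str, output_path: str) -> None:
    # Assumption: the temporary document is UTF-8 encoded.
    with open(input_path, encoding="utf-8") as src:
        html = src.read()
    with open(output_path, "w", encoding="utf-8") as out:
        out.write(translate_html(html))


# Run only when called with exactly an input path and an output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

If this script were saved under a hypothetical name such as translate_stub.py, the command line of the step could be: python translate_stub.py ${inputPath} ${outputPath}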

Parameters

Command line — Enter the command-line to use. The command line must take the temporary HTML document named ${inputPath} and generate an output document in the same format named ${outputPath}. You can use the following variables in the command line:

  • ${inputPath} = The full path of the input document.
  • ${inputURI} = The URI of the input document.
  • ${outputPath} = The full path of the output document.
  • ${srcLangName} = English name of the language part of the source locale identifier. For example: for "de-ch" this returns "German".
  • ${trgLangName} = English name of the language part of the target locale identifier. For example: for "ja-jp" this returns "Japanese".
  • ${rootDir} = The root directory for this project/batch. In Rainbow: the parameters folder.
  • ${inputRootDir} = The root directory for the input files. In Rainbow: the root directory of the input files.

In addition, the command line can also use any of the locale-related variables:

  • ${srcLang} = Source language code as defined in the source Language field. For example: en-US.
  • ${srcLangU} = Source language code in uppercase. For example: EN-US.
  • ${srcLangL} = Source language code in lowercase. For example: en-us.
  • ${srcLoc} = Source locale code (language in lowercase, region in uppercase, with a _ separator). For example: en_US.
  • ${srcLocLang} = The language part of the source locale (in lowercase). For example: en.
  • ${srcLocReg} = The region part of the source locale (in uppercase). For example: US. Or empty if no region is specified.
  • ${trgLang} = Target language code as defined in the target Language field. For example: fr-CA.
  • ${trgLangU} = Target language code in uppercase. For example: FR-CA.
  • ${trgLangL} = Target language code in lowercase. For example: fr-ca.
  • ${trgLoc} = Target locale code (language in lowercase, region in uppercase, with a _ separator). For example: fr_CA.
  • ${trgLocLang} = The language part of the target locale (in lowercase). For example: fr.
  • ${trgLocReg} = The region part of the target locale (in uppercase). For example: CA. Or empty if no region is specified.

The names of the variables are case-sensitive. However, for backward compatibility, the first letter can be in uppercase (e.g. ${SrcLang}).

Any locale-related variable (i.e. one with "Loc" in its name) behaves unpredictably if the corresponding language code declared in the Languages and Encodings tab is not compatible with a locale notation.
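
The way the locale variables relate to a language code can be sketched as follows. This is an illustration only, not Okapi's own implementation: the locale_variables function and the two file paths are hypothetical, and the sketch assumes simple codes of the form language or language-region. Python's string.Template happens to use the same ${name} notation, so it is used here to expand a command line.

```python
from string import Template


def locale_variables(prefix: str, lang_code: str) -> dict:
    """Derive the locale-related variables for one side (prefix 'src' or 'trg').

    Assumption: lang_code is a simple 'language' or 'language-region' code.
    """
    parts = lang_code.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else ""
    return {
        f"{prefix}Lang": lang_code,
        f"{prefix}LangU": lang_code.upper(),
        f"{prefix}LangL": lang_code.lower(),
        f"{prefix}Loc": f"{language}_{region}" if region else language,
        f"{prefix}LocLang": language,
        f"{prefix}LocReg": region,
    }


values = {**locale_variables("src", "en-US"), **locale_variables("trg", "fr-CA")}
# Hypothetical paths standing in for the temporary documents.
values.update(inputPath="/tmp/in.html", outputPath="/tmp/out.html")

command = Template(
    "apertium -f html ${srcLocLang}-${trgLocLang} ${inputPath} ${outputPath}"
)
print(command.substitute(values))
# prints: apertium -f html en-fr /tmp/in.html /tmp/out.html
```

Whether the external tool accepts the resulting language pair name depends on the tool; the point of the sketch is only to show which variable yields which form of the code.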

Example 1: The following command-line uses the open-source Apertium program under Linux to translate the temporary HTML document.

apertium -f html ${srcLang}-${trgLang} ${inputPath} ${outputPath}

Example 2: The following command-line uses the commercial ProMT application under Windows to translate the temporary HTML document.

"C:\Program Files\PRMT9\FILETRANS\FileTranslator.exe" ${inputPath} /as /ac /d:${srcLangName}-${trgLangName} /o:${outputPath}

Block size — Enter the maximum number of text units that should be passed at the same time to the external tool. This allows you to process a very large input document even with external tools that can only process small documents.
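
The effect of the block size can be pictured as splitting the list of text units into consecutive groups, each of which goes to the external tool in its own temporary HTML file. The sketch below is only an illustration of that grouping; the blocks function is hypothetical and is not part of Okapi.

```python
from typing import Iterator, List


def blocks(text_units: List[str], block_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most block_size text units."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    for start in range(0, len(text_units), block_size):
        yield text_units[start:start + block_size]


units = ["unit 1", "unit 2", "unit 3", "unit 4", "unit 5"]
for number, block in enumerate(blocks(units, 2), start=1):
    print(number, block)
```

With five text units and a block size of 2, the external tool would be invoked three times, the last time with a single text unit.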

Origin identifier — Enter an optional string that identifies the translation. The given string is output as a property named Origin on the translated entry. For example, in a TMX output it is generated as <prop type="Txt::Origin">myText</prop>, where myText is the given string.

Mark the generated translation as machine translation results — Set this option to mark the generated TM entries as the result of machine translation. For example, when this option is set, the creationId attribute of the target in the generated output is set to "MT!".

Segment the text units, using the following SRX rules — Set this option to segment the extracted text units before sending them to the temporary HTML document. If this option is set, each paragraph of the HTML document is a sentence; if it is not set, each paragraph of the HTML document is an unsegmented paragraph. Note that only entries processed by the external tool are placed in the TMX output. Entries that already exist in the TM being populated or in the existing TM are not copied into the TMX output.

Enter the full path to the segmentation rules file in SRX that should be used to segment the text units. You can use the variables ${rootDir} and ${inputRootDir} in the path, as well as any of the locale variables.

Import into the following Pensieve TM — Set this option to import the translated entries into a given Pensieve TM. The entries added to the TM are indexed at the end of each input document. Therefore, while a given document is being processed, steps further down the pipeline can access only the entries generated from the previous documents.

Enter the directory of the Pensieve TM into which the entries are imported. If the TM does not exist, it is created. If the TM already exists, the entries are added to the existing TM. You can use the variables ${rootDir} and ${inputRootDir} in the path, as well as any of the locale variables.

Create the following TMX document — Set this option to create a TMX document with the translated entries. A single TMX file is created for all input documents. The file is not generated until the end of the last document (and therefore cannot be used by other steps down the pipeline).

Enter the full path of the TMX document to generate. If the file already exists, it is overwritten. You can use the variables ${rootDir} and ${inputRootDir} in the path, as well as any of the locale variables.

Send the TMX document to the next step — Set this option to send the generated TMX document as the raw-document sent to the next step, instead of the input document.

Check for existing entries in an existing TM — Set this option to look up, in an existing Pensieve TM, each entry that may be sent for translation. This allows you to send only the entries for which you do not have an existing translation. Existing entries are not re-processed and are not placed in the optional TMX output.

Directory of the existing TM — Enter the directory of the Pensieve TM in which to look up existing entries. This option is enabled only if the option Check for existing entries in an existing TM is set. You can use the variables ${rootDir} and ${inputRootDir} in the path, as well as any of the locale variables.
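
The filtering done by the existing-TM check can be sketched as below. This is a simplified illustration, not how Pensieve is queried: the existing TM is modelled as a plain dictionary from source text to translation, the match is an exact string comparison (the matching criteria actually used by the step are not described here), and the entries_to_send function is hypothetical.

```python
from typing import Dict, List


def entries_to_send(sources: List[str], existing_tm: Dict[str, str]) -> List[str]:
    """Keep only the source entries that have no translation in the existing TM.

    Assumption: an entry counts as existing when its source text matches exactly.
    """
    return [source for source in sources if source not in existing_tm]


# Hypothetical data standing in for the existing TM and the extracted entries.
existing_tm = {"Open the file.": "Ouvrez le fichier."}
sources = ["Open the file.", "Save the file.", "Close the window."]

print(entries_to_send(sources, existing_tm))
# prints: ['Save the file.', 'Close the window.']
```

Only the two entries returned here would reach the external tool, and only those two would appear in the optional TMX output.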

Limitations

None known.