OpenOffice Filter

From Okapi Framework

Overview

The OpenOffice Filter is an Okapi component that implements the IFilter interface for OpenOffice.org documents: ODT (text), ODS (spreadsheet), ODP (slides), ODG (graphics), and their corresponding template formats.

These documents use the OpenDocument format (ODF). If you need to process an ODF XML file directly, you can use the ODF Filter, which the OpenOffice Filter uses internally.

Processing Details

Encodings

The input encoding is automatically detected.

Any user-specified output encoding is ignored by these filters. They always use UTF-8.

Line-Breaks

The type of line-breaks of the output is always set to a simple linefeed (LF).
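
As an illustration of what these two conventions mean for the output bytes, here is a minimal Python sketch. The function name `to_filter_output` is hypothetical and this is not Okapi code; it only shows text being normalized to LF line-breaks and encoded as UTF-8.

```python
def to_filter_output(text: str) -> bytes:
    """Normalize line-breaks to LF and encode the result as UTF-8.

    Hypothetical helper for illustration only, not part of Okapi.
    """
    # Convert CRLF first so that it does not become two linefeeds.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.encode("utf-8")
```

Any tool that post-processes the output can therefore assume UTF-8 and LF, whatever the platform or the requested output encoding.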

Sub-Documents

An OpenOffice document is a ZIP file with several documents inside. The main one (content.xml) contains the body of the data, but other files may also contain translatable text: meta.xml and styles.xml.

All the different embedded files are treated as sub-documents by the filter. This means that, for example, a single ODT file extracted to a single XLIFF document is represented by three XLIFF <file> elements: one for content.xml, one for styles.xml, and one for meta.xml. Note that very often only content.xml has extracted text.
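
To see which sub-documents a given file contains, you can inspect the ZIP package yourself. The following is a minimal Python sketch using only the standard library. The names `list_subdocuments` and `make_sample_package` are hypothetical helpers, not part of Okapi, and the sample package is a stand-in for a real ODT file. The entry names follow the ODF package layout, where styles are stored in styles.xml.

```python
import io
import zipfile

# Entries that may carry translatable text (ODF package layout).
TRANSLATABLE_ENTRIES = ("content.xml", "styles.xml", "meta.xml")


def list_subdocuments(package: bytes) -> list[str]:
    """Return the translatable entries present in an ODF package, in a fixed order."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        names = set(archive.namelist())
    return [name for name in TRANSLATABLE_ENTRIES if name in names]


def make_sample_package() -> bytes:
    """Build a tiny in-memory ZIP that mimics the layout of an ODT file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<doc/>")
        archive.writestr("styles.xml", "<doc/>")
        archive.writestr("meta.xml", "<doc/>")
        archive.writestr("settings.xml", "<doc/>")  # not a translatable entry
    return buffer.getvalue()
```

For a real document, read the file as bytes (for example with `open(path, "rb").read()`) and pass the result to `list_subdocuments`.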

Parameters

Extract notes: Set this option to extract the content of <office:annotation> elements (notes) as translatable text. If this option is not set, notes are not extracted.

Extract references: Set this option to extract the content of <text:bookmark-ref> elements. The content of these elements is only a copy of the content of the referent. OpenOffice updates it automatically, so any translation of this content is overwritten as soon as the document is updated. However, in some cases it may be useful to have the referenced text as part of the segment where it is inserted.

Extract Metadata: Set this option to extract the metadata stored in meta.xml. This option is set by default.

Encode Character Entity Reference Glyphs: This option is set by default.
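
To check how much text the notes and references options would affect in a given document, you can look for the two elements in content.xml. The sketch below is a hedged illustration using Python's standard XML parser. The function `optional_text` and the `SAMPLE` fragment are hypothetical and heavily simplified, and the code does not reproduce how the filter itself parses the file.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument format.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def optional_text(content_xml: str, qualified_name: str) -> list[str]:
    """Return the text of every element with the given prefixed name."""
    root = ET.fromstring(content_xml)
    return [
        "".join(element.itertext()).strip()
        for element in root.iterfind(f".//{qualified_name}", NS)
    ]


# Simplified stand-in for the content.xml of a real document.
SAMPLE = (
    "<office:document-content"
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:p>See <text:bookmark-ref text:reference-format="text"'
    ' text:ref-name="intro">Introduction</text:bookmark-ref>.'
    "<office:annotation><text:p>Check this term</text:p></office:annotation>"
    "</text:p></office:document-content>"
)
```

Calling `optional_text(SAMPLE, "office:annotation")` lists the note text, and `optional_text(SAMPLE, "text:bookmark-ref")` lists the copied reference text.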


Limitations

  • Some deleted text may get extracted. Make sure you have accepted or rejected all tracked changes before processing the input document, as currently some text marked as deleted is still extracted.
  • The options to extract or skip the notes and the references are not working yet. Notes and references will be extracted regardless of the option settings.
  • Sequential tabs may get reduced to a single tab during an extraction and merge round trip: The elements for spaces and tabs are supported in output but still incorrectly handled on input.
  • The target (output) encoding must be set to UTF-8 when extracting the documents, so that they can be merged back properly.
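
Two of these limitations can be checked before processing. The following Python sketch is one possible pre-flight check, under stated assumptions: the function `preflight` and both sample fragments are hypothetical, the samples are much simpler than a real content.xml, and the check relies on the ODF elements <text:deletion> (tracked deletions) and <text:tab> (tabs).

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DELETION = f"{{{TEXT_NS}}}deletion"
TAB = f"{{{TEXT_NS}}}tab"


def preflight(content_xml: str) -> dict[str, bool]:
    """Report tracked deletions and runs of adjacent tabs in content.xml."""
    root = ET.fromstring(content_xml)
    has_deletions = root.find(f".//{DELETION}") is not None
    has_tab_runs = False
    for parent in root.iter():
        children = list(parent)
        # Two tab elements are sequential when no text separates them.
        for first, second in zip(children, children[1:]):
            if first.tag == TAB and second.tag == TAB and not (first.tail or ""):
                has_tab_runs = True
    return {"tracked_deletions": has_deletions, "sequential_tabs": has_tab_runs}


# Simplified samples; a real content.xml has an office:document-content root.
RISKY = (
    f'<root xmlns:text="{TEXT_NS}">'
    "<text:tracked-changes><text:changed-region><text:deletion>"
    "<text:p>old text</text:p>"
    "</text:deletion></text:changed-region></text:tracked-changes>"
    "<text:p>A<text:tab/><text:tab/>B</text:p>"
    "</root>"
)
CLEAN = f'<root xmlns:text="{TEXT_NS}"><text:p>A<text:tab/>B<text:tab/>C</text:p></root>'
```

A document for which `preflight` reports `tracked_deletions` should have its changes accepted or rejected first. One that reports `sequential_tabs` is worth comparing against the original after the merge.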

Please report any other issues to the Issues List of the project.