Wanthalf's Lair

InterText

- parallel text alignment editor

InterText is an editor for aligned parallel texts. It was developed for the InterCorp project to edit and manage alignments of multiple parallel language versions of texts at the sentence level, but it is designed with flexibility in mind and supports custom XML documents and the Unicode character set.

There are two different editions of InterText: InterText server and InterText editor (in development).

InterText server

The software is written in PHP and uses a MySQL database as its back end.

Features:

  • can manage any number of texts
  • can manage any number of text (language) versions for each text
  • import and export of any valid XML document (see LIMITATIONS & KNOWN ISSUES!)
  • support of Unicode (UTF-8) by default
  • automatic conversion of custom entities into UTF-8 characters on import
  • arbitrary alignments between any pair of (language) versions of the same text
  • one-level alignment; every text version can define its own XML elements containing the text to be aligned
  • integration of the 'hunalign' and 'TCA2' automatic aligners
  • import and export of alignments in TEI XML format (stand-off alignment, no conversion, see below for details)
  • optional export of documents with 'corresp' attributes on aligned elements
  • optional export of documents with text segments enclosed in <seg> elements (for ParaConc compatibility)
  • possibility to edit text on-the-fly when editing alignments (can be forbidden on per-text basis)
  • keeps history of all changes to the text for later revision
  • possibility to change segmentation of elements (e.g. sentences) by splitting or merging them in the alignment editor (can be forbidden on per-text basis)
  • separate possibility to prevent the change of segmentation (structure) for 'pivot' text versions
  • automatic (one- or two-level) re-numbering of text elements after change in segmentation (structure)
  • possibility to set bookmarks in the alignment and jump quickly between them
  • possibility to search for substrings, fulltext search and regular expression based search in the texts (as limited by the MySQL-engine capabilities), search for "suspicious" alignments and edited/changed elements, etc.
  • basic workflow management based on three-level user hierarchy (no own user management, uses external database of users) and three-(four)-level status of alignments
  • command-line access to the import and export functions for batch-processing
  • triggers for external scripts on the change of alignment status

Technical details on file formats:

  • the system expects the use of (at most) two-part numbered id-attributes for all alignable elements; the separator can be any of the : , . - _ characters; their parent elements can be numbered by single numbers; prefixes to the id-attributes are possible, but they will be stripped on import (they can be regenerated on export as "long-ids" by the corresponding function in settings.php)
  • the default (re-)numbering scheme is to have plain numbers for containers (parents) and two-part numbers for alignable elements, separated by a colon (e.g. "12:3" for the third sentence (element) in the 12th paragraph (container))
  • if nested containers are detected in the document, only plain (one place) numbering of the alignable elements is applied and their parents (containers) are ignored (this feature was not tested properly)
  • the file format for the TEI alignment file is one single "linkGrp" element with two attributes: "toDoc" and "fromDoc"; their values should point to the filenames of the separate documents, or at least have the form "document_name.version_name.extensions"; the extensions are optional, but "document_name" and "version_name" are used to identify the aligned document versions of this alignment according to the names as declared in InterText; the "linkGrp" element then contains "link" elements, each corresponding to one position (segment) in the alignment, with the following attributes:
    • "xtargets" is a semicolon-separated list of the id-values of the elements linked together (first a space-separated list of element IDs from the "toDoc" document, and after the semicolon a space-separated list of element IDs from the "fromDoc" document);
    • "status" is an optional attribute giving the status of the link; known values are "man" (for a manually confirmed link), "auto" (for automatically aligned elements) and "plain" (for unaligned / unconfirmed / unknown status);
    • "mark" is used internally to preserve user bookmarks from the editor, only values 0 and 1 are known, but for 0 no attribute is generated at all
    • "type" is only generated on export for convenience, it gives a dash separated count of elements linked together by the link (e.g. "1-2")
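The alignment file layout described above can be read with a few lines of Python. This is only a minimal sketch, not InterText's own importer: the sample file names and IDs are made up, it assumes the file uses no XML namespace and that document names contain no dots, and it assumes an empty side of "xtargets" stands for a 0:1 or 1:0 link. The "type" value is recomputed here in the same order as "xtargets", which is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample built from the description above; names and IDs are invented.
SAMPLE = """<linkGrp toDoc="novel.en.xml" fromDoc="novel.cs.xml">
  <link type="1-1" xtargets="1:1;1:1" status="man"/>
  <link type="1-2" xtargets="1:2;1:2 1:3" status="auto" mark="1"/>
  <link type="0-1" xtargets=";1:4" status="plain"/>
</linkGrp>"""


def parse_doc_name(value):
    """Split "document_name.version_name.extensions" into (document, version)."""
    parts = value.split(".")
    return parts[0], parts[1]


def parse_alignment(xml_text):
    """Return (header, links) for a single linkGrp element."""
    root = ET.fromstring(xml_text)
    header = {
        "toDoc": parse_doc_name(root.get("toDoc")),
        "fromDoc": parse_doc_name(root.get("fromDoc")),
    }
    links = []
    for link in root.iter("link"):
        # IDs before the semicolon belong to toDoc, IDs after it to fromDoc.
        to_part, _, from_part = link.get("xtargets").partition(";")
        to_ids, from_ids = to_part.split(), from_part.split()
        links.append({
            "to": to_ids,
            "from": from_ids,
            "status": link.get("status"),      # optional, None if absent
            "mark": link.get("mark") == "1",   # no attribute means no bookmark
            "type": f"{len(to_ids)}-{len(from_ids)}",
        })
    return header, links
```

Calling `parse_alignment(SAMPLE)` yields the document and version names from the header plus one dictionary per link, which is usually enough for quick checks or format conversion.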

LIMITATIONS & KNOWN ISSUES

  • the package contains neither the hunalign nor the TCA2 automatic aligner
  • DOCTYPE, entity definitions and the XML declaration element are not imported (preserved) from the XML file (a problem of the PHP XMLReader module); the only preserved node types are: elements with their attributes (see below for exception!), text and CDATA contents, comments, processing instructions and whitespace formatting (CDATA not tested); you can (as a workaround) add your own DOCTYPE on export or otherwise modify the exported XML header by modification of the corresponding function in settings.php
  • the "id" attributes of elements are parsed and only the final numbers are extracted: two-place numbers for alignable elements (e.g. "12:3") and single numbers for the other elements; the alignable elements and their parents (containers) get renumbered when the document structure is changed in the editor; in the two-level numbering mode, any other elements will simply lose their id-attributes (i.e. they will be cut down to any final numbers, like the containers, if there were any); as a workaround, long id-attributes (and other IDs) can be restored/re-created on export by the corresponding function defined in settings.php; however, by configuring InterText to use single-level numbering of elements exclusively, the IDs of all elements except the alignable ones can be kept
  • current versions of "hunalign" are known to fail (segmentation fault) with texts larger than ca. 30000 elements (this is not a problem of InterText)
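To illustrate the ID handling described above, here is a rough Python sketch of how final numbers could be extracted from prefixed id-attributes and how the default two-level renumbering could look. It is not the code InterText uses (the server is written in PHP); the function names `short_id` and `renumber` are hypothetical, and numbering is assumed to start at 1, as the "12:3" example suggests.

```python
import re

# One or two numbers at the very end of the id, joined by one of : , . - _
FINAL_NUMBERS = re.compile(r"(?:(\d+)[:,.\-_])?(\d+)$")


def short_id(raw_id, alignable):
    """Strip any prefix and keep only the final number(s) of an id-attribute."""
    match = FINAL_NUMBERS.search(raw_id)
    if match is None:
        return None  # no trailing number at all
    container, element = match.groups()
    if alignable and container is not None:
        return f"{container}:{element}"  # normalised to the default colon
    return element  # other elements keep a single number only


def renumber(paragraphs):
    """Assign plain numbers to containers and "p:s" numbers to their elements.

    `paragraphs` is a list of containers, each a list of element texts.
    """
    result = []
    for p_no, sentences in enumerate(paragraphs, start=1):
        elements = [(f"{p_no}:{s_no}", text)
                    for s_no, text in enumerate(sentences, start=1)]
        result.append((str(p_no), elements))
    return result
```

For example, `short_id("doc.en.12:3", True)` gives `"12:3"`, while a container id such as `"p-7"` is reduced to `"7"`.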

Download

http://wanthalf.saga.cz/InterText-1.7.2.zip

For more information read:

  • INSTALL - installation instructions, set-up, customization and other administration-related information
  • help.php - user-manual, including details on the functions and principles of the system
  • UPDATE.txt - instructions for update

Changelog

Release 1.7.2 (2012-02-16)

  • changed files: aligner.php, header.php, css/intertext.css, help(_cs).php

Features:

  • redesigned changelog-view (hopefully more user friendly)
  • pagination using keyboard F7 and F8 keys should now finally work across various browsers
  • switch to turn on/off permanent display of all text changes in the top bar (I guess it was there since the last version, already?)
  • updated manual reflecting the new features (help)

Fixes:

  • permissions: it was possible to revert changes from changelog even for read-only documents
  • permissions: it seemed to be possible to edit and merge elements in a read-only document

Release 1.7.1 (2011-08-17)

  • changed files: aligner.php

Fixes:

  • search and bookmark search only worked in "auto roll" modes

Release 1.7 (2011-06-27)

  • changed files: lib_intertext.php, aligner.php, css/intertext.css, cli/export, settings.php
  • (optional changes: help.php, help_cs.php)

Features:

  • added logging of all alignment changes on demand (turned off by default; add into settings.php: $LOG_ALIGN_CHANGES=true; to activate) (no interface for this changelog; no undo!)
  • added a new option to settings.php: $FORCE_SIMPLE_NUMBERING = true; (turned off by default) which enforces single-level element renumbering and (as another side effect) makes it possible to keep the original ID attributes for all elements from the original document except for the alignable ones (which *must* yield to renumbering)
  • added export with long IDs based on the (original) filename (as used by the ECPC project) (settings.php & aligner.php)
  • now recursively showing history of changes for the deleted (< merged-in) elements as well

Fixes:

  • previously "fixed" problem with importing 0:1 or 1:0 alignments from TCA2 appeared again when reversing the imported alignments

Release 1.6.1 (2011-06-21)

  • changed files: lib_intertext.php

Fixes:

  • fixed the newly added check for gaps in imported alignments (it did not really work, but broke the import) (lib_intertext.php)
  • fixed problem with importing 0:1 or 1:0 alignments from TCA2, which links those to the parent element instead (lib_intertext.php)

Release 1.6 (2011-06-01)

  • changed files: lib_intertext.php, aligner.php, icons/changelog.png
  • (optional changes: help.php, help_cs.php)

Features:

  • imported alignments are checked for completeness; if there is a gap in the alignment, the import fails
  • the changed/edited elements can show their history of changes and their contents can be replaced with some previous contents (state) (un-splitting or un-merging is still only possible manually, though!)
  • searching for changed/edited elements
  • searching both sides (versions) at the same time
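The completeness check on imported alignments is not specified in detail here. One plausible reading, sketched below in Python under that assumption, is that the IDs listed on one side of the links must reproduce the document's alignable elements in order, with nothing skipped. The function name `first_mismatch` is hypothetical.

```python
from itertools import zip_longest


def first_mismatch(element_ids, links_side):
    """Compare document order with the IDs used on one side of the links.

    `element_ids` lists all alignable element IDs in document order;
    `links_side` holds, per link, the list of IDs on this side.
    Returns None if the alignment covers the document without gaps,
    otherwise the first (expected, found) pair that differs.
    """
    linked = [el_id for ids in links_side for el_id in ids]
    for expected, found in zip_longest(element_ids, linked):
        if expected != found:
            return expected, found
    return None
```

An importer following this idea would reject the file as soon as `first_mismatch` returns a pair for either side.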

Release 1.5 (2011-05-25)

  • database structure modified! (apply 'update-1.5.sql' script)
  • changed/added files: lib_intertext.php, aligner.php, cli/import, cli/align, css/intertext.css, icons/swap.png
  • (optional changes: help.php, help_cs.php, settings.php)

Features:

  • English translation of the "user guide" (help.php), the Czech text is in 'help_cs.php'
  • imported texts remember their original filenames => imported alignment can identify the texts by their original filenames and not just the internal 'text name' and 'version name'
  • swap sides/versions in alignment (for admin)
  • import with 'cli/align' in the reverse direction
  • the supervisor (responsible) is now also allowed to edit alignment all the time (settings.php)

Fixes:

  • searching bookmarks searched in all alignments of the given text (lib_intertext)

Release 1.4 (2010-12-07)

  • initial public release

InterText editor

This is an off-line, standalone application written in C++ using the Qt toolkit. It runs on any platform supported by Qt (Linux, Mac OS X, MS Windows, etc.). The application is still in development, but a preview that already implements features not available in the server edition is available for testing and comments.

Features:

  • import and export of ready-made alignments
  • creating new alignments from plain text or XML files (segmented or unsegmented) [NOT AVAILABLE IN THE SERVER EDITION!]
  • a simple integrated sentence splitter (fully configurable, based on regular expressions) [NOT AVAILABLE IN THE SERVER EDITION!]
  • integration of "hunalign" automatic aligner (download and install separately from http://mokk.bme.hu/resources/hunalign/)
  • possibility to (re)align any part of the alignment with the automatic aligner at any time
  • keeps its own local repository of alignments (per user)
  • full editing possibilities of the alignment and the element contents
  • splitting and merging of aligned elements
  • automatic detection of numbering scheme [NOT AVAILABLE IN THE SERVER EDITION!] and automatic renumbering
  • splitting and merging of parent containers (paragraphs) [NOT AVAILABLE IN THE SERVER EDITION, YET!]
  • full undo/redo [NOT AVAILABLE IN THE SERVER EDITION!]
  • synchronization of multiple alignments of one text (needs more testing, failures can be destructive!)
  • full Search & Replace functionality [NOT AVAILABLE IN THE SERVER EDITION!] (including: regular expressions with backreferences, find all (highlighting), replace all, search for element IDs, bookmarks, empty segments and non-trivial segments)
  • configuration of colors and possibility to turn off highlighting of non-1:1 alignments and bookmarks
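As a rough idea of what a regular-expression-based sentence splitter does, consider the following Python sketch. It does not reproduce the editor's actual rules or configuration format (the editor is a C++/Qt application); the boundary pattern and the abbreviation list are illustrative assumptions only, and the pattern recognises ASCII capitals only.

```python
import re

# Illustrative rule: break after . ! or ? when followed by whitespace
# and an uppercase ASCII letter, a quote or an opening parenthesis.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'(])")

# Hypothetical exception list: no break right after these tokens.
ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.", "e.g.", "i.e.")


def split_sentences(text):
    """Split text at BOUNDARY, re-joining pieces that end in an abbreviation."""
    sentences = []
    buffer = ""
    for piece in BOUNDARY.split(text.strip()):
        # Pieces are re-joined with a single space (original spacing is lost).
        buffer = f"{buffer} {piece}" if buffer else piece
        if buffer.endswith(ABBREVIATIONS):
            continue  # false boundary, keep collecting
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```

With this rule set, `split_sentences("Dr. Smith arrived. He sat down! Why?")` keeps "Dr. Smith arrived." together and returns three sentences.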

Short help / user-guide

Unlike the server edition, the application has no buttons. It is supposed to be controlled by the menu and keyboard only. The keyboard shortcuts should be shown next to the menu items, but they are listed here as well (just to be sure everything is clear):

  • arrow-keys: moving between the cells of the table
  • Enter: move text (one side) one segment down
  • Backspace: move text (one side) one segment up (i.e. merge the current segment's contents with the previous one and move the rest of the text upwards)
  • Ctrl/cmd+Enter: move both texts one position down (i.e. insert an empty segment before the current one)
  • Ctrl/cmd+Backspace: move both texts one position up (i.e. merge the whole current segment with the previous one)
  • Ctrl/cmd+Up: shift the first element (sentence) from the current segment (one side) up into the previous segment
  • Ctrl/cmd+Down: pop the last element (sentence) from the current segment (one side) down into the next segment
  • M: toggle bookmark
  • S: toggle status
  • E: start editing the current element text; splitting elements is possible in the same way as in the server edition: by inserting empty lines into the edited text, element breaks will be created automatically
  • Alt+Backspace: merge element with the previous one (only available with elements within one segment, like in the server edition!)
  • Ctrl/cmd+P: insert new parent (paragraph) break (i.e. the current element (sentence) will start a new paragraph); i.e. split the current paragraph at the current sentence
  • Ctrl/cmd+D: delete a paragraph break (available only for elements (sentences) starting a new paragraph); i.e. the current paragraph will be merged with the previous one (all of its attributes will be lost and cannot be recovered even by the Undo-function!)

Important difference from the server edition: If there are several elements (sentences) in one segment, a list selector must first be opened by pressing the editing key "E" on the whole segment. Then, one particular element (sentence) from the segment can be chosen for editing, splitting, merging or creating or deleting a paragraph break. The application should automatically allow the user to apply only the operations allowed with the selected element.
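The rule that empty lines in the edited text create element breaks can be pictured with a tiny Python sketch. This is an assumption about the behaviour, not the editor's code: lines containing only whitespace are treated as empty, and surrounding whitespace is trimmed from each resulting element.

```python
import re


def split_on_empty_lines(edited_text):
    """Return one text per element, breaking at every run of empty lines."""
    parts = re.split(r"\n\s*\n", edited_text)
    return [part.strip() for part in parts if part.strip()]
```

A text edited into "First sentence.", an empty line, and "Second sentence." would thus become two elements.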

Changelog

Beta2 0.10 (2012-01-24)

This release includes a partial, non-working implementation of synchronization with (an unreleased version of) InterText server. There is no point in experimenting with it, yet. ;-) But it should fix some strange crashes you may have experienced otherwise.

Bugfixes:

  • crash on creating alignments from XML documents without ID attributes
  • several strange sudden crashes while working with the alignment

Beta1 0.9 (2011-11-18)

Features:

  • creating new alignments from plain text or XML files (segmented or unsegmented)
  • a simple integrated sentence splitter (fully configurable, based on regular expressions)
  • integration of "hunalign" automatic aligner (download and install separately from http://mokk.bme.hu/resources/hunalign/)
  • possibility to (re)align any part of the alignment with the automatic aligner at any time
  • configuration of colors and possibility to turn off highlighting of non-1:1 alignments and bookmarks
  • simple alignment manager (finally you can also delete them! ;-))

Bugfixes:

  • various bugfixes

Preview 0.8 (2011-10-18)

Features:

  • Search & Replace functions
  • HTML view / rendering of contents (can be turned off)
  • Customization of the default text font

Bugfixes:

  • UTF-8 handling is now resistant to the locale settings (e.g. previously corrupt encoding when saving files in Windows)
  • auto-(de)activation of editing functions ignored changing alignment (e.g. by undo/redo)
  • various other minor visual and functional improvements and optimizations…

Preview 0.7 (2011-08-01)

  • initial public release

Roadmap (may change!)

Release 1.0

  • Synchronization of alignments and texts with the InterText server (version 2.0 of the server will have to be developed as well)
  • help / user guide

Download

The application should be identical on all platforms, but small visual differences may appear, or even differences in behavior. Let me know in case you find any inconvenience.

  • MacOS X [432kB] - dynamically linked application, compiled on MacOS X 10.7 (Lion); requires installation of the Qt 4.8 toolkit [178MB]
  • Linux (64bit) [891kB] - dynamically linked application, compiled on Ubuntu 11.10; requires installation of the Qt 4.7 libraries
  • Windows (32bit) [9.9MB] - statically linked Win32 executable, compiled on MS Windows XP SP3; should have no requirements

Please let me know if you have problems running the application on your system. It should run on any recent version of these operating systems (except 32-bit Linux systems and MacOS X systems older than 10.6, Snow Leopard), but I have not tested them. (In particular, the Windows version has not been tested much.)

License:

This software is licensed under the GNU General Public License v3.
http://www.gnu.org/licenses/gpl-3.0.html

Copyright (c) 2010 Pavel Vondřička <Pavel.Vondricka@ff.cuni.cz>