InterText

- parallel text alignment editor

InterText is an editor for aligned parallel texts. It was developed for the InterCorp project to edit and manage alignments of multiple parallel language versions of texts at the sentence level, but it is designed with flexibility in mind and supports custom XML documents and the Unicode character set.

There are two completely different applications called InterText: InterText Server and InterText Editor. InterText Server is a server application with a web-based interface for large projects, where a team of editors and supervisors is involved in the creation of large parallel corpora. You need some server administration experience to install and configure it. InterText Editor is a desktop application for personal use, but it can also be used as an off-line editor for InterText servers. It should be very easy to install and use, even for ordinary users. It also offers many more features and possibilities than InterText Server.

References

If you find InterText useful for your research, you may also refer to this article published in the LREC 2014 Proceedings (BibTeX record):

Vondřička, Pavel (2014): "Aligning parallel texts with InterText". In: Calzolari, N. et al. (eds.): Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). European Language Resources Association (ELRA). pp. 1875-1879.

Support

Both applications are now freely available on GitHub, which also serves as a platform for support: bug reports, issues, complaints, feature suggestions, etc.

If you have any problems or experience issues with InterText, do not hesitate to contact the author at <Pavel.Vondricka@ff.cuni.cz>. I am always happy to receive feedback and provide support, even if you try using InterText for some rather unexpected purpose. You may also submit your wishes and possibly get new features implemented "for free" if my capacity allows!

InterText Server

The software is written in PHP and uses a MySQL database as its back end.

Features:

  • can manage any number of texts
  • can manage any number of text (language) versions for each text
  • import and export of any valid XML document (see LIMITATIONS & KNOWN ISSUES!)
  • support for Unicode (UTF-8) by default
  • automatic conversion of custom entities into UTF-8 characters on import
  • arbitrary alignments between any pair of (language) versions of the same text
  • one-level alignment; every text version can define its own XML elements containing the text to be aligned
  • integration of the 'hunalign' and 'TCA2' automatic aligners
  • import and export of alignments in TEI XML format (stand-off alignment, no conversion, see below for details)
  • optional export of documents with 'corresp' attributes on aligned elements
  • optional export of documents with text segments enclosed in <seg> elements (for ParaConc compatibility)
  • possibility to edit text on-the-fly when editing alignments (can be forbidden on per-text basis)
  • keeps history of all changes to the text for later revision
  • possibility to change segmentation of elements (e.g. sentences) by splitting or merging them in the alignment editor (can be forbidden on per-text basis)
  • possibility to split or merge container (parent) elements (e.g. paragraphs)
  • separate possibility to prevent the change of segmentation (structure) for 'pivot' text versions
  • automatic (one- or two-level) re-numbering of text elements after change in segmentation (structure)
  • possibility to set bookmarks in the alignment and jump quickly between them
  • possibility to search the texts for substrings, by full-text search and by regular expressions (as limited by the capabilities of the MySQL engine), and to search for "suspicious" alignments, edited/changed elements, etc.
  • basic workflow management based on a three-level user hierarchy (can use an external database of users) and a three-(four)-level status of alignments
  • command-line access to the import and export functions for batch-processing
  • triggers for external scripts on the change of alignment status
  • synchronization of texts and alignments with external InterText editor clients
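
To illustrate what a stand-off alignment means in practice, here is a minimal Python sketch that reads links stored separately from the aligned documents. The file layout is hypothetical: the element and attribute names (linkGrp, link, xtargets, type) and the element IDs are assumptions made for this example and may not match the files InterText actually produces, so check the technical documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off alignment: each link lists the IDs of the aligned
# elements of both text versions, separated by a semicolon. All names and
# IDs here are assumptions for illustration only.
SAMPLE = """<linkGrp fromDoc="text.cs.xml" toDoc="text.en.xml">
  <link type="1-1" xtargets="1:1;1:1"/>
  <link type="2-1" xtargets="1:2 1:3;1:2"/>
  <link type="0-1" xtargets=";1:3"/>
</linkGrp>"""


def read_links(xml_text):
    """Return one (left_ids, right_ids) tuple per link element."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter("link"):
        # Split the two sides at the first semicolon; an empty side
        # yields an empty list (an unaligned segment).
        left, _, right = link.get("xtargets", "").partition(";")
        links.append((left.split(), right.split()))
    return links


for left, right in read_links(SAMPLE):
    print(f"{len(left)}-{len(right)}: {left} <-> {right}")
```

Because the links only refer to element IDs, the documents themselves stay untouched, which is why an alignment of this kind can be imported and exported without converting the texts.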

Download

Use GitHub to obtain the latest code.

Documentation:

  • Technical details and limitations
  • INSTALL - installation instructions, set-up, customization and other administration-related information
  • help.php - user-manual, including details on the functions and principles of the system
  • UPDATE.txt - instructions for update
  • ChangeLog

InterText Editor

This is an off-line, standalone application written in C++ using the Qt toolkit. It runs on any platform supported by Qt (Linux, Mac OS X, MS Windows, etc.).

Features:

  • import and export of ready-made alignments
  • support for cross-order alignments
  • synchronization of texts and alignments with InterText server, possibility to align text versions from server with local files
  • creating new alignments from plain text or XML files (segmented or unsegmented)
  • creating new alignments with empty target ("write your own translation on-the-fly")
  • import of new-line aligned texts
  • a simple integrated sentence splitter (fully configurable, based on regular expressions)
  • integration of "hunalign" automatic aligner
  • possibility to (re)align any part of the alignment with the automatic aligner at any time
  • keeps its own local repository of alignments (per user)
  • full editing possibilities of the alignment and the element contents
  • splitting and merging of aligned elements
  • automatic detection of numbering scheme and automatic renumbering
  • splitting and merging of parent containers (paragraphs)
  • full undo/redo
  • synchronization of multiple alignments of one text (needs more testing, failures can be destructive!)
  • full Search & Replace functionality (including: regular expressions with backreferences, find all (highlighting), replace all, search for element IDs, bookmarks, empty segments and non-trivial segments)
  • fully configurable custom export of text contents (pre-defined profiles for new-line aligned texts, ParaConc, TMX)
  • configuration of colors and possibility to turn off highlighting of non-1:1 alignments and bookmarks
  • configurable transformations for visualization of complex or non-HTML marked text contents
  • runs on Linux, Mac OS X, MS Windows
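
As a rough idea of how a sentence splitter based on regular expressions can work, the following Python sketch splits text at a single configurable boundary pattern. It is not the splitter built into InterText Editor: the rule and the names BOUNDARY and split_sentences are assumptions made for this illustration.

```python
import re

# Assumed boundary rule: sentence-final punctuation, then whitespace,
# then an uppercase ASCII letter starting the next sentence.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text, boundary=BOUNDARY):
    """Split text at every match of the boundary pattern."""
    return [part.strip() for part in boundary.split(text) if part.strip()]


for sentence in split_sentences("It works. Does it? Yes!"):
    print(sentence)
```

A rule this simple also splits after abbreviations such as "Mr." and ignores non-ASCII capitals, which is one reason a splitter of this kind needs to be configurable per language.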

Documentation

Download

Binary distribution packages

This version has its own installer and PackageManager and also contains the documentation and hunalign. It should have no special dependencies (in case of trouble, please check the documentation above). If upgrading, have a look at the notes in the ChangeLog.

  • Mac OS X Installer: requires Mac OS X 10.8 or newer (not signed; if using Mac OS X 10.9 Mavericks or newer, you may need to enable running applications "from any source" in your security settings)
  • Linux (64-bit) Installer (set the executable flag to run it; in case of problems, check the documentation for possible dependencies)
  • MS Windows Installer

Source code

The source code is available on GitHub.

Acknowledgement:

This software and documentation resulted from the implementation of the Czech National Corpus project (LM2011023) funded by the Ministry of Education, Youth and Sports within the framework of Large Research, Development and Innovation Infrastructures.

License:

This software is licensed under the GNU General Public License v3.
http://www.gnu.org/licenses/gpl-3.0.html

Copyright (c) 2010-2017 Pavel Vondřička <Pavel.Vondricka@ff.cuni.cz>
Copyright (c) 2010-2017 Charles University, Faculty of Arts, Institute of the Czech National Corpus