Proper name translation: translation alignment
Translation alignment of Le tour du monde en quatre-vingts jours (Jules Verne, 1872)

Laboratoire ligérien de linguistique, université d'Orléans et de Tours
Laboratoire d'informatique de l'Université François-Rabelais de Tours
Faculté de mathématiques de l'Université de Belgrade

Corpus presentation

The corpus was created as part of a contrastive analysis of proper names in translation (Lecuit, 2013). It is therefore built around a source text, Le tour du monde en quatre-vingts jours (Jules Verne, 1872), in which the proper names (as well as the relational adjectives and nouns) have been annotated using the CasSys tool and the CasEN transducers developed by the computer science laboratory of the University of Tours (LI) (Friburger & Maurel, 2004).

The tags used to annotate the text are taken from the list of tags published by the Text Encoding Initiative Consortium (TEI P5). The following elements in the source text (in French) are annotated:

The corpus also comprises three target texts: translations of the novel into English, German and Serbian (Latin alphabet).

The corpus also comes with alignment files, which were created with the multilingual automatic aligner XAlign (developed by the Loria laboratory and integrated into the Unitex platform) and then corrected manually.

These files, which can be used with Unitex, make it possible to display bi-texts in a window divided into two parts, each showing one of the two versions of the same text, horizontally aligned at the level of translation units (translation equivalents).
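As a rough illustration of how an alignment file can pair translation units, the sketch below reads link elements from a small invented fragment. The layout is an assumption (a `link` element whose `target` or `targets` attribute lists whitespace-separated `document#id` pointers), as are the file names and identifiers; the files distributed with the corpus may be organised differently, so the attribute names should be checked against the actual data.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment fragment: element names, attribute names,
# file names and identifiers are all assumptions made for illustration.
SAMPLE = """<linkGrp type="alignment">
  <link target="fr.xml#s1 en.xml#s1"/>
  <link target="fr.xml#s2 en.xml#s2 en.xml#s3"/>
</linkGrp>"""


def read_links(xml_text, source_doc):
    """Return one (source_ids, target_ids) pair per link element."""
    pairs = []
    for el in ET.fromstring(xml_text).iter():
        # Ignore any XML namespace and keep only link elements.
        if el.tag.rsplit("}", 1)[-1] != "link":
            continue
        pointers = (el.get("target") or el.get("targets") or "").split()
        prefix = source_doc + "#"
        # partition() keeps the part after '#', or '' if there is no '#'.
        src = [p.partition("#")[2] for p in pointers if p.startswith(prefix)]
        tgt = [p.partition("#")[2] for p in pointers if not p.startswith(prefix)]
        pairs.append((src, tgt))
    return pairs


for src, tgt in read_links(SAMPLE, "fr.xml"):
    print(" + ".join(src), "<->", " + ".join(tgt))
```

Each pair can hold several identifiers on either side, which is how a one-to-two alignment (one French unit rendered by two English units) would be represented under this assumed layout.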

References

Origin of the resources

Nature of the data

Corpus, annotated (for the French part only) and aligned, original novel and royalty-free translations.

Origin of the data

Conditions of use

The corpus is covered by the Creative Commons CC-BY-NC-SA and LGPL-LR licenses.

Use

The corpus is made up of five different kinds of files.

  1. A PDF file comprising the four aligned languages
    • Corpus80Jours.pdf
  2. An XML file containing the text of the novel, annotated as mentioned above, but in which the angle brackets of the name and w tags have been replaced by their HTML equivalents so that the document can be loaded into XAlign
    • Corpus80JoursFrench_Xalign.xml
  3. An XML file containing the text of the novel, annotated as mentioned above, for use without XAlign
    • Corpus80JoursFrench.xml
  4. Three XML files, each containing one translation of the novel, in English, German and Serbian respectively:
    • Corpus80JoursEnglish.xml
    • Corpus80JoursGerman.xml
    • Corpus80JoursSerbian.xml
  5. Three XML files containing the bi-text alignments:
    • Corpus80JoursFrenchEnglish.xml
    • Corpus80JoursFrenchGerman.xml
    • Corpus80JoursFrenchSerbian.xml
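The difference between the two French files (items 2 and 3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "HTML equivalents" is taken to mean the entities `&lt;` and `&gt;`, attribute quotes are assumed to be left unescaped, and the sample sentence and its `type` values are invented rather than copied from the corpus. The helper names `restore_tags` and `count_annotations` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: escaped tags look like &lt;name ...&gt; and &lt;/name&gt;.
ESCAPED_TAG = re.compile(r"&lt;(/?(?:name|w)\b[^&]*?)&gt;")


def restore_tags(text):
    """Turn escaped name/w tags back into real XML tags."""
    return ESCAPED_TAG.sub(r"<\1>", text)


def local_name(tag):
    """Drop a namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def count_annotations(xml_text):
    """Count name and w elements, whatever namespace they are in."""
    root = ET.fromstring(xml_text)
    return Counter(
        local_name(el.tag)
        for el in root.iter()
        if local_name(el.tag) in {"name", "w"}
    )


# Invented sample, not taken from the corpus.
escaped = (
    '<p>&lt;name type="person"&gt;Phileas Fogg&lt;/name&gt; quitta '
    '&lt;name type="place"&gt;Londres&lt;/name&gt;.</p>'
)
restored = restore_tags(escaped)
print(restored)
print(count_annotations(restored))
```

For a file on disk, `ET.parse("Corpus80JoursFrench.xml").getroot()` could replace `ET.fromstring`, after which the same counting logic applies.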

The alignments can be used with Unitex. To do so, first save the files in the Unitex directory, as follows:

Technical sheet

Version: 1.1
Design: Émeline Lecuit, Denis Maurel and Duško Vitas
Format: XML-TEI
Character encoding: UTF-8 (without BOM)
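Since the files are stated to be UTF-8 without a byte order mark, a quick check after downloading or re-saving them can be sketched as follows. The function names are hypothetical; only standard-library calls are used.

```python
import codecs


def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def decode_corpus_bytes(data):
    """Decode UTF-8 bytes, silently dropping a BOM if an editor added one."""
    return data.decode("utf-8-sig")


# In-memory examples; for a real file, read it with open(path, "rb").read().
clean = "Émeline".encode("utf-8")
with_bom = codecs.BOM_UTF8 + clean
print(has_utf8_bom(clean), has_utf8_bom(with_bom))
print(decode_corpus_bytes(with_bom) == decode_corpus_bytes(clean))
```

Decoding with `utf-8-sig` is a tolerant choice: it accepts files with or without the mark, which helps if a text editor has added one when saving.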

Download

To download the PDF file you have to accept the Creative Commons CC-BY-NC-SA license.

Licence CC-BY-NC-SA

Click here: Download the 80 jours PDF corpus (11/1/2016).

To download the PDF file you have to accept the LGPL-LR license.

Click here: (11/1/2016).