Proper names translation: translation alignment
Translation alignment of Le tour du monde en quatre-vingts jours (Jules Verne, 1872)
The corpus was created as part of a contrastive analysis of proper names in translation (Lecuit, 2013). It consists of a source text, Le tour du monde en quatre-vingts jours (Jules Verne, 1872), in which the proper names (as well as the relational adjectives and nouns) have been annotated with the CasSys tool and the CasEN transducers developed by the computer science laboratory of the University of Tours (LI) (Friburger & Maurel, 2004).
The tags used to annotate the text are taken from the list of tags published by the Text Encoding Initiative Consortium (TEI P5). The following elements are annotated in the source text (in French):
- Proper names (3342 items):
- <name type="person">[human]</name> (1856 items)
- <name type="animal">[animal]</name> (8 items)
- <name type="org">[organization]</name> (115 items)
- <name type="geographical">[natural geographical location]</name> (201 items)
- <name type="oronym">[traffic artery]</name> (63 items)
- <name type="building">[human construction]</name> (68 items)
- <name type="place">[administrative area, town]</name> (836 items)
- <name type="object">[product]</name> (5 items)
- <name type="vessel">[vessel]</name> (159 items)
- <name type="title" level="j">[newspaper]</name> (23 items)
- <name type="date">[historical period]</name> (3 items)
- <name type="event">[historical event]</name> (5 items)
- Relational noun: <w type="relational noun">[relational noun]</w> (197 items)
- Relational adjective: <w type="relational adjective">[relational adjective]</w> (161 items)
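As an illustration of how these annotations can be exploited, the following minimal Python sketch counts annotated elements by tag and type. It assumes that the annotated file is well-formed XML and that the name and w elements carry no XML namespace; the sample string is hypothetical and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample in the annotation style described above
# (not an actual extract from the corpus).
SAMPLE = (
    '<text><p><name type="person">Phileas Fogg</name> quitta '
    '<name type="place">Londres</name> avec '
    '<name type="person">Passepartout</name>, son domestique '
    '<w type="relational adjective">français</w>.</p></text>'
)


def count_annotations(xml_string):
    """Count <name> and <w> elements, grouped by (tag, type attribute)."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for elem in root.iter():
        # Assumption: no namespace prefix on the tags. With a TEI namespace,
        # elem.tag would look like "{namespace-uri}name" instead.
        if elem.tag in ("name", "w"):
            counts[(elem.tag, elem.get("type"))] += 1
    return counts


counts = count_annotations(SAMPLE)
for (tag, type_), n in sorted(counts.items()):
    print(f"{tag} [{type_}]: {n}")
```

For a real file, the string can be read from disk first, or ET.parse can be used in place of ET.fromstring.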
The corpus also comprises three target texts: translations of the novel into English, German and Serbian (Latin alphabet).
The corpus also comes with alignment files, which were created with the multilingual automatic aligner XAlign (developed by the Loria laboratory and integrated into the Unitex platform) and then corrected manually.
These files, which can be used with Unitex, allow bi-texts to be visualized in a window divided into two parts, each showing one of the two versions of the same text, aligned horizontally at the level of translation units (translation equivalents).
- About proper names and translation:
- Lecuit É., Maurel D., Vitas D. (2011). Les noms propres se traduisent-ils ? Étude d’un corpus multilingue. Corpus, 10:201-218.
- Lecuit É. (2012). Les tribulations d'un nom propre en traduction. Étude contrastive du nom propre et de sa traduction à partir d’un corpus aligné de dix langues européennes. Thèse de doctorat de linguistique, Université François-Rabelais de Tours.
- Lecuit É., Maurel D., Vitas D. (2015). A Multilingual Corpus for the Study of Toponyms in Translation. In Schnabel-Le Corre B., Löfström J. (eds.), Challenges in Synchronic Toponymy: Structure, Context and Use. A. Francke Verlag. 235-246.
- About transducer cascades:
Origin of the resources
- Unitex (IGM, Université Paris-Est Marne-la-Vallée, Paumier, 2011)
- CasSys and CasEN (LI, Friburger and Maurel)
- XAlign (Loria, UMR 7503)
Nature of the data
Annotated (French part only) and aligned corpus, made up of the original novel and royalty-free translations.
Origin of the data
- Source text before annotation:
- (French) Le Tour du monde en 80 jours, Jules Verne (1872): http://abu.cnam.fr/
- Target texts:
Conditions of use
The corpus is covered by the Creative Commons CC-BY-NC-SA and LGPL-LR licenses.
The corpus is made up of five different kinds of files.
- A PDF file comprising the four aligned languages
- An XML file containing the text of the novel, annotated as mentioned above, but in which the angle brackets of the name and w tags have been replaced by their HTML equivalents so that the document can be loaded into XAlign
- An XML file containing the text of the novel, annotated as mentioned above, for use without XAlign
- Three XML files, each containing one translation of the novel, in English, German and Serbian respectively:
- Three XML files containing the bi-text alignments:
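The relationship between the two French XML files can be sketched in Python as follows. This is only an illustration, assuming that "HTML equivalents" means the entities &lt; and &gt; and that only the name and w tags are affected; the function names and the sample sentence are hypothetical and do not come from the corpus or from XAlign.

```python
import re

# Raw <name ...>, </name>, <w ...> and </w> tags (other tags are left alone).
RAW_TAG = re.compile(r"<(/?(?:name|w)\b[^<>]*)>")
# The same tags once their angle brackets have been turned into entities.
ESCAPED_TAG = re.compile(r"&lt;(/?(?:name|w)\b[^&]*?)&gt;")


def escape_annotation_tags(text):
    """Replace the angle brackets of name and w tags with &lt; and &gt;."""
    return RAW_TAG.sub(r"&lt;\1&gt;", text)


def restore_annotation_tags(text):
    """Turn escaped name and w tags back into real XML tags."""
    return ESCAPED_TAG.sub(r"<\1>", text)


# Hypothetical sample sentence, not an actual extract from the corpus.
raw = (
    '<p><name type="person">Phileas Fogg</name> est '
    '<w type="relational adjective">anglais</w>.</p>'
)
escaped = escape_annotation_tags(raw)
restored = restore_annotation_tags(escaped)

print(escaped)
print(restored == raw)
```

In the escaped version the annotations are plain character data, so a tool that expects its own markup structure sees only the surrounding elements, here the p element.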
The alignments can be used with Unitex. To do so, first save the files in the Unitex directory as follows:
- For the four text files, respectively: Unitex/English/Corpus/Corpus80JoursEnglish.xml, Unitex/German/Corpus/Corpus80JoursGerman.xml, Unitex/French/Corpus/Corpus80JoursFrenchXalign.xml and Unitex/Serbian/Corpus/Corpus80JoursSerbian.xml.
- For the three alignment files: the Unitex/Xalign directory.
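The layout above can be reproduced with a short script such as the following sketch. The function name install_corpus and its parameters are hypothetical, and the names of the alignment files are not given here, so they have to be passed in explicitly; the four text file names and target directories are those listed above.

```python
import shutil
from pathlib import Path

# Target sub-directory (relative to the Unitex directory) for each text file.
CORPUS_FILES = {
    "Corpus80JoursEnglish.xml": "English/Corpus",
    "Corpus80JoursGerman.xml": "German/Corpus",
    "Corpus80JoursFrenchXalign.xml": "French/Corpus",
    "Corpus80JoursSerbian.xml": "Serbian/Corpus",
}


def install_corpus(download_dir, unitex_dir, alignment_files=()):
    """Copy the downloaded files into the expected Unitex sub-directories.

    alignment_files: names of the alignment files (an assumption of this
    sketch: they must be supplied by the caller).
    Returns the list of paths that were actually written.
    """
    download_dir, unitex_dir = Path(download_dir), Path(unitex_dir)
    jobs = list(CORPUS_FILES.items())
    jobs += [(name, "Xalign") for name in alignment_files]
    copied = []
    for name, subdir in jobs:
        source = download_dir / name
        if not source.exists():
            continue  # skip files that have not been downloaded
        target_dir = unitex_dir / subdir
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(source, target_dir / name)))
    return copied
```

A call such as install_corpus("downloads", "Unitex", ["alignment.xml"]) would copy whichever of these files exist in the downloads folder; both folder names and the alignment file name are placeholders.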
Authors: Émeline Lecuit, Denis Maurel and Duško Vitas
Encoding: utf-8 (without BOM)
To download the PDF file you have to accept the Creative Commons CC-BY-NC-SA license.
Click here: Download the 80 jours PDF corpus (11/1/2016).
To download the PDF file you have to accept the LGPL-LR license.
Click here: (11/1/2016).