Transducer Cascade CasEN
for Named Entity Recognition

Laboratoire d'informatique fondamentale et appliquée de l'université de Tours


Description of CasEN

CasEN is made available on the Unitex platform as part of the projects ANR Variling, FEDER Région Centre Entités nommées et nommables, Ortolang and Istex.

The CasEN cascade recognises named entities using lexical resources and local pattern descriptions: transducers that act on the text through insertions, replacements or deletions. These actions may be iterative. Transducers can also be applied "on the fly" to a particular text, based on the results of previous transducers. The Unitex platform makes these transducers easy to create and maintain by presenting them to the user as graphs. The aim of a cascade is to make use of patterns that have already been identified or, on the contrary, to avoid tagging a pattern that has already been recognised. The order in which the transducers are applied is therefore an important parameter.

The graphs subsequently call subgraphs that are:

Graphs can be constructed automatically for the text under consideration from generic graphs. These graphs make it possible to retrieve an entity that has no local context, provided that the entity has been identified elsewhere in the text by one of the previous graphs.

An Example of Tagging with CasEN

The sentence:

Prince John, in the meanwhile, occupied his castle, and disposed of his domains without scruple;

extracted from the corpus distributed by Unitex (Ivanhoe, by Sir Walter Scott) is transformed by:

to give (file ivanhoe_snt.raw):

Prince {\{John\,\.name\+last\+grftagLastName\},.entity+pers+ind+grfpersProfession}, in the meanwhile, occupied his castle, and disposed of his domains without scruple;

This format makes it possible to display concordances, but it is hard to read. A second output file is therefore available in the XML-CasSys format (file ivanhoe_snt.txt). For this example, it contains:

Prince
<csc>
   <form>
     <csc>
        <form>John</form>
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc>
   </form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>,
in the meanwhile, occupied his castle, and disposed of his domains without scruple;

A recognised sequence is, on the one hand, tagged and, on the other hand, frozen as a polylexical (multiword) expression. This annotation can later be searched for in Unitex using more or less specific masks: for the example above, <entity>, <pers> or <ind>. To make debugging easier, the name of the graph that inserted the annotation is added, prefixed by grf (here, grfpersProfession).
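Outside Unitex, a similar lookup by code can be sketched in Python on the XML-CasSys fragment shown above. This is only an illustration under stated assumptions: the function names find_annotations and surface_form and the wrapping <root> element are hypothetical, and the sketch assumes the file content is well-formed XML once wrapped (it does not reproduce Unitex masks).

```python
import xml.etree.ElementTree as ET

# The XML-CasSys fragment from the example above.
SAMPLE = """Prince
<csc>
   <form>
     <csc>
        <form>John</form>
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc>
   </form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>,
in the meanwhile, occupied his castle, and disposed of his domains without scruple;"""


def surface_form(form):
    """Rebuild the text of a <form>, descending into nested <csc> but skipping <code>."""
    parts = [form.text or ""]
    for child in form:
        if child.tag == "csc":
            inner = child.find("form")
            if inner is not None:
                parts.append(surface_form(inner))
        parts.append(child.tail or "")
    return "".join(parts)


def find_annotations(xml_cassys_text, wanted_code):
    """Return (surface form, codes) for every <csc> element carrying wanted_code."""
    # Assumption: the fragment has no single root, so a hypothetical <root> is added.
    root = ET.fromstring("<root>" + xml_cassys_text + "</root>")
    results = []
    for csc in root.iter("csc"):
        codes = [code.text for code in csc.findall("code")]  # direct children only
        form = csc.find("form")
        if wanted_code in codes and form is not None:
            surface = " ".join(surface_form(form).split())  # normalise whitespace
            results.append((surface, codes))
    return results


print(find_annotations(SAMPLE, "pers"))
print(find_annotations(SAMPLE, "name"))
```

Searching for "pers" returns the outer annotation inserted by grfpersProfession, while searching for "name" returns the nested one inserted by grftagLastName, which mirrors how more or less specific masks select different levels of the annotation.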

If the XML-CasSys output does not match the desired annotation (which is generally the case), the file _csc.txt can be opened in Unitex and processed with a second cascade. CasEN is therefore composed of two cascades, one for analysis and one for synthesis. For our example, with the Istex synthesis cascade, the result of the second cascade is:

Prince <persName>John</persName>, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
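As a rough illustration of how this final annotation can be consumed, the following Python sketch collects the content of persName elements with a regular expression. The function name person_names is hypothetical, and the pattern assumes that persName elements carry no attributes and are not nested, which holds for the example above but is not guaranteed for every output.

```python
import re

# Assumption: <persName> has no attributes and is never nested.
PERS_NAME = re.compile(r"<persName>(.*?)</persName>")


def person_names(annotated_text):
    """Return the text of every <persName> element, in order of appearance."""
    return PERS_NAME.findall(annotated_text)


line = ("Prince <persName>John</persName>, in the meanwhile, occupied his castle, "
        "and disposed of his domains without scruple;")
print(person_names(line))
```

For real corpora, an XML parser is a safer choice than a regular expression; the pattern is used here only to keep the sketch short.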

The Order of Graphs

The cascade itself contains blocks of affirmations that can be retrieved... For example, the sentence:

He arrived on 29 February 2008.

can be analysed by two graphs of CasEN:

The graph timeAbsoluteCalendarDateYear must be applied before the other.

Sometimes the issue is not competition between graphs but complementarity. The simplest example is undoubtedly the graph of postal addresses, which contains person patterns (to identify Franklin D. Roosevelt Street): the person graphs are therefore placed before the address graph. Various organisation names also include tags of type person, such as Rockefeller Center or Lincoln Hospital. These organisations are therefore recognised after the person graphs have been applied. Hence, the order of the graphs is important.
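The effect of ordering can be illustrated with a toy cascade in Python, where each pass is a regular-expression rewrite standing in for a transducer. Everything here is an assumption made for illustration: the functions tag_persons, tag_addresses and run_cascade, the one-name lexicon and the <pers> and <addr> tags are hypothetical and are not the CasEN graphs or their output format.

```python
import re


def tag_persons(text):
    """Toy person pass: tags a single name from a made-up, one-entry lexicon."""
    return re.sub(r"\b(Franklin D\. Roosevelt)\b", r"<pers>\1</pers>", text)


def tag_addresses(text):
    """Toy address pass: relies on a <pers> tag inserted by an earlier pass."""
    return re.sub(r"(<pers>[^<]+</pers> (?:Street|Avenue))", r"<addr>\1</addr>", text)


def run_cascade(text, passes):
    """Apply each pass in order; every pass sees the output of the previous one."""
    for apply_pass in passes:
        text = apply_pass(text)
    return text


TEXT = "She lives on Franklin D. Roosevelt Street."

# Persons first: the address pass can reuse the <pers> tag.
print(run_cascade(TEXT, [tag_persons, tag_addresses]))

# Addresses first: no <pers> tag exists yet, so the address is missed.
print(run_cascade(TEXT, [tag_addresses, tag_persons]))
```

With the person pass first, the street name is wrapped in an address annotation that contains the person annotation; with the passes swapped, only the person is tagged and the address goes unrecognised.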

CasEN_Istex

Within the Istex project, CasEN has been extended for French with named entity recognition in scientific texts (as explained below). However, the project also (and mainly) deals with texts written in English. This led to the creation of a new cascade for this corpus, which can consequently be used for other English corpora.

The annotations of the analysis cascade are borrowed from the TEI, and the synthesis cascade follows the Istex annotation guide. Both cascades are available below. Their evaluation, carried out in parallel with that of the French version, will be available soon.

If you have any remarks or find any bugs, please write to casen At univ-tours Dot fr.

Download CasEN_Istex

Make sure that Unitex is up to date. You need Unitex 3.1 (stable version) or, better, 3.2 alpha (or, soon, the 3.2 stable version). When you unzip the file in your Unitex directory, the files are placed in the correct folders (English or French, CasSys, Dela and Graphs).

To download CasEN_Istex, please accept the terms of the LGPL-LR license.

The English CasEN_Istex

The download below contains:

Click here: Download CasEN_Istex_en.0.1.3 (version of July 27, 2018).

Before starting the cascade, uncheck the preprocessing and apply (Text\Apply lexical resources) the default dictionaries, the Prolex-Unitex dictionary and the dictionaries of the cascade.

The French CasEN_Istex

Similarly, CasEN exists for annotating French corpora. The TEI version can be found here. If you want the Istex version, you have to supplement the TEI version with:

Important: if you use the Unitex interface, you can only use the analysis cascade and the synthesis cascade. Preprocessing and Standoff require a script to run.

Click here: Download casen_Istex_fr_1_1 complement for casen_fr_1_1 (version of April 06, 2018).

Annotation guide of Named Entities Istex Project

To download the PDF file you have to accept the Creative Commons CC-BY license.

Licence CC-BY

Click here: Download the Annotation guide of Named Entities Istex Project PDF file (version of March 03, 2016).

How to Cite Us

Friburger N., Maurel D. (2004), Finite-state transducer cascade to extract named entities in texts, Theoretical Computer Science, vol. 313, 94-104.

Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I., Nouvel D. (2011), Cascades autour de la reconnaissance des entités nommées, TAL 52-1.