CasEN, a Cascade of Transducers
for Named Entity Recognition

Laboratoire d'informatique de l'Université François-Rabelais de Tours

Description of CasEN

CasEN is made available on the Unitex platform as part of the ANR Variling, FEDER Région Centre Entités nommées et nommables, Ortolang and Istex projects.

The CasEN cascade recognises named entities using lexical resources and local pattern descriptions: transducers that act on the text by insertion, replacement or deletion. These actions may be applied iteratively. They can also be used "on the fly" on a particular text, based on the results of previous transducers. The Unitex platform makes these transducers easy to create and maintain by presenting them to the user as graphs. The aim of a cascade is to reuse patterns that have already been identified or, on the contrary, to avoid tagging a pattern that has already been recognised. The order in which the transducers are applied is therefore an important parameter.
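The effect of ordering can be illustrated with a minimal Python sketch. It is not CasEN or Unitex code: the two regular-expression passes, the one-word lexicon and the simplified {form,.code} notation are assumptions made for illustration only. The second pass fires only on a name that the first pass has already tagged.

```python
import re

# Hypothetical one-word lexicon; CasEN itself relies on Unitex dictionaries.
KNOWN_NAMES = {"John"}


def tag_names(text):
    """First pass: wrap known names in a simplified {form,.code} tag."""
    pattern = r"\b(" + "|".join(sorted(map(re.escape, KNOWN_NAMES))) + r")\b"
    return re.sub(pattern, r"{\1,.name}", text)


def tag_persons(text):
    """Second pass: a title followed by an already tagged name is a person."""
    return re.sub(
        r"\b(Prince|King) (\{[^{}]+,\.name\})",
        r"\1 {\2,.entity+pers}",
        text,
    )


def run_cascade(text, passes):
    """Apply each pass in turn to the output of the previous one."""
    for apply_pass in passes:
        text = apply_pass(text)
    return text


sentence = "Prince John occupied his castle"
in_order = run_cascade(sentence, [tag_names, tag_persons])
reversed_order = run_cascade(sentence, [tag_persons, tag_names])
print(in_order)
print(reversed_order)
```

With the passes reversed, the person pattern finds no tagged name to build on, so only the name tag is inserted.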

The graphs subsequently call subgraphs that are:

Graphs can be constructed automatically from generic graphs for the text under consideration. These graphs make it possible to retrieve an entity that has no local context, provided the entity has been identified elsewhere in the text by one of the previous graphs.

An Example of Tagging with CasEN

The sentence:

Prince John, in the meanwhile, occupied his castle, and disposed of his domains without scruple;

extracted from the corpus distributed by Unitex (Ivanhoe, by Sir Walter Scott) is transformed by:

to give (file ivanhoe_snt.raw):

Prince {\{John\,\.name\+last\+grftagLastName\},.entity+pers+ind+grfpersProfession}, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
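The nesting in the line above can be unpacked with a short Python sketch. The escaping convention it relies on (a backslash before each special character of an embedded tag, and ",." as the separator between form and codes) is inferred from this single example and is an assumption, not a specification of the format; split_tag is a hypothetical helper.

```python
import re


def split_tag(tag):
    """Split one '{form,.codes}' tag into (form, list of codes)."""
    body = tag[1:-1]  # drop the outer braces
    # The separator is the first ',.' whose comma is not backslash-escaped.
    match = re.search(r"(?<!\\),\.", body)
    form, codes = body[:match.start()], body[match.end():]
    form = re.sub(r"\\(.)", r"\1", form)  # remove one level of escaping
    return form, codes.split("+")


raw = r"{\{John\,\.name\+last\+grftagLastName\},.entity+pers+ind+grfpersProfession}"
outer_form, outer_codes = split_tag(raw)
inner_form, inner_codes = split_tag(outer_form)
print(outer_codes)
print(inner_form, inner_codes)
```

Applying split_tag a second time to the returned form unpacks the inner tag of the raw format.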

This format allows concordances to be displayed, but it is hard to read. A second output file is therefore available in the XML-CasSys format (file ivanhoe_snt.txt). For this example, it contains:

Prince
<csc>
   <form>
     <csc>
        <form>John</form>
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc>
   </form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>,
in the meanwhile, occupied his castle, and disposed of his domains without scruple;
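As a sketch of how such a fragment might be read outside Unitex, the following Python code walks the nested csc elements with the standard library. The wrapping text root element is an assumption added so that the fragment parses as one XML document, and describe and surface are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """Prince
<csc>
   <form>
     <csc>
        <form>John</form>
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc>
   </form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>,
in the meanwhile, occupied his castle, and disposed of his domains without scruple;"""


def surface(csc):
    """Rebuild the surface text of a csc element from its form only."""
    form = csc.find("form")
    parts = [form.text or ""]
    for child in form:
        if child.tag == "csc":
            parts.append(surface(child))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def describe(csc):
    """Return the form, the codes and the nested annotations of a csc element."""
    form = csc.find("form")
    return {
        "form": surface(csc),
        "codes": [code.text for code in csc.findall("code")],
        "inner": [describe(child) for child in form.findall("csc")],
    }


# Assumption: wrap the fragment in an artificial root element before parsing.
root = ET.fromstring("<text>" + FRAGMENT + "</text>")
entities = [describe(csc) for csc in root.findall("csc")]
print(entities)
```

The codes of an embedded csc stay attached to the inner annotation, so the outer entry keeps only its own four codes.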

A recognised sequence is both tagged and frozen as a polylexical (multi-word) expression. The annotation can later be searched in Unitex with more or less specific masks: for the annotation above, <entity>, <pers> or <ind>. To make debugging easier, the name of the graph that inserted the annotation is added, prefixed by grf (here, grfpersProfession).

If the XML-CasSys output does not correspond to the desired annotation (which is generally the case), the file _csc.txt can be opened in Unitex and processed with a second cascade. CasEN is therefore composed of two cascades, one for analysis and one for synthesis. For our example, with the Istex synthesis version, the result of the second cascade is:

Prince <persName>John</persName>, in the meanwhile, occupied his castle, and disposed of his domains without scruple;

The Order of Graphs

The cascade itself contains blocks of affirmations that can be retrieved... For example, the sentence:

He arrived on 29 February 2008.

can be analysed by several graphs of CasEN:

The graph timeAbsoluteCalendarDateYear must be applied before the other.

Sometimes the issue is not one of competing occurrences but of complementarity: one graph builds on the tags inserted by another. The simplest example is undoubtedly the graph of postal addresses, which contains person patterns (to identify Franklin D. Roosevelt Drive): the graphs of persons are thus placed before the graph of addresses. Various organisation names also contain tags of type person, such as Rockefeller Center or Lincoln Hospital. These organisations are therefore recognised after the graphs of persons have been applied. Hence, the order of graphs is important.
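These ordering constraints can be expressed as a dependency table and sorted automatically. The sketch below uses the graphlib module from the Python standard library (Python 3.9 or later). The graph names persons, addresses and organisations are hypothetical labels for the families of graphs mentioned above, not the actual CasEN file names, and this is not how Unitex stores the order of a cascade.

```python
from graphlib import TopologicalSorter

# Each key lists the (hypothetical) graph families that must run before it.
depends_on = {
    "persons": set(),
    "addresses": {"persons"},      # e.g. Franklin D. Roosevelt Drive
    "organisations": {"persons"},  # e.g. Rockefeller Center, Lincoln Hospital
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Any order produced this way places persons first; the relative position of addresses and organisations is left open because the text states no constraint between them.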

CasEN, version Istex

Within the Istex project, the Quaero version of CasEN is supplemented, for French, with named entity recognition in scientific texts (as explained below). However, the project also (and mainly) deals with texts written in English. This led to the creation of a new cascade for that corpus, which can consequently be used for other English corpora.

The annotations of the analysis cascade are borrowed from the TEI, and the synthesis cascade follows the Istex annotation guide. Both cascades are available below. Their evaluation, carried out in parallel with that of the French version, will be available soon.

To send remarks or report bugs, please write to casen.bug At univ-tours.fr.

Download CasEN

Make sure that Unitex is up to date: Unitex 3.1 (stable version) or later is required. Download the version that matches your operating system (Windows or Mac/Unix). If you unzip the file in your Unitex path, the files will be placed in the correct folders. Three files for normalization and the alphabet are included.

Before starting the cascade, uncheck the preprocessing options and apply (Text\Apply lexical resources) the default dictionaries, the Prolex-Unitex dictionary and the dictionaries of the cascade.

To download CasEN, please accept the terms of the LGPL-LR license.

Version Istex (for English)

The download below contains:

Click here: Download CasEN_Istex_en.0.1.2 for Windows (version of April 29, 2016).

Click here: Download CasEN_Istex_en.0.1.2 for Mac/Unix (version of April 29, 2016).

French Version

CasEN also exists for annotating corpora in French. This can be found here.

Annotation Guide for Named Entities, Istex Project

To download the PDF file, you must accept the Creative Commons CC-BY license.

Licence CC-BY

Click here: Download the Annotation guide of Named Entities Istex Project PDF file (version of April 29, 2016).

How to Cite Us

Friburger N., Maurel D. (2004), Finite-state transducer cascade to extract named entities in texts, Theoretical Computer Science, vol. 313, 94-104.

Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I., Nouvel D. (2011), Cascades autour de la reconnaissance des entités nommées, TAL 52-1.