Surface Text Understanding for the Efficient Indexing and Information Extraction from Financial Documents


The project aimed to build a modular system integrating NLP tools in a pipeline that analyzed text to produce a semantic representation suitable for template filling in scenario-based information extraction (IE) applications. This surface semantic representation incorporated linguistic information and could also be used for efficient document indexing, which drew not only on keyword frequencies but also on complex (multilevel) processing of the user's query.
Processing was performed through a set of pipelined standalone modules, at the following levels:
  1. Part-of-speech tagging and lemmatization
  2. Recognition and classification of Named Entities
  3. Surface syntactic analysis and assignment of functional relations
  4. Coreference resolution
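The four levels listed above can be sketched as a chain of standalone functions that each read and extend a shared document object. This is only a minimal illustration of the pipelined design, not the project's implementation: the `Document` class, the stage names, the annotation keys, and the placeholder logic inside each stage are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container passed from one stage to the next."""
    text: str
    # Each stage stores its output under its own (made-up) key.
    annotations: Dict[str, list] = field(default_factory=dict)


Stage = Callable[[Document], Document]


def tag_and_lemmatize(doc: Document) -> Document:
    # Placeholder: whitespace tokens, with the lowercase form standing in for a lemma.
    doc.annotations["tokens"] = [(tok, tok.lower()) for tok in doc.text.split()]
    return doc


def recognize_entities(doc: Document) -> Document:
    # Placeholder: treat capitalized tokens as candidate names.
    doc.annotations["entities"] = [
        tok for tok, _ in doc.annotations["tokens"] if tok[:1].isupper()
    ]
    return doc


def parse_surface_syntax(doc: Document) -> Document:
    # Placeholder: a real stage would label phrases, clauses, and functional relations.
    doc.annotations["phrases"] = []
    return doc


def resolve_coreference(doc: Document) -> Document:
    # Placeholder: a real stage would link co-referring markables into chains.
    doc.annotations["chains"] = []
    return doc


# The stage order mirrors the four processing levels in the list above.
PIPELINE: List[Stage] = [
    tag_and_lemmatize,
    recognize_entities,
    parse_surface_syntax,
    resolve_coreference,
]


def run_pipeline(text: str, stages: List[Stage] = PIPELINE) -> Document:
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)  # each stage sees everything produced before it
    return doc
```

Because every stage has the same signature, a single module can be replaced or run on its own without touching the others, which is the practical benefit of a pipeline of standalone modules.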
First, part-of-speech (POS) tagging and lemmatization were performed. We used a PAROLE-compatible tagset with ~670 different part-of-speech tags, which captured the morphosyntactic particularities of the Greek language.
Next came the recognition and classification of Named Entities. Names (persons, organizations, locations, etc.) are the heads that can fill the thematic roles of the events described in a text. The project developed a novel system that recognized and classified Named Entities of the following types:
  • companies and organizations,
  • persons,
  • locations,
  • date and time expressions,
  • monetary and percentage expressions.
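To make the last category concrete, the sketch below finds monetary and percentage expressions with two regular expressions. It is a toy illustration only: the source does not say how the project's recognizer works, and the patterns, the English example text, the currency markers, and the `find_numeric_entities` name are all assumptions chosen for the example.

```python
import re
from typing import List, Tuple

# Hypothetical patterns; a real recognizer would cover far more surface forms.
PERCENT = re.compile(r"\d+(?:[.,]\d+)?\s?%")
MONEY = re.compile(
    r"(?:€|\$|EUR|USD)\s?\d+(?:[.,]\d+)*(?:\s(?:million|billion))?"
)


def find_numeric_entities(text: str) -> List[Tuple[str, str]]:
    """Return (type, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in (("PERCENT", PERCENT), ("MONEY", MONEY)):
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    hits.sort()  # order by start offset
    return [(label, span) for _, label, span in hits]


entities = find_numeric_entities("Revenue rose 12.5% to EUR 3.4 million.")
```

Names of companies, persons, and locations generally cannot be captured by surface patterns like these and need lexical and contextual evidence, which is why the example is limited to the numeric types.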
Surface syntactic analysis was performed according to EAGLES specifications. The analysis at this level aimed at recognizing and classifying clauses and Greek phrase structure. The following constituents were labelled: noun phrases, adjective phrases, prepositional phrases, adverb phrases, and verb phrases, as well as their heads and modifiers. Once all phrase and clause labels had been unambiguously assigned, overt and null subjects, as well as objects, were recognized by means of a PAROLE-compatible lexicon of subcategorization frames.
The last stage in the pipeline was a coreference resolution system dealing with proper names, noun phrases, and pronouns. Resolution was performed in two steps: in the first, markables were identified and characterized; in the second, markables that co-refer were linked together. Coreference resolution was based on the evaluation of grammatical and syntactic features, carried out by a sub-system coupled with rules.
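The two-step scheme can be illustrated with a small sketch in which markables already carry grammatical features (step one) and a single rule links each pronoun to the nearest preceding non-pronoun that agrees with it (step two). The `Markable` fields, the agreement rule, and the example mentions are assumptions made for illustration; the source does not specify the project's actual features or rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Markable:
    """Step one output: a mention characterized by hypothetical features."""
    text: str
    kind: str    # "name", "np", or "pronoun"
    gender: str  # "m", "f", or "n"
    number: str  # "sg" or "pl"


def agree(a: Markable, b: Markable) -> bool:
    # Illustrative rule: candidates must match in gender and number.
    return a.gender == b.gender and a.number == b.number


def link_markables(markables: List[Markable]) -> List[Optional[int]]:
    """Step two: for each markable, return the index of its antecedent, or None."""
    links: List[Optional[int]] = []
    for i, current in enumerate(markables):
        antecedent = None
        if current.kind == "pronoun":
            # Search backwards for the nearest agreeing name or noun phrase.
            for j in range(i - 1, -1, -1):
                candidate = markables[j]
                if candidate.kind != "pronoun" and agree(candidate, current):
                    antecedent = j
                    break
        links.append(antecedent)
    return links


# Invented example mentions, in document order.
mentions = [
    Markable("Maria", "name", "f", "sg"),
    Markable("the company", "np", "n", "sg"),
    Markable("she", "pronoun", "f", "sg"),
    Markable("it", "pronoun", "n", "sg"),
]
links = link_markables(mentions)
```

Keeping identification and linking separate means the feature set can grow (for example with syntactic role or semantic class) without changing the linking loop.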
Stelios Piperidis