Natural Language Processing

Language is both the main medium of human communication and the main instrument for creating content, representing knowledge, and describing and interpreting data and information. Because it makes it possible to manage and process the huge volumes of content, data and information flowing through our digital universe in one or more languages, language technology is a key enabling technology. In the quest for the meaning of data, information and content, and given the demand for advanced, innovative applications, language technology, and natural language processing in particular, is essential. Likewise, breaking down language barriers and facilitating multilingual online communication is a central theme in the international research agenda. It underlines the need for affordable technologies and applications that enable communication and collaboration across languages, secure equal access to the information society for all language users, and support each language in the advanced functionalities of networked ICT. For such applications to work, at least two strands of technological development are pursued: (a) technologies for language, information and knowledge processing and management, and (b) multilingual and cross-lingual information processing, notably machine translation and machine-assisted human translation.

The unprecedented abundance of available information has spurred the rapid growth of technologies that support access to unstructured data. Large collections of text need to be analyzed, annotated and organized in order to be searchable and retrievable by machines. A first challenge is to identify the major topics discussed in a collection. Unsupervised and supervised statistical topic models have been applied extensively [Blei and Lafferty, 2009; Gimpel, 2006; Boyd-Graber et al., 2010; Griffiths et al., 2005] to reveal the underlying semantic structure of texts. Topic models serve a variety of applications, including classification, summarization, topic and trend tracking, and language model adaptation.

Informally, a topic expresses “what a document is about”, and in this sense topics refer to events or facts (specifying “who did what to whom and how, when and where”). Fact/event extraction is employed in applications such as information extraction, automatic summarization and question answering. Facts are usually studied in terms of their semantic structure (participants or arguments). These are defined mainly by the goal of the application at hand, so their definition is usually domain dependent. Several recent methodologies exploit collections of texts annotated for events and their participants. These approaches are based on (a) semantic frames following different semantic representations, (b) semantic roles defined independently of frames, and (c) semantic roles specific to particular event types. Recognition of events/facts also involves recognizing factuality, that is, the degree of veracity of an assertion. FactBank [Sauri and Pustejovsky, 2009] is a corpus developed on top of TimeBank that carries factuality annotation of the clauses. Additionally, [Prasad et al., 2006] describe an annotation scheme for propositions in the Penn Discourse Treebank designed to capture, among other things, the degree of factuality of events.

Recent advances in syntactic parsing (CoNLL Shared Task 2008) and coreference resolution for named entities (Coreference Resolution in Multiple Languages, SemEval 2010) have contributed significantly to mapping syntactic representations to underlying argument roles. In this context, dependency grammar is one of the most promising and rapidly growing paradigms in syntactic analysis, because dependency structures both extend naturally to semantic representations and are better suited to languages with free or flexible word order, such as Greek. This has led to the emergence of a large number of data-driven dependency parsers for diverse languages [Kübler et al., 2009].
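
As a rough illustration of how a dependency structure can be read off as argument roles, the sketch below uses a hand-written parse of an English sentence. The `Token` layout, the dependency labels, and the `ROLE_OF` mapping are assumptions made for this example only; they are not taken from any parser or annotation scheme discussed here, and the label-to-role mapping deliberately ignores cases such as passives.

```python
from collections import namedtuple

# One token of a dependency parse: position, word form, head position, relation label.
Token = namedtuple("Token", "index form head deprel")

# Hand-written parse of "Mary gave John a book"; head 0 marks the root.
PARSE = [
    Token(1, "Mary", 2, "nsubj"),
    Token(2, "gave", 0, "root"),
    Token(3, "John", 2, "iobj"),
    Token(4, "a", 5, "det"),
    Token(5, "book", 2, "obj"),
]

# Assumed, simplified mapping from dependency labels to coarse argument roles.
ROLE_OF = {"nsubj": "agent", "obj": "theme", "iobj": "recipient"}


def extract_arguments(parse):
    """Return the root predicate and the roles of its direct dependents."""
    root = next(tok for tok in parse if tok.head == 0)
    roles = {"predicate": root.form}
    for tok in parse:
        # Only direct dependents of the root with a mapped label become arguments.
        if tok.head == root.index and tok.deprel in ROLE_OF:
            roles[ROLE_OF[tok.deprel]] = tok.form
    return roles


print(extract_arguments(PARSE))
```

Because each argument attaches directly to its predicate, the extraction does not depend on the linear position of the words, which is one reason dependency structures fit languages with flexible word order.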

Document collections reflect the emotions and opinions pertaining to topics and entities. Analysis of sentiment and affect is a valuable component of applications such as opinion mining, computational advertising, customer feedback, reputation management and, more generally, social computing. Sentiment analysis refers to the automatic identification and assessment of users’ written subjective expressions. The fundamental technology of many sentiment analysis systems is classification performed at various levels of granularity, from whole documents (in order to determine global sentiment-related properties) to individual target sentences. Other approaches, focusing on classification at the level of single propositions or phrases, try to identify multiple subjective concepts and/or their sources/holders and targets. In most studies the problem is formulated in terms of two opposing classes (positive-negative, subjective-objective), to which machine learning techniques are applied.
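
The two-class formulation can be sketched with a small Naive Bayes classifier. This is a minimal illustration under stated assumptions: Naive Bayes is only one of many machine learning techniques that could be used, the training sentences are invented, and the function names `train_naive_bayes` and `classify` are hypothetical.

```python
import math
from collections import Counter

# Toy labelled sentences, invented for illustration only.
TRAIN = [
    ("great phone and excellent battery", "positive"),
    ("excellent screen great value", "positive"),
    ("terrible battery and poor screen", "negative"),
    ("poor value terrible support", "negative"),
]


def train_naive_bayes(examples):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(classify("great battery", model))
print(classify("poor support", model))
```

The same scheme applies at document or sentence level; only the unit of text passed to `classify` changes.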

On the multilingual information processing front, considerable progress in statistical machine translation (SMT) during the last two decades has substantially lowered entry barriers to MT technology (Brown et al., 1993; Koehn, 2010). Two main classes of problems, however, still seem to delay progress seriously:

  • absence of appropriate parallel data for many languages, domains and text types, for both training and evaluation, as well as of accurate methods for discovering and classifying such data;
  • inability to handle complex linguistic phenomena and to treat gaps (lexical and syntactic) in training data.

Progress in rule-based MT (RBMT) is hindered mainly by inadequate grammar resources for most languages and by the absence of appropriate lexical resources and methods that would enable correct disambiguation and lexical choice. A third MT paradigm is Example-based MT (EBMT) (Gough et al., 2004; Hutchins, 2005), which relies on a set of known pairs, each consisting of an input sentence (in the source language, SL) and its corresponding translation (in the target language, TL); new translations are generated by analogy with these pairs. Higher translation quality is expected from the emerging hybrid MT paradigms (HMT), which combine principles from more than one MT paradigm (Eisele et al., 2008; Wu, 2009; Jussa-Costa et al., 2013). Promising areas of research along this line are:

  • Exploitation of hidden parallel data
  • Exploitation of knowledge technology methods in MT
  • Application of machine learning techniques for optimising hybrid approaches
  • Automatic extraction of linguistic information from large corpora using techniques inspired by biological systems and/or computational intelligence methods (e.g. Detorakis et al., 2010; Sofianopoulos et al., 2010; Tsimboukakis et al., 2011)
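
The idea of translating by analogy with stored examples, as in example-based MT, can be sketched as follows. This is a minimal illustration that covers only the retrieval of the closest stored example; a full EBMT system would also adapt and recombine fragments of several examples. The example base, the English-French language pair, and the function names are hypothetical, and token-level matching with Python's `difflib.SequenceMatcher` is a stand-in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical example base of (source sentence, translation) pairs.
EXAMPLES = [
    ("the cat sleeps", "le chat dort"),
    ("the dog eats", "le chien mange"),
    ("the cat eats fish", "le chat mange du poisson"),
]


def similarity(a, b):
    """Token-level similarity between two sentences, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def translate_by_analogy(sentence, examples):
    """Return the closest stored source sentence, its translation and the match score."""
    best_source, best_target = max(
        examples, key=lambda pair: similarity(sentence, pair[0])
    )
    return best_source, best_target, similarity(sentence, best_source)


source, target, score = translate_by_analogy("a cat sleeps", EXAMPLES)
print(source, "->", target, round(score, 2))
```

The match score indicates how far the input is from the retrieved example, which a hybrid system could use to decide when to fall back on another MT paradigm.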

Three trends seem to be shaping the current development of machine translation technology: integration of linguistic knowledge, at varying levels, into SMT; addition of statistical layers to RBMT through language modelling; and combination of different MT paradigms.

Collaboration and intelligent crowd-sourcing are also proposed as the modus operandi for the future, ensuring the availability of high-quality, trustworthy translated content that could be deployed either as retrievable examples in translation memory databases or as training material for language-aware or language-agnostic SMT systems. Likewise, simple, bottom-up, per-domain building of language resources is advocated as a way to fill existing gaps. Collaborative development is also being investigated, drawing on the influential example of Wikipedia and Wiktionary [Chamberlain et al., 2013], in order to develop linguistic ontologies that are widely used in NLP.