Educational Greek Corpus


The Educational Greek Corpus (EGC) is the extended version of the Hellenic National Corpus (HNC) version 2.0. It comprises:

  • a general corpus
  • the textbooks corpus
  • the educators' corpus

The general corpus contains more than 34,000,000 words of written texts and is a part of the Hellenic National Corpus developed by the Institute for Language and Speech Processing / R.C. "Athena".
Texts in the HNC represent modern Greek language use, and most of them were written after 1990. To cover different types of language, texts have been selected from several media, belonging to different genres and dealing with various topics. Texts written in highly idiomatic language have been excluded from the corpus. Most texts have been selected for their wide readership (high-circulation newspapers, best-selling books, etc.).
The textbooks corpus contains 2,250,000 words from the textbooks used in Greek public schools.
The educators' corpus comprises texts selected and uploaded by the teachers themselves.
For every EGC text, users can access the following information:

  • Bibliographic data:
    • title
    • author
    • publisher
    • translator (in case the text is a translation)
    • date of publication
  • Classification data:
    • medium
    • genre and detailed genre
    • topic and detailed topic
    • school grade and subject (only for texts in the textbooks corpus)

How can one use EGC?
EGC offers the environment and the tools to:
1. Retrieve authentic examples of use of the Greek language by searching for:

  • specific word forms: by entering the word «παίζω», all sentences containing this word form are retrieved
  • lemmata: by entering the lemma «παίζω», all sentences containing any inflected form of the lemma «παίζω», such as «παίζει», «παίξω», «παίζοντας», are retrieved
  • parts of speech: by entering «noun», all sentences containing a noun are retrieved

2. Study the results of your query in concordances.
3. Look for specific word and lemma frequencies in the EGC corpora.
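
The difference between the three query types, and what a concordance or a frequency count looks like, can be illustrated with a small Python sketch. This is not the EGC implementation or its interface: the Token structure, the hand-tagged sample sentences and the function names search, concordance and frequencies are all hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form of the word
    pos: str    # part-of-speech tag


# Hypothetical, hand-tagged toy data; a real corpus is annotated automatically.
SENTENCES = [
    [Token("Το", "ο", "article"), Token("παιδί", "παιδί", "noun"),
     Token("παίζει", "παίζω", "verb")],
    [Token("Θέλω", "θέλω", "verb"), Token("να", "να", "particle"),
     Token("παίξω", "παίζω", "verb")],
    [Token("Παίζω", "παίζω", "verb"), Token("συχνά", "συχνά", "adverb")],
]


def search(sentences, field, value):
    """Return (sentence index, token index) pairs whose `field` equals `value`.

    `field` is "form", "lemma" or "pos"; matching ignores letter case.
    """
    value = value.lower()
    return [
        (i, j)
        for i, sentence in enumerate(sentences)
        for j, token in enumerate(sentence)
        if getattr(token, field).lower() == value
    ]


def concordance(sentences, hits, width=2):
    """Build keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for i, j in hits:
        sentence = sentences[i]
        left = " ".join(t.form for t in sentence[max(0, j - width):j])
        right = " ".join(t.form for t in sentence[j + 1:j + 1 + width])
        rows.append((left, sentence[j].form, right))
    return rows


def frequencies(sentences, field):
    """Count how often each value of `field` occurs across all sentences."""
    return Counter(
        getattr(token, field).lower()
        for sentence in sentences
        for token in sentence
    )


if __name__ == "__main__":
    # A word-form query matches only the exact form; a lemma query also
    # matches the inflected forms.
    print(search(SENTENCES, "form", "παίζω"))
    print(search(SENTENCES, "lemma", "παίζω"))
    for row in concordance(SENTENCES, search(SENTENCES, "lemma", "παίζω")):
        print(row)
    print(frequencies(SENTENCES, "lemma").most_common(3))
```

In this toy data the word-form query for «παίζω» matches a single token, while the lemma query also matches «παίζει» and «παίξω», which is the distinction described in step 1.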
