NLP for Greek language

This repo aggregates resources for the Greek language that address various Natural Language Processing/Understanding/Generation needs.

The language and countries

Greek is spoken by the majority of the population in two countries.

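============  ============================================
Country code  Country - ISO language code
============  ============================================
CY            https://img.shields.io/badge/Cyprus-el-green
GR            https://img.shields.io/badge/Greek-el-green
============  ============================================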

Greek Tree Bank

Morphological and syntactic annotations of a Greek corpus. This Greek UD treebank is used by many pretrained open-source components.

Manually annotated: lemmas, dependencies, POS, features.

Genres: news, wiki, spoken

Sources: public domain, Wikinews articles, European Parliament session texts.

Corpus size: 2,521 sentences / 61,673 tokens.

https://universaldependencies.org/treebanks/el_gdt/index.html
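Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token line carries ten tab-separated columns. The following is a minimal sketch of reading the lemma, POS, and feature annotations using only the standard library. The sample sentence and its simplified annotations were written by hand for illustration and are not copied from the treebank; the helper names `parse_conllu_sentence` and `parse_feats` are hypothetical.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hand-written illustrative sentence (assumption: simplified annotations).
_ROWS = [
    ("1", "Η", "ο", "DET", "_", "Case=Nom|Gender=Fem|Number=Sing", "2", "det", "_", "_"),
    ("2", "Αθήνα", "Αθήνα", "PROPN", "_", "Case=Nom|Gender=Fem|Number=Sing", "4", "nsubj", "_", "_"),
    ("3", "είναι", "είμαι", "AUX", "_", "Number=Sing|Person=3|VerbForm=Fin", "4", "cop", "_", "_"),
    ("4", "πόλη", "πόλη", "NOUN", "_", "Case=Nom|Gender=Fem|Number=Sing", "0", "root", "_", "SpaceAfter=No"),
    ("5", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"),
]
SAMPLE = "# text = Η Αθήνα είναι πόλη.\n" + "\n".join("\t".join(r) for r in _ROWS) + "\n"


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Turn one CoNLL-U sentence block into a list of per-token dicts."""
    tokens = []
    for line in block.splitlines():
        # Skip blank separator lines and '#' comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    return tokens


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a FEATS cell such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":  # '_' marks an empty cell
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


tokens = parse_conllu_sentence(SAMPLE)
lemmas = [t["lemma"] for t in tokens]
```

This sketch keeps multiword-token range lines (IDs such as `1-2`) as ordinary rows; real code would likely handle them separately or use a dedicated CoNLL-U library.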

Pipeline Components

Accentuation and diacritics

Greek text often needs its accents and diacritics removed before further processing. Some newer tokenizers include this step, but earlier versions do not. https://legacy.cltk.org/en/latest/greek.html
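When a tokenizer does not handle this step, accents can be stripped with Python's standard `unicodedata` module. The sketch below is one common approach, not the method used by CLTK or any other library listed here; the function name `strip_diacritics` is hypothetical. It decomposes each character, drops the combining marks, and recomposes the result.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents and other combining marks from Greek text."""
    # NFD splits e.g. 'έ' into the base letter 'ε' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category 'Mn' (nonspacing mark) covers tonos, dialytika, breathings, etc.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose so the output is in the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


modern = strip_diacritics("Καλημέρα κόσμε")
polytonic = strip_diacritics("ἄνθρωπος")
```

Stripping is lossy: for example, "πότε" (when) and "ποτέ" (never) both become "ποτε", so whether to apply it depends on the downstream task.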

Lemmatization

spaCy lemmatizer (trainable lemmatizer)

JohnSnowLabs Greek lemmatizer

CLTK Greek lemmatizer

Tokenization

Depending on the situation, a corpus may need different kinds of tokenization. The sources below include general-purpose tokenizers for word, sentence, and paragraph tokenization.
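To illustrate the difference between sentence and word tokenization on Greek text, here is a naive regex-based sketch using only the standard library. It is not how NLTK or spaCy tokenize; the function names are hypothetical, and the rules are deliberately simple. One Greek-specific detail is that ";" acts as the question mark, so it is treated as a sentence terminator here.

```python
import re
import unicodedata
from typing import List

# Sentence boundary: whitespace that follows '.', '!', ';' (Greek question mark) or '…'.
_SENTENCE_END = re.compile(r"(?<=[.!;…])\s+")
# A token is either a run of word characters or a single punctuation character.
_WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Naively split text into sentences on terminal punctuation."""
    # NFC also maps U+037E (GREEK QUESTION MARK) to the ordinary ';'.
    text = unicodedata.normalize("NFC", text)
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def tokenize_words(sentence: str) -> List[str]:
    """Split a sentence into word and punctuation tokens."""
    # NFC keeps accented letters precomposed so \w matches them as one unit.
    return _WORD_OR_PUNCT.findall(unicodedata.normalize("NFC", sentence))


sentences = split_sentences("Πού πας; Πάω στην Αθήνα. Ωραία!")
words = tokenize_words(sentences[0])
```

A rule this simple will split incorrectly after abbreviations such as "κ." and treats every ";" as a sentence end, which is one reason to prefer the trained tokenizers listed below for real corpora.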

NLTK

NLTK tokenizer module

spaCy

spaCy Tokenizer. A pipeline component for sentence segmentation, senter, is also available for Greek.

Other

spaCy offers other helpful components: morphologizer, dependency parser, attribute ruler.

NLP tasks

Named Entity Recognition

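=========  =====================================  ================================
Source     Supported labels                       Link
=========  =====================================  ================================
spaCy      EVENT, GPE, LOC, ORG, PERSON, PRODUCT  spaCy models
Spark NLP
Stanza
AUEB       LOC, ORG, PERSON                       gr-nlp-toolkit transformer-based
=========  =====================================  ================================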

Translation

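============  ========================================  ====
Package       Details                                   Link
============  ========================================  ====
Spark NLP     Multilingual (wrapped from Hugging Face)
Transformers  Multilingual
============  ========================================  ====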

Question Answering

Cross-lingual QA dataset: XQuAD
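XQuAD files follow the SQuAD v1.1 JSON layout (articles, then paragraphs, then questions with character-offset answers). The sketch below walks that structure and checks that each stored `answer_start` offset points at the answer text. The one-question record is hypothetical and written for illustration only, not taken from XQuAD; `iter_answers` and `span_matches` are hypothetical helper names.

```python
import json
from typing import Iterator, Tuple

# Hypothetical record in SQuAD v1.1 layout (assumption: not real XQuAD data).
SAMPLE_JSON = json.dumps({
    "version": "1.1",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "Η Αθήνα είναι η πρωτεύουσα της Ελλάδας.",
            "qas": [{
                "id": "example-0",
                "question": "Ποια είναι η πρωτεύουσα της Ελλάδας;",
                "answers": [{"text": "Αθήνα", "answer_start": 2}],
            }],
        }],
    }],
}, ensure_ascii=False)


def iter_answers(dataset: dict) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (context, question, answer_text, answer_start) for every answer."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield context, qa["question"], answer["text"], answer["answer_start"]


def span_matches(context: str, text: str, start: int) -> bool:
    """Check that the character offset really points at the answer text."""
    return context[start:start + len(text)] == text


dataset = json.loads(SAMPLE_JSON)
checked = [(q, a, span_matches(c, a, s)) for c, q, a, s in iter_answers(dataset)]
```

A check like this is worth running after any text normalization, such as accent stripping, because changes to the context string can shift the character offsets.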

Transformers model

A BERT model pretrained on Greek corpora only.

bert-base-greek-uncased-v1

Greek BERT

Other

Proper nouns

List of 144,000 Classical Greek proper nouns

Ancient Greek

Some handy stuff for Ancient Greek