Yuliya-HV/nlp-greek

NLP for Greek language

This repo aggregates Greek-language resources for various Natural Language Processing, Understanding, and Generation tasks.

Contents

The language and countries

Greek is spoken by the majority of the population in two countries.

Country code | Badge (country - ISO language code)
CY | https://img.shields.io/badge/Cyprus-el-green
GR | https://img.shields.io/badge/Greek-el-green

Greek Tree Bank

Morphological and syntactic annotations of a Greek corpus. This Greek UD treebank is used by many other pretrained open-source components.

Manually annotated: lemmas, dependencies, POS, features.

Genres: news, wiki, spoken

Sources: public domain, Wikinews articles, European Parliament session texts.

Corpus size: 2,521 sentences / 61,673 tokens.

https://universaldependencies.org/treebanks/el_gdt/index.html
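UD treebanks are distributed as CoNLL-U files, where each token is one line of ten tab-separated columns and sentences are separated by blank lines. The following is a minimal sketch of reading that format with the standard library only. The `Token` class, the `parse_conllu` helper, and the sample sentence are illustrative assumptions written for this example; the sentence is not taken from the treebank.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: str
    form: str
    lemma: str
    upos: str
    feats: str
    head: str
    deprel: str


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        current.append(
            Token(cols[0], cols[1], cols[2], cols[3], cols[5], cols[6], cols[7])
        )
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative sentence (hypothetical, not from the treebank).
_ROWS = [
    ["1", "Η", "ο", "DET", "_", "Case=Nom|Gender=Fem|Number=Sing", "2", "det", "_", "_"],
    ["2", "γάτα", "γάτα", "NOUN", "_", "Case=Nom|Gender=Fem|Number=Sing", "3", "nsubj", "_", "_"],
    ["3", "κοιμάται", "κοιμάμαι", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = "# text = Η γάτα κοιμάται.\n" + "\n".join("\t".join(r) for r in _ROWS) + "\n\n"

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token.form, token.lemma, token.upos, token.deprel)
```

For real work, a dedicated CoNLL-U reader is likely a better choice, since this sketch ignores multiword-token ranges and empty nodes.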

Pipeline Components

Accentuation and diacritics

Greek text often needs its accents and diacritics removed before further processing. Some newer tokenizers include this step, but earlier ones do not. https://legacy.cltk.org/en/latest/greek.html
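When a tokenizer does not handle this step, accents can be stripped with Python's standard `unicodedata` module. The sketch below is one common approach, not the method used by any of the linked libraries: it decomposes each character (NFD), drops the combining marks, and recomposes the result (NFC). The function name `strip_accents` is an assumption made for this example.

```python
import unicodedata


def strip_accents(text):
    """Remove accents and other combining diacritics from Greek text."""
    # NFD splits a precomposed letter such as "έ" into "ε" + combining acute.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers tonos, dialytika, breathings,
    # circumflex, and iota subscript.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    print(strip_accents("Καλημέρα"))   # monotonic
    print(strip_accents("ἄνθρωπος"))   # polytonic
```

Stripping diacritics loses information (for example, the dialytika in "προϊόν"), so whether to apply it depends on the downstream model: an uncased, accent-free model expects it, while a cased one may not.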

Lemmatization

Spacy lemmatizer (trainable lemmatizer)

JohnSnowLabs Greek lemmatizer

CLTK Greek lemmatizer

Tokenization

Depending on the situation, a corpus may need different kinds of tokenization. The sources below include general-purpose tokenizers for word, sentence, and paragraph tokenization.
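To illustrate the difference between sentence and word tokenization, here is a rough regex-based sketch using only the standard library. It is not how NLTK or Spacy tokenize; it only shows one Greek-specific detail, namely that the Greek question mark is written as ";". The helper names `split_sentences` and `tokenize_words` are assumptions made for this example.

```python
import re

# Sentence-final punctuation: ".", "!", "…", and the Greek question mark,
# typed either as ";" (U+003B) or as U+037E.
_SENT_BOUNDARY = re.compile(r"(?<=[.!;\u037e…])\s+")

# A word is a run of letters/digits; any other non-space character is
# emitted as its own token. Assumes NFC-normalized (precomposed) text.
_WORD = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at whitespace following end punctuation."""
    return [s for s in _SENT_BOUNDARY.split(text.strip()) if s]


def tokenize_words(sentence):
    """Split a sentence into word and punctuation tokens."""
    return _WORD.findall(sentence)


if __name__ == "__main__":
    for sentence in split_sentences("Πού πας; Πάω σπίτι. Ωραία!"):
        print(tokenize_words(sentence))
```

This sketch will wrongly split after abbreviations such as "π.χ.", which is one reason to prefer the trained tokenizers listed below for real corpora.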

NLTK

NLTK tokenizer module

Spacy

Spacy Tokenizer. A pipeline component for Greek, senter, is also available for sentence segmentation.

Other

Spacy offers other helpful components: morphologizer, dependency parser, attribute ruler.

NLP tasks

Named Entity Recognition

Source | Supported labels | Link
Spacy | EVENT, GPE, LOC, ORG, PERSON, PRODUCT | Spacy models
Spark NLP | |
Stanza | |
AUEB | LOC, ORG, PERSON, | gr-nlp-toolkit transformer-based

Translation

Package | Details | Link
Spark NLP | Multilingual (wrapped from Hugging Face) |
Transformers | Multilingual |

Question Answering

Cross-lingual QA dataset: XQuAD

Transformers model

A BERT model pretrained on a Greek corpus only.

bert-base-greek-uncased-v1

Greek BERT

Other

Proper nouns

List of 144,000 Classical Greek proper nouns

Ancient Greek

Some handy stuff for Ancient Greek
