faroese-corpus

Faroese corpus taken from Wikipedia dumps.

This repository will contain a corpus of the Faroese language, taken from the content dump of the Faroese Wikipedia.

pipenv

This project uses pipenv. See the pipenv documentation for how to install it.

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:

pipenv install
sudo apt install libarchive-dev

Links

Scripts

Run pipenv shell before running the scripts.

words_from_dump.py

Shows the longest words taken from the dump:

1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
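
A ranking like the one above could be produced along the following lines. This is a minimal sketch, not the actual contents of words_from_dump.py: it assumes the dump text is already available as an iterable of plain-text lines, and the names `longest_words` and `format_ranking`, the letters-only tokenization, the lowercasing, and the alphabetical tie-breaking are all illustrative assumptions.

```python
import re
from collections.abc import Iterable

# Runs of Unicode letters; digits, underscores and punctuation split words.
# (Assumption: the real script may tokenize differently.)
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(lines: Iterable[str], limit: int = 10) -> list[tuple[str, int]]:
    """Return up to `limit` distinct lowercased words, longest first."""
    words: set[str] = set()
    for line in lines:
        words.update(word.lower() for word in WORD_RE.findall(line))
    # Sort by length descending; break ties alphabetically so the
    # result is deterministic (hypothetical choice, not from the README).
    ranked = sorted(words, key=lambda word: (-len(word), word))
    return [(word, len(word)) for word in ranked[:limit]]


def format_ranking(ranking: list[tuple[str, int]]) -> list[str]:
    """Format rows as 'rank word - length', matching the listing above."""
    return [
        f"{rank} {word} - {length}"
        for rank, (word, length) in enumerate(ranking, start=1)
    ]


if __name__ == "__main__":
    sample = [
        "Norðurlandameistarakappingini vóru í Havn.",
        "Krabbameinsgranskingarstovnurin er í Føroyum.",
    ]
    for row in format_ranking(longest_words(sample, limit=3)):
        print(row)
```

Collecting words into a set before sorting keeps each word once no matter how often it occurs, which is what a "longest words" listing needs; reading and decompressing the dump itself is left out here.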