Skip to content

English-Sinhala parallel word dictionary and resources

License

Notifications You must be signed in to change notification settings

kasunw22/sinhala-para-dict

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sinhala-para-dict

English-Sinhala parallel word dictionary and resources

  • scripts/si-stop-words.txt is taken from this repository.
  • word-frequencies/all_counts.txt contains the word-frequency analysis we have done. We used STM and SinMin corpora. There are 2.16M unique words.

BibTex Citation

If you are willing to use this work, please be kind enough to cite the following paper.

@INPROCEEDINGS{Wick2308:Sinhala,
    AUTHOR="Kasun Wickramasinghe and Nisansa {de Silva}",
    TITLE="{Sinhala-English} Parallel Word Dictionary Dataset",
    BOOKTITLE="2023 IEEE 17th International Conference on Industrial and Information
    Systems (ICIIS) (ICIIS'2023)",
    ADDRESS="Peradeniya, Sri Lanka",
    PAGES=6,
    DAYS=24,
    MONTH=aug,
    YEAR=2023,
    KEYWORDS="Parallel Corpus; Alignment; English-Sinhala Dictionary; Word Embedding
    Alignment; Lexicon Induction",
    ABSTRACT="Parallel datasets are vital for performing and evaluating any kind of
    multilingual task. However, in the cases where one of the considered
    language pairs is a low-resource language, the existing top-down parallel
    data such as corpora are lacking in both tally and quality due to the
    dearth of human annotation. Therefore, for low-resource languages, it is
    more feasible to move in the bottom-up direction where finer granular pairs
    such as dictionary datasets are developed first. They may then be used for
    mid-level tasks such as supervised multilingual word embedding alignment.
    These in turn can later guide higher-level tasks in the order of aligning
    sentence or paragraph text corpora used for Machine Translation (MT). Even
    though more approachable than generating and aligning a massive corpus for
    a low-resource language, for the same reason of apathy from larger research
    entities, even these finer granular data sets are lacking for some
    low-resource languages. We have observed that there is no free and open
    dictionary data set for the low-resource language, Sinhala. Thus, in this
    work, we introduce three parallel English-Sinhala word dictionaries
    (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in
    multilingual Natural Language Processing (NLP) tasks related to English and
    Sinhala languages. In this paper, we explain the dataset creation pipeline
    as well as the experimental results of the tests we have carried out to
    verify the quality of the data sets. The data sets and the related scripts
    are available at https://github.com/kasunw22/sinhala-para-dict/tree/main."
}