banglaLM

BanglaLM: Bangla Corpus For Language Model Research size: 40GB

This dataset consists of three parts:

  • Raw data
  • Preprocessed V1
  • Preprocessed V2

Link of the dataset

Kaggle: BanglaLM: Bangla Corpus For Language Model Research

Details of the dataset

We have collected text data consisting of strings of various lengths. The total volume of the data is 14 Gigabytes. The data was collected from various websites, including newspapers, social networks, blog sites, and Wikipedia. The newspaper websites include Prothom Alo, BD News, Jugantor, Jaijaidin, and others. We collected the raw data using Python scripts and performed the necessary preprocessing while saving the data to local storage. We then applied further preprocessing steps, described in the preprocessing section of the accompanying paper. In the meantime, we have started to build models based on this data; the preliminary results are satisfactory, which supports the quality of the dataset.

There are a total of 19,132,010 observations in the dataset. We are releasing three versions: (i) Raw data, (ii) Preprocessed V1, and (iii) Preprocessed V2. The raw data can be preprocessed according to the demands of any particular project, Preprocessed V1 is intended for LSTM-based machine learning models, and Preprocessed V2 is better suited to statistical models. The dataset can also be manually labeled for use in supervised learning.

Fig. 1 below shows a screen copy of the dataset viewed as a pandas DataFrame. Each index indicates a particular entry, and the right column shows the corresponding text. Fig. 2 and Fig. 3 show the word counts of the raw data and Preprocessed V2, respectively.
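As a rough illustration of the kind of cleaning such a corpus typically needs (this is a minimal sketch, not the authors' actual preprocessing pipeline, and `clean_bangla` is a hypothetical helper name), one might keep only Bangla-script characters and basic punctuation, then normalize whitespace:

```python
import re

def clean_bangla(text: str) -> str:
    """Minimal Bangla cleaning sketch (not the authors' exact pipeline):
    keep characters in the Bengali Unicode block (U+0980-U+09FF), the
    danda/double danda (U+0964, U+0965), basic punctuation, and whitespace;
    replace everything else with a space, then collapse repeated whitespace."""
    text = re.sub(r"[^\u0980-\u09FF\u0964\u0965\s.,!?]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Example: HTML remnants are stripped, Bangla text and the danda survive.
print(clean_bangla("আমি <b>বাংলায়</b> গান গাই।"))
```

A real pipeline for a 14 GB corpus would stream the text in chunks rather than load it into memory, but the per-string cleaning step would look similar.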

Fig. 1: Screen copy of the dataset.

Fig. 2: Summary of the raw data.

Fig. 3: Summary of Preprocessed Data V2.


The workflow of the data collection procedure is shown below in Fig. 5.

Fig. 5: Flowchart of the data collection procedure.

Get the data

You can use direct links to download the dataset.

| Name            | Size     | Link (Compressed ZIP) |
|-----------------|----------|-----------------------|
| Raw data        | 13.27 GB | Download              |
| Preprocessed V1 | 13.22 GB | Download              |
| Preprocessed V2 | 12.89 GB | Download              |
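Once you have a download link from the table above, fetching and unpacking an archive is a two-step job. The sketch below assumes a hypothetical URL (`DATASET_URL` is a placeholder; substitute the actual link):

```python
import zipfile
from urllib.request import urlretrieve

# Hypothetical placeholder -- substitute the real link from the table above.
DATASET_URL = "https://example.com/banglalm_preprocessed_v2.zip"

def download_and_extract(url: str, zip_path: str, out_dir: str) -> None:
    """Download a dataset archive to zip_path and extract it into out_dir."""
    urlretrieve(url, zip_path)              # save the compressed ZIP to disk
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)              # unpack the corpus files

if __name__ == "__main__":
    download_and_extract(DATASET_URL, "banglalm.zip", "banglalm/")
```

Note that the archives are 12–14 GB each, so make sure the target disk has enough free space for both the ZIP and the extracted files.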

Usage

A bert-base-bangla model (a Transformer-based masked language model) has been developed using this dataset.

This dataset has also been used to train the pretrained Bangla FastText Model & Toolkit.

To install the latest release, run:

!pip install BanglaFastText

For further information and an introduction, visit the GitHub repo: Bangla FastText Model & Toolkit

License

The contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.

Cite this dataset👍

@inproceedings{kowsher-etal-2021-banglalm,
    title = "BanglaLM: Bangla Corpus for Language Model Research",
    author = "Kowsher, Md. and
      Uddin, Md. Jashim and
      Tahabilder, Anik and
      Ruhul Amin, Md. and
      Shahriar, Md. Fahim and
      Sobuj, Md. Shohanur Islam",
    booktitle = "International Conference on Inventive Research in Computing Applications (ICIRCA)",
    month = "September",
    year = "2021",
    address = "Online",
    publisher = "IEEE",
    url = "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3882903"
}
