banglaLM

BanglaLM: Bangla Corpus For Language Model Research size: 40GB

This dataset consists of three parts:

  • Raw data
  • Preprocessed V1
  • Preprocessed V2

Link of the dataset

Kaggle: BanglaLM: Bangla Corpus For Language Model Research

Details of the dataset

We have collected text data consisting of strings of various lengths. The total volume of the data is 14 Gigabytes. The data was collected from various websites, including newspapers, social networks, blog sites, and Wikipedia. The newspaper websites include Prothom Alo, BD News, Jugantor, Jaijaidin, and others. We collected the raw data using Python scripts and performed the necessary preprocessing while saving the data to local storage. We then applied further preprocessing steps, described in the preprocessing section of the accompanying paper. In the meantime, we have started to build models based on this data; the preliminary results are satisfactory, which supports the quality of the dataset.

There are a total of 19,132,010 observations in the dataset. We are releasing three versions: (i) Raw data, (ii) Preprocessed V1, and (iii) Preprocessed V2. The raw data can be preprocessed according to the demands of any particular project, Preprocessed V1 is intended for LSTM-based machine learning models, and Preprocessed V2 is better suited to statistical models. The dataset can also be manually labeled for use in supervised learning.

Fig. 1 below shows a screen copy of the dataset viewed as a pandas DataFrame. Each index indicates a particular entry, and the right column shows the corresponding text. Fig. 2 and Fig. 3 show the word counts of the raw data and Preprocessed V2, respectively.
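As a rough illustration of the kind of cleaning such a corpus typically needs (this is a minimal sketch, not the authors' actual preprocessing pipeline, and `clean_bangla` is a hypothetical helper name), one might keep only Bangla-script characters and basic punctuation, then normalize whitespace:

```python
import re

def clean_bangla(text: str) -> str:
    """Minimal Bangla cleaning sketch (not the authors' exact pipeline):
    keep characters in the Bengali Unicode block (U+0980-U+09FF), the
    danda/double danda (U+0964, U+0965), basic punctuation, and whitespace;
    replace everything else with a space, then collapse repeated whitespace."""
    text = re.sub(r"[^\u0980-\u09FF\u0964\u0965\s.,!?]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Example: HTML remnants are stripped, Bangla text and the danda survive.
print(clean_bangla("আমি <b>বাংলায়</b> গান গাই।"))
```

A real pipeline for a 14 GB corpus would stream the text in chunks rather than load it into memory, but the per-string cleaning step would look similar.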

Fig. 1: Screen copy of the dataset.

Fig. 2: Summary of the raw data.

Fig. 3: Summary of Preprocessed Data V2.


The workflow of the data collection procedure is shown below in Fig. 5.

Fig. 5: Flowchart of the data collection procedure.

Get the data

You can use direct links to download the dataset.

| Name            | Size     | Link (Compressed ZIP) |
|-----------------|----------|-----------------------|
| Raw data        | 13.27 GB | Download              |
| Preprocessed V1 | 13.22 GB | Download              |
| Preprocessed V2 | 12.89 GB | Download              |
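Once you have a download link from the table above, fetching and unpacking an archive is a two-step job. The sketch below assumes a hypothetical URL (`DATASET_URL` is a placeholder; substitute the actual link):

```python
import zipfile
from urllib.request import urlretrieve

# Hypothetical placeholder -- substitute the real link from the table above.
DATASET_URL = "https://example.com/banglalm_preprocessed_v2.zip"

def download_and_extract(url: str, zip_path: str, out_dir: str) -> None:
    """Download a dataset archive to zip_path and extract it into out_dir."""
    urlretrieve(url, zip_path)              # save the compressed ZIP to disk
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)              # unpack the corpus files

if __name__ == "__main__":
    download_and_extract(DATASET_URL, "banglalm.zip", "banglalm/")
```

Note that the archives are 12–14 GB each, so make sure the target disk has enough free space for both the ZIP and the extracted files.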

Usage

A bert-base-bangla model (a Transformer-based masked language model) has been developed using this dataset.

This dataset has also been used to train the pretrained Bangla FastText Model & Toolkit.

To install the latest release, run:

!pip install BanglaFastText

For further information and an introduction, visit the GitHub repo: Bangla FastText Model & Toolkit

License

The contents of this repository are restricted to non-commercial research purposes under the Creative Commons Attribution 4.0 International License. Copyright of the dataset contents belongs to the original copyright holders.

Cite this dataset👍

@inproceedings{kowsher-etal-2021-banglalm,
    title = "BanglaLM: Bangla Corpus for Language Model Research",
    author = "Kowsher, Md. and
      Uddin, Md. Jashim and
      Tahabilder, Anik and
      Ruhul Amin, Md. and
      Shahriar, Md. Fahim and
      Sobuj, Md. Shohanur Islam",
    booktitle = "International Conference on Inventive Research in Computing Applications (ICIRCA)",
    month = "September",
    year = "2021",
    address = "Online",
    publisher = "IEEE",
    url = "https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3882903"
}
