HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated with a binary class (offensive or non-offensive) and a hate speech target (race, gender, or none).

franciellevargas/HausaHate

HausaHate: A Benchmark Dataset for Hausa Hate Speech Detection


In African countries, hate speech is an especially serious problem because of a history of ethnic conflict. Research on hate speech in the indigenous languages of West Africa in particular remains scarce. Moreover, because most existing hate speech data resources were developed for English, hate speech technologies for African indigenous languages are less developed. To fill this gap, we introduce the first expert annotated corpus of Facebook comments for Hausa hate speech detection. The corpus, titled HausaHate, comprises 2,000 comments extracted from West African Facebook pages and manually annotated by three native Hausa speakers who are also NLP experts. The corpus was annotated in two layers. First, each comment was labeled according to a binary classification: offensive versus non-offensive. Then, each offensive comment was also labeled according to its hate speech target: race, gender, or none. Lastly, we present a baseline model for Hausa hate speech detection based on a fine-tuned LLM, highlighting the challenges of hate speech detection for indigenous languages in Africa as well as directions for future work. The following tables describe the HausaHate categories and the number of comments in each:



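| Offensive | Non-Offensive | Total Comments |
|-----------|---------------|----------------|
| 678       | 1,322         | 2,000          |

| Race | Gender | Non-Target | Total |
|------|--------|------------|-------|
| 391  | 65     | 222        | 678   |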
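The two annotation layers and the counts above can be sketched in a few lines of Python. This is a minimal illustration only: the `Comment` class, its field names, and the `shares` helper are hypothetical and do not reflect the actual file format or column names of the released corpus. The counts are copied from the tables above.

```python
from dataclasses import dataclass
from typing import Optional

# Second-layer labels, assigned only to offensive comments.
TARGETS = ("race", "gender", "none")


@dataclass(frozen=True)
class Comment:
    """Hypothetical record for one annotated comment (not the real schema)."""
    text: str
    offensive: bool                 # layer 1: offensive vs. non-offensive
    target: Optional[str] = None    # layer 2: only set when offensive is True

    def __post_init__(self):
        if self.offensive and self.target not in TARGETS:
            raise ValueError("offensive comments need a target in TARGETS")
        if not self.offensive and self.target is not None:
            raise ValueError("non-offensive comments carry no target label")


# Counts copied from the tables above.
LAYER1 = {"offensive": 678, "non-offensive": 1322}
LAYER2 = {"race": 391, "gender": 65, "none": 222}


def shares(counts):
    """Return each label's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# The target labels partition the offensive comments.
assert sum(LAYER2.values()) == LAYER1["offensive"]

print(shares(LAYER1))
print(shares(LAYER2))
```

The shares make the class imbalance explicit: roughly one third of the comments are offensive, and gender-targeted comments are the smallest group, which is worth keeping in mind when choosing evaluation metrics or splitting the data.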

The following is the list of collaborators and authors of this project:


ETHICS STATEMENT

We followed the anonymization steps described in Section 4.2.3 of the paper, as is standard for papers that use this kind of data. A public corpus of anonymized Facebook comments is available. However, since the last change to the Meta platform terms of service was in 2020, we decided to disclose the ids of the comments only on request. This allows reproducibility while still requiring researchers to go through Meta's authorization procedures to access the full data. To preserve the anonymization, the publicly provided comments do not include their ids or links. Please contact [email protected] to request the corpus with the ids and links of the comments.

CITING

Vargas, F., Guimarães, S., Muhammad, S. H., Alves, D., Ahmad, I. S., Abdulmumin, I., Mohamed, D., Pardo, T. A. S., Benevenuto, F. (2024). HausaHate: An Expert Annotated Corpus for Hausa Hate Speech Detection. Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH @ NAACL 2024), pp. 52-58. Mexico City, Mexico. https://aclanthology.org/2024.woah-1.5


BIBTEX

@inproceedings{vargas-etal-2024-hausahate,
    title = "{H}ausa{H}ate: An Expert Annotated Corpus for {H}ausa Hate Speech Detection",
    author = "Vargas, Francielle and Guimar{\~a}es, Samuel and Muhammad, Shamsuddeen Hassan and Alves, Diego and Ahmad, Ibrahim Said and Abdulmumin, Idris and Mohamed, Diallo and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio",
    editor = {Chung, Yi-Ling and Talat, Zeerak and Nozza, Debora and Plaza-del-Arco, Flor Miriam and R{\"o}ttger, Paul and Mostafazadeh Davani, Aida and Calabrese, Agostina},
    booktitle = "Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.woah-1.5",
    pages = "52--58",
}

