nilc-nlp/CORAA

CORAA ASR - v1.1

CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with corresponding transcriptions (more than 400k audio segments). The audio comes from five original projects:

  • ALIP (Gonçalves, 2019)
  • C-ORAL Brazil (Raso and Mello, 2012)
  • NURC-Recife (Oliveira Jr., 2016)
  • SP-2010 (Mendes and Oushiro, 2012)
  • TEDx talks (talks in Portuguese)

The audio was either validated by annotators or transcribed for the first time, with the ASR task in mind.

Metadata

  • file_path: the path to an audio file
  • task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to votes_for_* metrics); annotation_and_transcription (both tasks were performed)
  • variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
  • dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
  • accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or the value "miscellaneous"
  • speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
  • speech_style: Spontaneous Speech or Prepared Speech or Read Speech
  • up_votes: for annotation, the number of votes marking the audio as valid (most audio files were reviewed by one annotator, but some were reviewed by more than one)
  • down_votes: for annotation, the number of votes marking the audio as invalid (always smaller than up_votes)
  • votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
  • votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
  • votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having either noise or a low voice, without impairing comprehension of the audio
  • votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing comprehension of the audio
  • votes_for_no_identified_problem: for annotation, votes categorizing the audio as having no identified phenomenon (none of the four described above)
  • text: the transcription for the audio
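
The fields above can be used to select a subset of the corpus. The following is a minimal sketch, assuming the metadata is distributed as a comma-separated file with a header row matching the field names listed above; the sample rows and the function names `load_rows` and `select_clean_brazilian` are hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical sample rows; real values come from the downloaded metadata file.
SAMPLE_CSV = """file_path,task,variety,dataset,accent,speech_genre,speech_style,up_votes,down_votes,votes_for_hesitation,votes_for_filled_pause,votes_for_noise_or_low_voice,votes_for_second_voice,votes_for_no_identified_problem,text
train/a.wav,annotation,PT_BR,ALIP,Sao Paulo cities,Interviews,Spontaneous Speech,2,0,1,0,0,0,1,exemplo um
train/b.wav,annotation,PT_PT,TEDx Portuguese,miscellaneous,Stage Talks,Prepared Speech,1,0,0,0,0,0,1,exemplo dois
train/c.wav,annotation,PT_BR,NURC-RE,Recife,Dialogues,Spontaneous Speech,1,0,0,0,1,0,0,exemplo tres
"""


def load_rows(handle):
    """Parse metadata rows, converting the vote columns to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for key in list(row):
            if key.endswith("_votes") or key.startswith("votes_for_"):
                row[key] = int(row[key])
        rows.append(row)
    return rows


def select_clean_brazilian(rows):
    """Keep PT_BR rows with no noise/low-voice and no second-voice votes."""
    return [
        r for r in rows
        if r["variety"] == "PT_BR"
        and r["votes_for_noise_or_low_voice"] == 0
        and r["votes_for_second_voice"] == 0
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
clean = select_clean_brazilian(rows)
print([r["file_path"] for r in clean])
```

To run this against a real split, replace the in-memory sample with an open file handle for the downloaded metadata, adjusting the delimiter if the actual file differs from the assumption made here.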

Downloads:

Dataset:

| Gdrive | Internal | Hugging Face |
| Train audios | Train audios | Train audios |
| Train transcriptions and metadata | Train transcriptions and metadata | Train transcriptions and metadata |
| Dev audios | Dev audios | Dev audios |
| Dev transcriptions and metadata | Dev transcriptions and metadata | Dev transcriptions and metadata |
| Test audios | Test audios | Test audios |
| Test transcriptions and metadata | Test transcriptions and metadata | Test transcriptions and metadata |
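
Because audio and metadata are downloaded as separate archives, it can help to confirm that every metadata row points at a file that was actually extracted. The sketch below assumes a comma-separated metadata file whose `file_path` values are relative to the directory where the audio archive was unpacked; that layout, the helper name `missing_audio`, and the demo file names are assumptions, not something the repository documents.

```python
import csv
import io
import tempfile
from pathlib import Path


def missing_audio(metadata_handle, audio_root):
    """Return the file_path values that have no matching file under audio_root."""
    root = Path(audio_root)
    return [
        row["file_path"]
        for row in csv.DictReader(metadata_handle)
        if not (root / row["file_path"]).is_file()
    ]


# Demo on a throwaway directory: one audio file present, one absent.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dev").mkdir()
    (Path(tmp) / "dev" / "a.wav").touch()
    metadata = io.StringIO(
        "file_path,text\n"
        "dev/a.wav,exemplo um\n"
        "dev/b.wav,exemplo dois\n"
    )
    missing = missing_audio(metadata, tmp)

print(missing)
```

An empty result means every listed audio file was found; any returned paths indicate an incomplete download or a different directory layout than assumed.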

The following link contains raw audio (without annotation), kept separate from the other audio available for download: https://zenodo.org/record/6794924#.YsXWMEjMJkg

Experiments:

Model trained on this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)

Citation

  • Full Paper:
@article{candido2022coraa,
  title={CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
  author={Candido Junior, Arnaldo and Casanova, Edresson and Soares, Anderson and de Oliveira, Frederico Santos and Oliveira, Lucas and Junior, Ricardo Corso Fernandes and da Silva, Daniel Peixoto Pinto and Fayet, Fernando Gorgulho and Carlotto, Bruno Baldissera and Gris, Lucas Rafael Stefanel and others},
  journal={Language Resources and Evaluation},
  pages={1--33},
  year={2022},
  publisher={Springer}
}

Partners / Sponsors / Funding

References

  • Gonçalves SCL (2019) Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do Português Brasileiro. Revista Estudos Linguísticos 48(1):276–297.
  • Raso T, Mello H, Mittmann MM (2012) The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, pp 106–113, URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/624_Paper.pdf
  • Oliveira Jr M (2016) Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Linguísticos 3(2):149–174, URL https://revistas.uam.es/chimera/article/view/6519
  • Mendes RB, Oushiro L (2012) Mapping Paulistano Portuguese: the SP2010 Project. In: Proceedings of the VIIth GSCP International Conference: Speech and Corpora, Firenze University Press, Firenze, Italy, pp 459–463.
