Speech-to-Text Data Collection

Table of contents

Introduction
Pipeline
Architecture
Project Structure
Installation

Introduction

The purpose of this project is to build a data engineering pipeline for recording millions of Amharic and Swahili speakers as they read digital texts on in-app and web platforms. The text source is the "Amharic news text classification dataset with baseline performance". The goal is a deployable tool that posts text and audio files to a data lake and receives them from it, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model.

Pipeline

The project's pipeline is designed to record millions of Amharic and Swahili speakers reading digital texts on in-app and web platforms.
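As an illustration of the kind of check a transformation step might run before loading a recording into the warehouse, here is a minimal sketch using only the Python standard library. It is not taken from the repository: the `Recording` class, the `is_valid` function, the WAV encoding, and the 1 to 30 second duration bounds are all assumptions made for the example.

```python
import io
import wave
from dataclasses import dataclass


@dataclass
class Recording:
    """Hypothetical record pairing a text prompt with a speaker's audio."""
    text_id: str
    text: str
    audio_bytes: bytes  # assumed to be WAV-encoded


def wav_duration_seconds(audio_bytes: bytes) -> float:
    """Return the duration of a WAV payload in seconds."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid(rec: Recording, min_seconds: float = 1.0,
             max_seconds: float = 30.0) -> bool:
    """Accept a recording only if it has text and a plausible duration.

    The duration bounds are illustrative assumptions, not project settings.
    """
    if not rec.text.strip():
        return False
    try:
        duration = wav_duration_seconds(rec.audio_bytes)
    except (wave.Error, EOFError, ZeroDivisionError):
        # Not a readable WAV payload.
        return False
    return min_seconds <= duration <= max_seconds


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV payload, used here only for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    sample = Recording("t1", "habari", make_silent_wav(2.0))
    print(is_valid(sample))
```

In a real deployment a check like this would more likely run inside a distributed job than on a single record, but the per-record logic could stay the same.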

Project Structure

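The repository contains Python scripts, Jupyter notebooks, and text files. Its top-level directories are airflow, airflow_dags, data, frontend, images, notebooks, reports, scripts, and server, along with .dvc and .github/workflows. The root also holds tests.py and three Docker Compose files: docker-compose.yml, docker-compose.kafka.yml, and docker-compose.kafka.multi-broker.yml.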

Installation

Clone the repository and change into its directory:

git clone https://github.com/STT-Data-Engineering/Speech_to_text
cd Speech_to_text

Contributors

Made with contrib.rocks

