Common Voice Scripted Speech 24.0 - Pashto Integration Guide

This project recognizes Mozilla Common Voice as a major driver of Pashto ASR progress and a key channel for community participation.

Dataset

Important Usage Rules

  • Do not attempt to identify speakers.
  • Do not re-host or re-share the raw dataset files.
  • Keep provenance and version information when reporting experiments.
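One possible way to keep provenance alongside reported results is to attach a small record to each experiment. The sketch below is only an illustration: the function name and the field names are hypothetical and are not a format this project defines. The dataset name and local path are taken from this guide.

```python
import json
from datetime import date


def provenance_record(split, notes=""):
    """Build a small provenance record to attach to an experiment report.

    The field names are illustrative assumptions, not a project standard.
    """
    return {
        "dataset": "Common Voice Scripted Speech 24.0 - Pashto",
        "source": "Mozilla Data Collective",
        "local_path": "data/raw/common_voice_scripted_ps_v24/",
        "split": split,                          # e.g. "train", "dev", "test"
        "recorded_on": date.today().isoformat(),
        "notes": notes,
    }


if __name__ == "__main__":
    # Print the record as JSON so it can be pasted into a report or issue.
    print(json.dumps(provenance_record("test"), indent=2))
```

The record deliberately holds only dataset-level information and no speaker-level fields, in keeping with the rule against identifying speakers.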

How To Use In This Repository

  1. Download the dataset from the official Mozilla Data Collective page.
  2. Extract it locally under: data/raw/common_voice_scripted_ps_v24/
  3. Keep raw audio out of git.
  4. Use the project's scripts and docs for normalization, splits, and benchmarking.
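Step 3 can be spot-checked before committing. The following is a minimal sketch, assuming git is installed and the code is run from the repository root. The helper names are hypothetical and this is not one of the project's own scripts. It relies only on the standard `git ls-files -z` command, which lists tracked paths separated by NUL bytes.

```python
import subprocess

RAW_PREFIX = "data/raw/"  # raw dataset location used in this guide


def tracked_under(paths, prefix=RAW_PREFIX):
    """Return the tracked paths that live under the raw-data prefix."""
    return [p for p in paths if p.startswith(prefix)]


def git_tracked_paths():
    """List files tracked by git in the current repository.

    The -z flag separates entries with NUL bytes and disables path quoting,
    so non-ASCII file names are returned unchanged.
    """
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [p for p in out.split("\0") if p]
```

Calling `tracked_under(git_tracked_paths())` should return an empty list when no raw files are tracked. Any path it returns is a file that has been added to git and should be removed from the index.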

Recommended local structure:

data/raw/common_voice_scripted_ps_v24/
  clips/
  train.tsv
  dev.tsv
  test.tsv
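
A quick check that an extracted copy matches the structure above could look like the sketch below. It assumes only the entries listed in this guide (clips/, train.tsv, dev.tsv, test.tsv). The function name is hypothetical, and the actual archive may contain additional files that this check ignores.

```python
from pathlib import Path

EXPECTED_DIRS = ("clips",)
EXPECTED_FILES = ("train.tsv", "dev.tsv", "test.tsv")


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    root = "data/raw/common_voice_scripted_ps_v24"
    problems = missing_entries(root)
    if problems:
        print("Missing under " + root + ": " + ", ".join(problems))
    else:
        print("Layout looks complete.")
```

The check only confirms that the entries exist. It does not open the audio or validate the contents of the TSV files.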

How To Contribute Through Mozilla Common Voice

Contributors can directly improve Pashto resources on Common Voice by speaking, writing, listening, and reviewing.

Contribution Loop Back To This Project

After contributing on Common Voice, open an issue or PR in this repository and share:

  • what task you worked on (speak/write/listen/review),
  • what quality gaps you observed,
  • what dataset or modeling step should be improved next.

Use these issue labels:

  • data
  • good first issue
  • help wanted