📚 Dataset Guidelines

🏷️ Minimum metadata

  • Speaker ID (anonymized)
  • Approximate age band
  • Gender (optional/self-declared)
  • Dialect/region
  • Recording environment and device class
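
One way to enforce the minimum metadata is a small record type with a completeness check. This is a minimal sketch, not a prescribed schema: the field names, the `AGE_BANDS` values, and the `missing_fields` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical age bands; replace with the bands the project agrees on.
AGE_BANDS = {"18-29", "30-44", "45-59", "60+"}


@dataclass
class RecordingMetadata:
    speaker_id: str               # anonymized ID, never a real name
    age_band: str                 # approximate band, not an exact age
    dialect_region: str
    environment: str              # e.g. "quiet room", "street"
    device_class: str             # e.g. "smartphone", "headset"
    gender: Optional[str] = None  # optional and self-declared


def missing_fields(meta: RecordingMetadata) -> List[str]:
    """Return the names of required fields that are empty or invalid."""
    problems = []
    for name in ("speaker_id", "dialect_region", "environment", "device_class"):
        if not getattr(meta, name).strip():
            problems.append(name)
    if meta.age_band not in AGE_BANDS:
        problems.append("age_band")
    return problems
```

Gender is left out of the required check because the guideline marks it as optional.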

🎧 Audio quality basics

  • Prefer clean speech sampled at 16 kHz or higher
  • Avoid clipping and heavy background noise
  • Keep each transcript aligned with the spoken content
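
The sample-rate and clipping points can be checked automatically. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input; the `MAX_CLIPPED_FRACTION` threshold and the function names are assumptions, not part of these guidelines. It does not measure background noise or transcript alignment, which need other tools or manual review.

```python
import sys
import wave
from array import array

MIN_SAMPLE_RATE = 16_000       # from the guideline: 16 kHz or higher
CLIP_LEVEL = 32_767            # full scale for 16-bit PCM
MAX_CLIPPED_FRACTION = 0.001   # assumed threshold; tune per project


def clipped_fraction(samples) -> float:
    """Fraction of samples at or beyond 16-bit full scale."""
    if len(samples) == 0:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= CLIP_LEVEL)
    return clipped / len(samples)


def check_wav(path: str) -> list:
    """Return a list of human-readable problems found in a WAV file."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
    if width != 2:
        problems.append("expected 16-bit PCM; clipping check skipped")
        return problems
    samples = array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    fraction = clipped_fraction(samples)
    if fraction > MAX_CLIPPED_FRACTION:
        problems.append(f"{fraction:.2%} of samples are clipped")
    return problems
```

An empty list from `check_wav` means the file passed both checks under these assumptions.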

✍️ Text policy

  • Use the agreed normalization rules
  • Keep punctuation consistent
  • Track alternate spellings in a glossary
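
As an illustration of the text policy, the following sketch applies a few example normalization rules and maps alternate spellings to one canonical form through a glossary. The specific rules (Unicode NFC, straight quotes, collapsed whitespace) and the `GLOSSARY` entries are hypothetical placeholders; the real rules are whatever the project has agreed on.

```python
import re
import unicodedata

# Hypothetical glossary: alternate spelling -> canonical spelling.
GLOSSARY = {"e-mail": "email", "colour": "color"}

# Match longer keys first so overlapping entries resolve predictably.
_GLOSSARY_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(GLOSSARY, key=len, reverse=True))
    + r")\b"
)


def normalize_transcript(text: str) -> str:
    """Apply example normalization rules, then glossary replacements."""
    text = unicodedata.normalize("NFC", text)
    # Keep punctuation consistent: curly quotes become straight quotes.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Replace alternate spellings (case-sensitive in this sketch).
    return _GLOSSARY_PATTERN.sub(lambda m: GLOSSARY[m.group(0)], text)
```

Matching is case-sensitive here, so capitalized variants would need their own glossary entries or a case-aware replacement.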