📚 Dataset Guidelines
🏷️ Minimum metadata
- Speaker ID (anonymized)
- Approximate age band
- Gender (optional/self-declared)
- Dialect/region
- Recording environment and device class
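One way to carry these fields is a small per-recording record. The sketch below is illustrative only: the class name `RecordingMetadata`, the field names, the `spk_` prefix, and the salted-hash anonymization are assumptions, not part of these guidelines.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical container for the minimum metadata fields."""
    speaker_id: str                # anonymized ID, never a real name
    age_band: str                  # approximate band, e.g. "25-34"
    dialect_region: str
    environment: str               # e.g. "quiet room", "street"
    device_class: str              # e.g. "smartphone", "headset"
    gender: Optional[str] = None   # optional, self-declared


def anonymize_speaker(raw_identifier: str, salt: str) -> str:
    """Derive a stable pseudonymous ID from a private identifier.

    Assumption: a salted SHA-256 prefix is acceptable for the project;
    keep the salt secret so IDs cannot be recomputed by outsiders.
    """
    digest = hashlib.sha256((salt + raw_identifier).encode("utf-8")).hexdigest()
    return "spk_" + digest[:12]


def missing_fields(meta: RecordingMetadata) -> List[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = ["speaker_id", "age_band", "dialect_region",
                "environment", "device_class"]
    return [name for name in required if not getattr(meta, name).strip()]
```

Gender is left out of the required list on purpose, since the guideline marks it as optional.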
🎧 Audio quality basics
- Prefer clean speech sampled at 16 kHz or higher
- Avoid clipping and heavy background noise
- Keep each transcript aligned with the spoken content
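The sample-rate and clipping points can be checked automatically. This is a minimal sketch using only the Python standard library, assuming 16-bit PCM WAV input; the function names and the 0.1% clipping threshold are illustrative choices, not values set by these guidelines. Background noise and transcript alignment are not covered here.

```python
import array
import sys
import wave

MIN_SAMPLE_RATE_HZ = 16_000   # from the "16 kHz or higher" guideline
INT16_MAX = 32767             # full-scale value for 16-bit PCM


def clipping_ratio(samples) -> float:
    """Fraction of samples sitting at (or beyond) 16-bit full scale."""
    if len(samples) == 0:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= INT16_MAX)
    return clipped / len(samples)


def check_wav(path, max_clipping=0.001):
    """Return a list of human-readable issues found in a 16-bit PCM WAV.

    max_clipping is an assumed threshold (0.1% of samples).
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())
    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")

    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()    # WAV data is little-endian

    issues = []
    if rate < MIN_SAMPLE_RATE_HZ:
        issues.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    ratio = clipping_ratio(samples)
    if ratio > max_clipping:
        issues.append(f"{ratio:.2%} of samples are clipped")
    return issues
```

An empty list from `check_wav` means neither check fired; it does not prove the recording is clean.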
✍️ Text policy
- Use the agreed normalization rules
- Keep punctuation consistent
- Track alternate spellings in a glossary
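A glossary of alternate spellings can drive normalization directly. The sketch below assumes a simple lowercase lookup table; the glossary entries, the NFC and whitespace rules, and the function name `normalize_transcript` are hypothetical stand-ins for whatever rules the project has agreed on.

```python
import re
import unicodedata

# Hypothetical glossary: alternate spelling -> preferred spelling.
GLOSSARY = {
    "colour": "color",
    "e-mail": "email",
}


def normalize_transcript(text: str, glossary=GLOSSARY) -> str:
    """Apply example normalization rules to one transcript line."""
    # Use one Unicode form so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()

    def replace(match):
        word = match.group(0)
        # Lookup is case-insensitive; a replaced word takes the
        # glossary's casing, so "Colour" becomes "color".
        return glossary.get(word.lower(), word)

    # Tokens are runs of word characters, apostrophes, and hyphens;
    # punctuation around them is left untouched.
    return re.sub(r"[\w'-]+", replace, text)
```

For example, `normalize_transcript("  The colour   of e-mail ")` returns `"The color of email"`.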