Pashto Normalization Policy v0.1
This starter policy defines simple, low-risk rules for text cleanup before training ASR/TTS/NLP baselines.
Scope
- Applies to sentence-level text in this repository.
- Prioritizes consistency over linguistic completeness.
- Keeps semantic meaning unchanged.
Rules
- Trim leading and trailing whitespace.
- Collapse repeated internal spaces to a single space.
- Remove zero-width/invisible spacing characters.
- Remove elongation characters such as tatweel (ـ).
- Use Arabic punctuation consistently in Pashto text:
  - comma: ،
  - question mark: ؟
  - semicolon: ؛
- Keep sentence-final punctuation as a single character (avoid !!, ؟؟).
- Normalize quotation usage to one style per sentence (avoid mixed quote styles).
- Normalize digit style to one standard per dataset split (for example, do not mix Western digits 0-9 with Extended Arabic-Indic digits ۰-۹ within a split).
- Preserve original word order and meaning; do not rewrite content.
- Keep dialect wording as spoken; normalize form, not dialect identity.
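The mechanical rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's validator: the function name `normalize_sentence`, the `convert_digits` flag, and the exact set of invisible characters are assumptions, since the policy does not enumerate them. Quotation-style normalization is left out because the policy does not say which style to prefer.

```python
import re

# Assumed set of zero-width/invisible characters (the policy does not list them):
# ZWSP, ZWNJ, ZWJ, LRM, RLM, word joiner, BOM.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u200e\u200f\u2060\ufeff"))

TATWEEL = "\u0640"  # elongation character

# Latin punctuation mapped to the Arabic forms named in the policy.
PUNCT_MAP = str.maketrans({",": "\u060c", "?": "\u061f", ";": "\u061b"})

# Assumption: a split that standardizes on Extended Arabic-Indic digits.
WESTERN_TO_EXTENDED = str.maketrans(
    "0123456789", "".join(chr(0x06F0 + i) for i in range(10))
)


def normalize_sentence(text: str, convert_digits: bool = False) -> str:
    """Apply the whitespace, tatweel, and punctuation rules to one sentence."""
    text = text.translate(INVISIBLE)        # drop invisible characters
    text = text.replace(TATWEEL, "")        # drop tatweel
    text = text.translate(PUNCT_MAP)        # use Arabic comma, question mark, semicolon
    if convert_digits:
        text = text.translate(WESTERN_TO_EXTENDED)
    text = re.sub(r" {2,}", " ", text).strip()      # collapse and trim spaces
    text = re.sub(r"([!\u061f])\1+$", r"\1", text)  # single final ! or question mark
    return text


if __name__ == "__main__":
    print(normalize_sentence("  \u0633\u0644\u0627\u0645\u200c  \u062f\u0646\u06cc\u0640\u0627??  "))
```

Two caveats on this sketch: mapping every Latin comma to the Arabic comma would also touch thousands separators such as 1,000, and removing the zero-width non-joiner follows the rule as written even where a source used it deliberately. Both cases are worth checking against the seed examples before relying on it.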
Non-goals (for v0.1)
- No stemming or morphology rules.
- No automatic transliteration.
- No named-entity rewriting.
File Reference
- Seed examples: data/processed/normalization_seed_v0.1.tsv
- Validator: scripts/validate_normalization.py