Pashto Normalization Policy v0.1

This starter policy defines simple, low-risk rules for text cleanup before training ASR/TTS/NLP baselines.

Scope

Trim leading and trailing whitespace.
Collapse repeated internal spaces to a single space.
Remove zero-width and other invisible spacing characters (for example, U+200B zero width space and U+FEFF byte order mark).
Remove elongation characters such as tatweel (ـ).
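
The four spacing rules above could be combined into one pass. The sketch below is illustrative only: the function name `clean_spacing` and the exact set of invisible characters are assumptions, since the policy does not enumerate them. Note that U+200C (zero width non-joiner) can carry orthographic meaning in some Arabic-script text, so drop it from the set if your data relies on it.

```python
import re

# Assumed list of invisible characters; the policy itself does not enumerate them.
# U+200B zero width space, U+200C ZWNJ, U+200D ZWJ, U+2060 word joiner, U+FEFF BOM.
INVISIBLE_CHARS = "\u200b\u200c\u200d\u2060\ufeff"
TATWEEL = "\u0640"

# Translation table that deletes every listed character.
_REMOVE_TABLE = {ord(ch): None for ch in INVISIBLE_CHARS + TATWEEL}
_MULTISPACE = re.compile(r" {2,}")


def clean_spacing(text: str) -> str:
    """Drop invisible characters and tatweel, collapse spaces, trim the ends."""
    text = text.translate(_REMOVE_TABLE)
    # Collapse after deletion so that "a \u200b b" does not leave a double space.
    text = _MULTISPACE.sub(" ", text)
    return text.strip()
```

Deleting the invisible characters before collapsing spaces matters: a zero-width character sitting between two spaces would otherwise leave a double space behind.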
Use Arabic punctuation consistently in Pashto text:
- comma: ،
- question mark: ؟
- semicolon: ؛
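
A minimal sketch of the punctuation mapping, assuming the Latin counterparts `,`, `?`, and `;` are the only characters that need replacing. The name `to_arabic_punctuation` is hypothetical.

```python
# Latin punctuation mapped to the Arabic forms listed in the policy.
PUNCT_MAP = str.maketrans({
    ",": "\u060c",  # Arabic comma
    "?": "\u061f",  # Arabic question mark
    ";": "\u061b",  # Arabic semicolon
})


def to_arabic_punctuation(text: str) -> str:
    """Replace Latin comma, question mark, and semicolon with Arabic forms."""
    return text.translate(PUNCT_MAP)
```

This replaces every occurrence blindly, so it would also rewrite commas inside numbers such as `1,000` and punctuation inside embedded Latin text or URLs. Data with such content would likely need those spans protected first.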
Collapse repeated sentence-final punctuation to a single character (for example, !! becomes ! and ؟؟ becomes ؟).
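
One possible way to collapse repeated marks is a back-referencing regular expression. The sketch assumes only `!`, `?`, and `؟` count as sentence-final marks, and the name `collapse_final_punctuation` is hypothetical. Periods are deliberately left out so that an ellipsis is not touched.

```python
import re

# A run of the SAME mark repeated two or more times; \u061f is the Arabic question mark.
_REPEATED_MARK = re.compile(r"([!?\u061f])\1+")


def collapse_final_punctuation(text: str) -> str:
    """Reduce runs of an identical mark (for example !! or ؟؟) to one character."""
    return _REPEATED_MARK.sub(r"\1", text)
```

Mixed runs such as `؟!` are left unchanged here, because the policy does not say which mark should win.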
Use one quotation mark style per sentence; do not mix styles (for example, an opening « closed by a straight ").
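
Before quotes can be normalized, mixed usage has to be detected. The sketch below only flags sentences that mix styles. The three style groups and the names `quote_styles_used` and `has_mixed_quotes` are assumptions, since the policy does not name a preferred style.

```python
# Assumed quote style groups; extend as needed for your data.
QUOTE_STYLES = {
    "straight": '"',
    "curly": "\u201c\u201d",      # left and right curly double quotes
    "guillemet": "\u00ab\u00bb",  # « and »
}


def quote_styles_used(sentence: str) -> set:
    """Return the names of all quote styles that appear in the sentence."""
    return {
        name
        for name, chars in QUOTE_STYLES.items()
        if any(ch in sentence for ch in chars)
    }


def has_mixed_quotes(sentence: str) -> bool:
    """True when more than one quote style appears in the same sentence."""
    return len(quote_styles_used(sentence)) > 1
```

Flagged sentences could then be rewritten to whichever style the dataset settles on, or routed to manual review.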
Normalize digits to a single script (Western 0-9, Arabic-Indic ٠-٩, or Extended Arabic-Indic ۰-۹) within each dataset split.
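
Digit normalization can be sketched as a translation between the three digit blocks likely to appear in Pashto text: ASCII (U+0030 to U+0039), Arabic-Indic (U+0660 to U+0669), and Extended Arabic-Indic (U+06F0 to U+06F9). The function name `normalize_digits` and the `target` labels are hypothetical; which target to use for a split is left to the dataset owner.

```python
DIGIT_SETS = {
    "ascii": "0123456789",
    "arabic_indic": "".join(chr(0x0660 + i) for i in range(10)),
    "extended": "".join(chr(0x06F0 + i) for i in range(10)),
}


def normalize_digits(text: str, target: str = "ascii") -> str:
    """Map every digit from the other two sets onto the target set."""
    if target not in DIGIT_SETS:
        raise ValueError(f"unknown digit style: {target!r}")
    target_digits = DIGIT_SETS[target]
    table = {}
    for name, digits in DIGIT_SETS.items():
        if name == target:
            continue
        # Pair each source digit with the target digit of the same value.
        for src, dst in zip(digits, target_digits):
            table[ord(src)] = dst
    return text.translate(table)
```

Calling `normalize_digits(text, target="extended")` on every line of a split would leave that split with Extended Arabic-Indic digits only.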
Preserve original word order and meaning; do not rewrite content.
Keep dialect wording as spoken; normalize form, not dialect identity.