An AI feature PRD template must include sections that standard product templates omit entirely: model selection rationale, confidence thresholds and what happens below them, fallback behavior when the model fails, evaluation metrics beyond accuracy, bias and fairness considerations, data pipeline requirements, prompt engineering documentation, and a cost model. This template covers all eight AI-specific sections in addition to the standard 10-section PRD structure.
Why AI features need their own PRD sections
An AI feature has failure modes that don't exist in deterministic software. A traditional feature either works or doesn't. An AI feature can work in aggregate (90% accuracy) while failing systematically for a specific user segment, or degrade silently as the input distribution shifts. None of these failure modes appear in standard PRD templates because they didn't exist before AI features were common.
The AI feature PRD template
Standard sections (1–10)
Use the standard 10-section PRD template as the base. The sections below are additive — add them after section 10 (acceptance criteria).
11. Model selection and rationale
Model: [e.g., Claude claude-sonnet-4-6 / GPT-4o / fine-tuned Llama 3 / internal model]
Why this model: [Cost per inference, latency requirement, context window size, output format control, privacy constraints (on-premise vs. API), benchmark performance on your specific task type]
Model owner: [External API: Anthropic/OpenAI/etc. Internal: ML team contact]
Model version pinning: [Will you pin to a specific model version or track latest? State the policy. Pinned = stable but misses improvements. Latest = may break prompts.]
12. Confidence thresholds and degraded states
Confidence definition: [How is model confidence measured for this feature? Token probabilities, self-consistency across N samples, external classifier, or heuristic?]
High confidence (above threshold): Present AI output directly. No additional indication of uncertainty needed.
Medium confidence (defined range): Present output with a confidence indicator ("AI suggests — verify before using"). Log for review.
Low confidence (below threshold): Do not surface AI output. Fall back to [rule-based alternative / manual input / empty state with human prompt]. Never show a low-confidence AI output as if it were a fact.
Threshold values: [State specific values after evaluation. Do not ship without defining thresholds — "we'll figure it out in staging" is not a plan.]
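The three confidence tiers above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the names `Thresholds`, `Action`, and `route` are hypothetical, the example values 0.85 and 0.60 are placeholders rather than recommendations, and treating a score exactly at a threshold as belonging to the higher tier is an assumption the PRD should state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SHOW = "show"                     # present AI output directly
    SHOW_WITH_WARNING = "warn"        # present with an "AI suggests" indicator, log for review
    FALLBACK = "fallback"             # hide AI output and use the non-AI path


@dataclass(frozen=True)
class Thresholds:
    high: float  # at or above this value: high confidence (boundary choice is an assumption)
    low: float   # below this value: low confidence

    def __post_init__(self):
        # Reject configurations where the tiers overlap or leave the 0..1 range.
        if not 0.0 <= self.low <= self.high <= 1.0:
            raise ValueError("expected 0 <= low <= high <= 1")


def route(confidence: float, thresholds: Thresholds) -> Action:
    """Map a confidence score in [0, 1] to one of the three tiers."""
    if confidence >= thresholds.high:
        return Action.SHOW
    if confidence >= thresholds.low:
        return Action.SHOW_WITH_WARNING
    return Action.FALLBACK


# Placeholder values; replace with the thresholds chosen after evaluation.
example = Thresholds(high=0.85, low=0.60)
print(route(0.72, example))  # Action.SHOW_WITH_WARNING
```

Keeping the thresholds in one validated object makes them easy to quote verbatim in the PRD and to change without touching the routing logic.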
13. Fallback behavior
Model API unavailable: [Fallback to cached output / show graceful error / disable feature with user notification]
Latency timeout (above SLA): [Show partial output / cancel and show retry / use cached result]
Unexpected output format: [Retry with clarification prompt once / log and surface error / use fallback logic]
Cost spike detection: [Circuit breaker if inference cost exceeds $X/hour? State the threshold and the automated action.]
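One way the first three fallback rows could fit together is sketched below. It assumes a hypothetical `call_model` function supplied by your client code that returns a string, raises `TimeoutError` when the latency SLA is exceeded, and raises a hypothetical `ModelUnavailable` error when the API is down. It also assumes the feature expects a JSON object back. The specific choices (cache first, one clarification retry) are just one option per row, not the only valid ones.

```python
import json
from typing import Callable, Optional


class ModelUnavailable(Exception):
    """Hypothetical error a client wrapper raises when the model API is down."""


def generate_with_fallback(
    call_model: Callable[[str], str],
    prompt: str,
    cached: Optional[dict] = None,
    max_format_retries: int = 1,
) -> dict:
    """Return {"source": "model" | "cache" | "error", "output": dict or None}."""
    current_prompt = prompt
    for _ in range(max_format_retries + 1):
        try:
            raw = call_model(current_prompt)
        except (ModelUnavailable, TimeoutError):
            # API down or over the latency SLA: prefer a cached result,
            # otherwise report a graceful error to the caller.
            if cached is not None:
                return {"source": "cache", "output": cached}
            return {"source": "error", "output": None}

        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return {"source": "model", "output": parsed}

        # Unexpected output format: retry with a clarification appended.
        current_prompt = prompt + "\n\nRespond with a single valid JSON object only."

    # Retries exhausted: the caller logs this and shows the error state.
    return {"source": "error", "output": None}
```

Returning the `source` alongside the output lets the UI label cached or degraded results instead of presenting them as fresh model output.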
14. Evaluation metrics
Accuracy alone is insufficient. Define how you will measure AI feature quality across multiple dimensions before launch.
| Metric | Definition | Minimum acceptable | Target |
| --- | --- | --- | --- |
| Task success rate | % of outputs the user accepts without editing | 65% | 80% |
| Edit distance | Average edits made to AI output before use | Under 15 words | Under 8 words |
| Hallucination rate | % of outputs containing factually incorrect statements (human eval) | Under 3% | Under 1% |
| Latency (p95) | Time from request to complete output | Under 8 seconds | Under 4 seconds |
| User thumbs-down rate | Explicit negative feedback / total uses | Under 10% | Under 4% |
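The metrics above can be computed from interaction logs and checked against the minimum-acceptable column as a launch gate. The sketch below assumes a hypothetical `Interaction` record with one field per metric, uses the nearest-rank method for p95, and reads "Under X" as a strict less-than. None of this is a fixed schema, so adapt the fields to whatever your logging actually captures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    # Hypothetical log record; field names are illustrative.
    accepted_unedited: bool   # user accepted the output without editing
    words_edited: int         # words changed before the output was used
    hallucinated: bool        # flagged as factually incorrect in human eval
    latency_s: float          # request to complete output, in seconds
    thumbs_down: bool         # explicit negative feedback


def p95(values: List[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def summarize(logs: List[Interaction]) -> dict:
    n = len(logs)
    return {
        "task_success_rate": sum(i.accepted_unedited for i in logs) / n,
        "avg_words_edited": sum(i.words_edited for i in logs) / n,
        "hallucination_rate": sum(i.hallucinated for i in logs) / n,
        "latency_p95_s": p95([i.latency_s for i in logs]),
        "thumbs_down_rate": sum(i.thumbs_down for i in logs) / n,
    }


# Minimum-acceptable column from the table: (limit, direction).
MINIMUMS = {
    "task_success_rate": (0.65, "at_least"),
    "avg_words_edited": (15, "under"),
    "hallucination_rate": (0.03, "under"),
    "latency_p95_s": (8.0, "under"),
    "thumbs_down_rate": (0.10, "under"),
}


def failing_metrics(summary: dict) -> List[str]:
    """Names of metrics that miss their minimum-acceptable value."""
    failed = []
    for name, (limit, direction) in MINIMUMS.items():
        value = summary[name]
        ok = value >= limit if direction == "at_least" else value < limit
        if not ok:
            failed.append(name)
    return failed
```

An empty list from `failing_metrics(summarize(logs))` means every minimum is met. A non-empty list names exactly which rows of the table block launch.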
15. Bias and fairness considerations
User groups at risk of differential performance: [Non-English language inputs, names from underrepresented groups, industry-specific jargon not in training data, etc.]
Evaluation plan: [Test accuracy on stratified samples across demographic groups before launch. State who owns this evaluation and when it must be completed.]
Disclosure: [Will users be informed that AI generated this output? State the disclosure requirement and where it appears in the UI.]
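The stratified evaluation can be as simple as computing accuracy per group and flagging any group that trails the overall figure. The sketch below uses made-up data (not real results) to show how a feature can reach 90% accuracy overall while one segment sits at 54%. The group labels, the sample sizes, and the 5-point `max_gap` are all illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group_label, is_correct) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def overall_accuracy(records):
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


def lagging_groups(records, max_gap=0.05):
    """Groups whose accuracy trails the overall accuracy by more than max_gap."""
    records = list(records)
    overall = overall_accuracy(records)
    return {
        group: acc
        for group, acc in accuracy_by_group(records).items()
        if overall - acc > max_gap
    }


# Fabricated illustration: 900 English inputs, 100 non-English inputs.
SAMPLE = (
    [("en", True)] * 846 + [("en", False)] * 54
    + [("non_en", True)] * 54 + [("non_en", False)] * 46
)

print(overall_accuracy(SAMPLE))  # 0.9
print(lagging_groups(SAMPLE))    # {'non_en': 0.54}
```

The aggregate number hides the gap because the lagging group is only 10% of the sample, which is why the evaluation plan needs per-group numbers and a named owner.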
16. Data pipeline requirements
Input data: [Source, format, volume, PII handling. If user data is sent to an external model API, state what data, what the API provider's data retention policy is, and whether a DPA is in place.]
Training / fine-tuning data: [Not applicable for v1 / Required: specify source, labeling process, and data governance]
Feedback loop: [How will thumbs-up/thumbs-down data be collected, stored, and used to improve the model over time?]
Data retention: [How long are AI inputs and outputs stored? Who can access them? Under what circumstances?]
17. Prompt engineering documentation
System prompt: [The exact system prompt used. Version-controlled. Changes require PM sign-off because prompt changes are effectively feature changes.]
Prompt variables: [Which parts of the prompt are dynamic? What injects into them? What is the max token budget for injected content?]
Prompt change process: [Prompt changes require: eval on golden dataset, review from PM + ML lead, staged rollout with monitoring. Do not treat prompt changes as "just text edits."]
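To make "version-controlled" and "max token budget" concrete, here is a minimal sketch of a prompt template with a content-derived version ID and a cap on injected content. The prompt text, the variable name `ticket_text`, and the budget of 50 are hypothetical. Counting whitespace-separated words is only a rough stand-in for tokens, so a real implementation would use the tokenizer that matches your model.

```python
import hashlib
from string import Template

# Hypothetical system prompt with one dynamic variable.
SYSTEM_PROMPT = Template(
    "You summarize support tickets in two sentences.\nTicket:\n$ticket_text"
)

# Placeholder budget for injected content, counted in words as a token proxy.
MAX_INJECTED_TOKENS = 50


def prompt_version(template: Template) -> str:
    """Short hash of the template text; changes whenever the prompt changes."""
    digest = hashlib.sha256(template.template.encode("utf-8")).hexdigest()
    return digest[:12]


def render(ticket_text: str) -> str:
    """Fill the template, truncating injected content to the budget."""
    words = ticket_text.split()
    if len(words) > MAX_INJECTED_TOKENS:
        words = words[:MAX_INJECTED_TOKENS]
    return SYSTEM_PROMPT.substitute(ticket_text=" ".join(words))
```

Logging `prompt_version(SYSTEM_PROMPT)` with every request ties each output back to the exact prompt that produced it, which is what makes a staged rollout of a prompt change measurable.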
18. Cost model
Cost per inference: [(Estimated tokens in × input price per token) + (estimated tokens out × output price per token) = cost per request]
Monthly cost estimate: [Current DAU × requests/user/day × 30 days × cost per inference]
Cost guard: [Maximum monthly inference cost before feature is automatically throttled or disabled. Who owns the budget alert?]
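The cost arithmetic is worth writing down explicitly, since input and output tokens are typically priced separately. The sketch below assumes per-million-token pricing and a 30-day month. The prices, token counts, DAU, and the 80% throttle point are placeholders for illustration, not any provider's actual rates or a recommended policy.

```python
def cost_per_request(tokens_in, tokens_out, price_in_per_mtok, price_out_per_mtok):
    """Cost of one request, with prices quoted per million tokens."""
    return (tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok) / 1_000_000


def monthly_cost(dau, requests_per_user_per_day, per_request, days=30):
    """Projected monthly spend at current usage."""
    return dau * requests_per_user_per_day * days * per_request


def guard_action(month_to_date_cost, monthly_cap, throttle_at=0.8):
    """Automated action for the cost guard: 'ok', 'throttle', or 'disable'."""
    if month_to_date_cost >= monthly_cap:
        return "disable"
    if month_to_date_cost >= throttle_at * monthly_cap:
        return "throttle"
    return "ok"


# Placeholder inputs: 1,500 tokens in, 500 out, $3 and $15 per million tokens.
per_request = cost_per_request(1500, 500, 3.0, 15.0)   # about $0.012
projected = monthly_cost(10_000, 3, per_request)        # about $10,800
print(guard_action(projected, monthly_cap=12_000))      # throttle
```

With these placeholder numbers the projection already sits at 90% of a $12,000 cap, so the guard would throttle before launch traffic grows, which is the kind of result the PRD should surface before the budget owner is surprised by it.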