PRD-100·May 14, 2026

Notion Workers

Executive Brief

Builders creating Notion Worker skills face a blank-page problem: they must manually define input schemas, output structures, trigger conditions, and error handling for each new agent skill. This stalls development. Builders spend 4.3 hours per skill spec on average (n=112 skill creation logs, Q2 2024), time that could otherwise go toward deploying 3-4 additional skills per month. At 220 active builders creating 1.2 skills/month and a blended engineering cost of $92/hour, this costs $104,438 per month in lost productivity (220 × 1.2 × 4.3 × $92).

Business case: 220 builders × 1.2 skills/month × 4.3 hours/skill × $92/hour × 12 months = $1.25M/year recoverable (source: Notion Workers user analytics, engineering compensation benchmarks). If adoption reaches 40% of target: $500K/year. This feature automates skill spec generation via natural language input, cutting spec creation time by ≥70%. It is a deterministic generator for structured skill definitions; it is not a runtime agent, code generator, or infrastructure manager.
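The arithmetic behind the business case can be reproduced with a short script. This is a minimal sketch that uses only the inputs quoted in this brief; the function and constant names are illustrative, not part of any Notion tooling.

```python
# Inputs quoted in the Executive Brief.
BUILDERS = 220
SKILLS_PER_BUILDER_PER_MONTH = 1.2
HOURS_PER_SPEC = 4.3
COST_PER_HOUR_USD = 92
ADOPTION_SCENARIO = 0.40  # the 40% adoption case from the brief


def monthly_cost(builders, skills_per_month, hours_per_spec, cost_per_hour):
    """Monthly spend on manual spec writing, in USD."""
    return builders * skills_per_month * hours_per_spec * cost_per_hour


monthly = monthly_cost(
    BUILDERS, SKILLS_PER_BUILDER_PER_MONTH, HOURS_PER_SPEC, COST_PER_HOUR_USD
)
annual = monthly * 12
annual_at_adoption = annual * ADOPTION_SCENARIO

print(f"monthly: ${monthly:,.0f}")
print(f"annual: ${annual:,.0f}")
print(f"annual at 40% adoption: ${annual_at_adoption:,.0f}")
```

With these inputs the monthly figure comes to about $104,438, the annual figure to about $1.25M, and the 40% adoption case to about $0.5M.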

Without this, builders face mounting opportunity costs. Competitors such as Zapier’s AI Builder and Microsoft Copilot Studio already automate workflow creation, which puts Notion Workers at risk of user migration. By treating skill creation like installation (one command → deployable artifact), we free up builders' cognitive bandwidth for high-value agent design.

Success Metrics

Primary Metrics

Metric	Baseline	Target (D90)	Kill Threshold	Method
Spec creation time	4.3 hrs	≤0.75 hrs	>1.5 hrs	Workflow telemetry
Skill abandonment rate	28%	≤10%	>22%	Registry submission logs
Schema rework rate	1.8/skill	≤0.2/skill	>0.8/skill	Support tickets
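The target and kill columns above translate directly into a status check per metric. The sketch below assumes every primary metric is "lower is better" (true for the three rows in the table); the `MetricGate` type and the "watch" label for values between target and kill threshold are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricGate:
    name: str
    target_max: float  # D90 target: observed value must be <= this
    kill_above: float  # kill threshold: observed value > this


# Thresholds copied from the Primary Metrics table.
GATES = [
    MetricGate("Spec creation time (hrs)", 0.75, 1.5),
    MetricGate("Skill abandonment rate (%)", 10.0, 22.0),
    MetricGate("Schema rework rate (per skill)", 0.2, 0.8),
]


def classify(gate: MetricGate, observed: float) -> str:
    """Return 'on_target', 'watch', or 'kill' for an observed D90 value."""
    if observed <= gate.target_max:
        return "on_target"
    if observed > gate.kill_above:
        return "kill"
    return "watch"  # assumed label for the band between target and kill


# Example: a D90 reading of 1.1 hours per spec misses the target
# but stays under the kill threshold.
print(classify(GATES[0], 1.1))
```

Note that the current baselines (for example, 28% abandonment) already sit above the kill thresholds, so the check is only meaningful for post-launch readings.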

Guardrail Metrics

Guardrail	Threshold	Action if Breached
Runtime errors from generated specs	<0.1% of invocations	Disable generator; require manual review
Builder trust score	≥4.2/5.0	Interview cohort; fix top 3 friction points

What We Are NOT Measuring

Total skills generated (vanity metric; doesn’t reflect quality or adoption).
Raw prompt count (measures activity, not reduced cognitive load).
First-pass approval rate (could incentivize under-explanation).

Open Questions

Strategic Decisions Log
Decision: Should generated specs allow direct editing?
Choice: Edit-locked output during review; any edit requires a new generation cycle.
Rationale: Prevents inconsistent state; forces intent refinement at source. Rejected: Inline editing creates versioning nightmares.

Decision: How to handle proprietary skill patterns?
Choice: Never train on customer-generated specs without explicit opt-in.
Rationale: Maintain trust boundary. Rejected: "Anonymized" collection still risks IP leakage.

Decision: Fallback for unsupported triggers?
Choice: Default to manual trigger + warning: "This may require custom code".
Rationale: Safer than misconfiguring webhooks. Rejected: Blocking deployment caused user frustration in tests.

Pre-Mortem
It is 6 months from now and this feature has failed. The 3 most likely reasons are:

Builders treated generated specs as "final" without review, causing production incidents that eroded trust.
We prioritized edge-case coverage over core use cases, making the generator feel slower than manual creation.
Zapier launched a free "Import from Notion" tool the same week, capturing our builder base.

Success looks like: Builders describe skills during coffee breaks. Support tickets for schema errors drop by 75%. The CEO cites it as "how AI

Model Goals & KPIs

Quantified Baseline

Metric	Measured Baseline
Skill spec creation time	4.3 hours/skill (n=112, user telemetry)
Skill deployment abandonment rate	28% at spec stage (n=87 surveyed)
Manual schema errors requiring rework	1.8 incidents/skill (support ticket analysis)

Recoverable value: 220 builders × 1.2 skills/month × 4.3 hrs × $92/hr × 12 = $1.25M/year.

Core Objectives

Generate full skill specs (I/O schema, triggers, error handling) from ≤100-word natural language prompts with ≥95% structural validity.
Reduce spec creation time to ≤15 minutes for 90% of common use cases (DB sync, content moderation, data enrichment).
Output registry-ready READMEs requiring ≤2 edits for clarity.

Evaluation Framework

P0 Dimensions (Launch-Blocking)

Input schema validity: 100% compliance with Notion’s JSON Schema spec (zero invalid skill deployments).
Dependency mapping: 100% accuracy in identifying required Notion DB properties.

P1 Dimensions

Output structure correctness: ≥99.5% valid YAML syntax across generated specs.
Trigger condition accuracy: ≥98% match between described intent and generated trigger logic (webhook/scheduled/manual).

P2 Dimensions

README clarity: ≥95% of users rate auto-generated docs "usable with ≤2 edits" (5-point Likert scale).
Fallback logic coverage: ≥90% of skills include context-appropriate error handling (retry/notify/abort).

Validation Protocol

Test set: 250 real-world skill descriptions from community forum posts.
Failure mode: If schema validity falls below 100% in staging, revert to human-in-the-loop review.
Validator: QA team (L4+) using schema validator + manual spot checks (n=50 minimum).
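The staging gate and spot-check sampling in this protocol could be scripted roughly as follows. This is a sketch under stated assumptions: each test-set item is reduced to a pass/fail boolean from the schema validator, and the mode names and function names are hypothetical.

```python
import random

REQUIRED_VALIDITY = 1.0  # P0: 100% schema validity in staging
SPOT_CHECK_MIN = 50      # minimum manual spot checks from the protocol


def staging_decision(results):
    """results: list of bools, True if a generated spec passed the validator.

    Returns (validity_rate, mode). Mode names are illustrative only.
    """
    if not results:
        raise ValueError("no validation results supplied")
    validity = sum(results) / len(results)
    if validity >= REQUIRED_VALIDITY:
        return validity, "standard_flow"
    return validity, "human_in_the_loop_review"


def spot_check_sample(spec_ids, seed=0):
    """Pick up to SPOT_CHECK_MIN specs for manual review, reproducibly."""
    rng = random.Random(seed)
    k = min(len(spec_ids), SPOT_CHECK_MIN)
    return rng.sample(spec_ids, k)


# Example: one invalid spec out of 250 is enough to trip the gate.
results = [True] * 249 + [False]
print(staging_decision(results))
print(len(spot_check_sample(list(range(250)))))
```

A fixed seed keeps the spot-check sample reproducible across QA reruns; whether that is desirable is a judgment call for the QA team.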

Human-in-the-Loop Design

Review Gates

Pre-deployment review: Builders must manually approve generated specs before registry submission. UI enforces "Confirm & Deploy" step with diff view showing changes vs initial prompt.
Ambiguity escalation: When confidence score <85%, generator surfaces specific clarification questions (e.g., "Should this webhook retry 3x or 5x on timeout?").
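The ambiguity escalation gate above can be illustrated with a small filter. The sketch assumes the generator attaches a confidence score and a prepared clarification question to each drafted field; this document does not specify whether confidence is scored per field or per spec, so `FieldDraft` and its attributes are hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # escalate when confidence is below 85%


@dataclass
class FieldDraft:
    name: str          # spec field the generator filled in
    value: str         # drafted value
    confidence: float  # assumed to be in the range 0.0-1.0
    question: str      # clarification to show if confidence is low


def clarification_questions(drafts):
    """Return the questions to surface before the 'Confirm & Deploy' step."""
    return [d.question for d in drafts if d.confidence < CONFIDENCE_THRESHOLD]


drafts = [
    FieldDraft("trigger", "row tagged 'Urgent'", 0.95,
               "Which tag should start this skill?"),
    FieldDraft("retry_count", "3", 0.60,
               "Should this webhook retry 3x or 5x on timeout?"),
]
for q in clarification_questions(drafts):
    print(q)
```

A score exactly at the threshold is not escalated here, matching the "<85%" wording.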

Before/After Narrative
Before: Priya (product designer) spends 3 hours drafting a skill to flag high-priority customer feedback. She misdefines the input schema, causing runtime failures. After 2 support tickets and 8 hours lost, she abandons the skill.
After: Priya types: "Watch Customer DB for ‘Urgent’ tagged rows. If found, message Slack channel CX-Alerts with row link. Retry twice if Slack fails." In 12 minutes, she reviews and deploys a working spec.

Trust & Guardrails

Risk Register
Risk: Hallucinated DB dependencies cause skill runtime failures
Probability: Medium | Impact: High
Mitigation: Cross-check generated DB properties against user’s actual workspace (owner: Backend team by 9/30). If property missing, prompt builder to add it pre-deployment.

Risk: Malicious actors craft prompts to generate harmful skills
Probability: Low | Impact: Critical
Mitigation: Pre-filter inputs via deny-list (PII extraction, unethical actions) + post-generation audit for policy violations (owner: Trust & Safety, daily log review).

Risk: Over-reliance on auto-gen specs reduces builder schema literacy
Probability: High | Impact: Medium
Mitigation: Embed "Learn why this schema works" tooltips in review UI linking to documentation (owner: DevEx by launch).
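The workspace cross-check named in the first mitigation above amounts to a set difference between the properties a generated spec references and the properties that actually exist. A minimal sketch, assuming both sides are available as plain dictionaries; the database and property names are made up for illustration, and fetching the real workspace schema is out of scope here.

```python
def missing_properties(required, workspace_schema):
    """Find DB properties a generated spec references that do not exist.

    required: dict mapping DB name -> set of property names the spec uses.
    workspace_schema: dict mapping DB name -> set of existing property names.
    Returns a dict mapping DB name -> sorted list of missing property names.
    """
    missing = {}
    for db_name, props in required.items():
        existing = workspace_schema.get(db_name, set())
        absent = sorted(set(props) - set(existing))
        if absent:
            missing[db_name] = absent
    return missing


# Hypothetical example: the spec expects a "Priority" property that the
# builder's workspace does not have yet.
spec_needs = {"Customer DB": {"Tags", "Priority"}}
workspace = {"Customer DB": {"Tags", "Name"}}
print(missing_properties(spec_needs, workspace))
```

A non-empty result would be the cue to prompt the builder to add the missing properties before deployment. A database that is absent from the workspace reports all of its referenced properties as missing.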

Kill Criteria

>0.5% of production skills fail due to schema errors in D30
Builder time savings <50% of target at D90
Fallback logic coverage <80% in high-severity use cases (payment ops, security)

Bias & Risk Mitigation

Failure Modes & Mitigations

Language bias: Model performs poorly on non-English prompts.
Mitigation: Non-English prompts trigger "Translation confidence low" warning + option for manual spec override.
Complexity bias: Over-simplifies advanced skills (e.g., multi-DB joins).
Mitigation: Auto-detect compound dependencies → recommend breaking into sub-skills with dependency mapping.
Default bias: Falls back to "notify admin" for all errors.
Mitigation: Risk-classification layer — timeout errors suggest retries; auth errors suggest abort.
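The risk-classification layer in the last mitigation can be pictured as a lookup from error class to suggested fallback. Only the two mappings named above (timeout to retry, auth to abort) come from this document; the error-class strings and the choice to return `None` for unclassified errors, so the builder must pick explicitly rather than silently defaulting to "notify admin", are assumptions.

```python
from enum import Enum
from typing import Optional


class Fallback(Enum):
    RETRY = "retry"
    NOTIFY = "notify"
    ABORT = "abort"


# Mappings stated in the mitigation; further classes would be added here.
ERROR_CLASS_TO_FALLBACK = {
    "timeout": Fallback.RETRY,
    "auth": Fallback.ABORT,
}


def suggest_fallback(error_class: str) -> Optional[Fallback]:
    """Suggest a fallback for an error class, or None if unclassified.

    Returning None (an assumption of this sketch) forces an explicit
    builder decision instead of a blanket default.
    """
    return ERROR_CLASS_TO_FALLBACK.get(error_class.strip().lower())


print(suggest_fallback("timeout"))
print(suggest_fallback("auth"))
print(suggest_fallback("rate_limit"))
```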

Validation

Test with 50 non-English prompts (JA/ES/FR) measuring schema validity parity.
Audit 20% of generated fallback logic for context appropriateness (T&S lead).