The operating principles
Outcome-first: define the artifact (e.g., 120-word email, JSON with fixed fields, SQL query). If you can’t name it, don’t prompt yet.
Spend where it matters: cheap models for classify/extract/condense; strong model only for final synthesis or tricky reasoning.
Token discipline: cap input/output, ban rambling (“return final result only”), and compress context before you call the strong model.
Schemas over examples: zero-shot + strict schema beats long few-shot for cost and stability.
Draft → Verify → Escalate: one mid-tier draft, a tiny verifier, and escalate to the best model only on failure.
Measure → prune: log tokens, dollars, pass rate, and escalation rate; remove low-value instructions monthly.
A lean, repeatable pipeline
1) Frame the task
Goal (one line), constraints (length, tone, format), budget (max input/output tokens), checks (3–5 pass/fail rules).
Mini-spec
Goal: Produce a 120-word customer apology email.
Constraints: ≤120 words; warm; include order#, ETA, coupon; no excuses.
Budget: max_output_tokens=160
Checks: [length, tone, order#, ETA, coupon]
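Frames like this stay machine-checkable if you capture them as data. A minimal sketch in Python (the field names are illustrative, not a fixed API):

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """One-line goal plus hard constraints, an output budget, and pass/fail checks."""
    goal: str
    constraints: list
    max_output_tokens: int
    checks: list

spec = TaskSpec(
    goal="Produce a 120-word customer apology email.",
    constraints=["<=120 words", "warm tone", "include order#, ETA, coupon", "no excuses"],
    max_output_tokens=160,
    checks=["length", "tone", "order#", "ETA", "coupon"],
)
```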
2) Route to the right model
Small: classify, extract, compress.
Mid: first draft of prose/code.
Best: single pass only if verifier fails or complex reasoning is required.
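Routing can be a plain function rather than anything clever — a sketch, with tier names standing in for your provider's model IDs:

```python
def route(task_type: str, verifier_failed: bool = False) -> str:
    """Pick the cheapest model tier that can handle the task.
    Tier names ("small", "mid", "best") are placeholders for real model IDs."""
    if verifier_failed:
        return "best"          # single escalation pass only
    if task_type in {"classify", "extract", "compress"}:
        return "small"
    return "mid"               # first drafts of prose/code
```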
3) Put hard caps in the prompt
Set max_tokens, add stop sequences if applicable, and use temperature 0–0.3 for deterministic tasks.
Add: “Do not explain your reasoning. Return only the final result in the required format.”
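The caps belong in the request itself, not just in the prose. A sketch of the request parameters, using common chat-API parameter names (adjust for your provider):

```python
def capped_request(prompt: str, max_tokens: int = 160) -> dict:
    """Build request parameters with hard output caps.
    Parameter names follow common chat-API conventions and may differ per provider."""
    return {
        "messages": [
            {"role": "system", "content": "Do not explain your reasoning. "
             "Return only the final result in the required format."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,   # hard output cap
        "stop": ["\n\n---"],        # optional stop sequence
        "temperature": 0.2,         # near-deterministic for repeatable tasks
    }
```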
4) Compress inputs before synthesis
Summarize long text to ≤70–120 tokens, preserving numbers, names, decisions.
Keep ≤3–5 snippets; dedupe hard.
Compressor
Condense to ≤90 tokens as bullets. Preserve facts, numbers, decisions; drop qualifiers.
5) Prefer schema to few-shot
Schema guard
Return JSON:
{ "title":"≤60 chars", "summary":"≤120 words", "actions":["...", "...", "..."] }
No other keys. No explanations.
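A schema guard pairs naturally with a cheap local validator, so malformed drafts never cost a second model call. A sketch, assuming the three-key schema above:

```python
import json

REQUIRED_KEYS = {"title", "summary", "actions"}

def check_schema(raw: str) -> list:
    """Return the list of failed rules (empty list means pass). A minimal sketch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    failed = []
    if set(data) != REQUIRED_KEYS:
        failed.append("keys")
    if len(data.get("title", "")) > 60:
        failed.append("title_length")
    if len(data.get("summary", "").split()) > 120:
        failed.append("summary_length")
    return failed
```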
6) Draft → Verify → Escalate (GSCP-lite)
Draft (mid): produce result under schema.
Verify (small): check schema validity + 3–5 rules; return {"pass": true|false, "failed": ["..."]}.
Fix/Escalate: one corrective pass with mid; otherwise one pass with best.
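The loop itself is short. A sketch where `call_model` and `check` are placeholders for your provider client and verifier:

```python
def draft_verify_escalate(task, call_model, check):
    """Draft on mid, verify cheaply, escalate to best only on repeated failure.
    call_model(tier, prompt) -> str and check(draft) -> list are caller-supplied."""
    draft = call_model("mid", task)
    failed = check(draft)
    if not failed:
        return draft                      # cheapest happy path
    # one corrective pass on mid before paying for the best model
    fix = call_model("mid", f"{task}\nFailed rules: {failed}. "
                            "Correct and return only the result.")
    if not check(fix):
        return fix
    return call_model("best", task)       # single top-tier pass
```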
7) Cache & reuse
Reuse embeddings, compressor outputs, and common system prompts by ID.
Keep a small library of tone/style snippets to avoid repeating them inline.
8) Observe and optimize
Track per request: input tokens, output tokens, cost, latency, pass rate, escalation rate.
Targets: escalation ≤10%, context ≤300–500 tokens for typical knowledge tasks.
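A few lines of logging are enough to compute the escalation rate you are targeting. A sketch:

```python
class RunLog:
    """Track per-request token usage and outcomes to spot waste. A sketch."""
    def __init__(self):
        self.rows = []

    def record(self, tokens_in, tokens_out, cost, passed, escalated):
        self.rows.append(dict(tokens_in=tokens_in, tokens_out=tokens_out,
                              cost=cost, passed=passed, escalated=escalated))

    def escalation_rate(self):
        # fraction of requests that needed the best model; target <= 0.10
        return sum(r["escalated"] for r in self.rows) / max(len(self.rows), 1)
```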
Ready-to-use templates
A) System (cost guardrails)
You are a cost-aware assistant. Rules:
1) Follow the schema exactly; no extra keys or commentary.
2) Respect max_tokens and stop sequences.
3) If unsure, output "INSUFFICIENT_CONTEXT".
4) Keep answers as short as constraints allow.
B) Task (user)
Task: <one sentence>
Inputs: <compressed bullets only>
Format: <schema>
Constraints: length≤N; tone=<X>; banned=<list>
Budget: max_output_tokens=M
Return only the formatted result.
C) Verifier (small model)
Given draft JSON and rules [r1..r5], return:
{"pass": true|false, "failed": ["rX", ...]}
No prose.
D) Escalation (only if needed)
Draft failed rules: ["..."]. Produce a corrected final that satisfies all rules.
Use only provided inputs. No new facts. No explanations.
Three concrete patterns
1) Short email (best cost)
Inputs: order#, delay days, new ETA, coupon.
Draft on mid model with ≤120 words schema.
Verify tone + fields present.
Escalate only if fail.
2) JSON extraction
Small model extracts to strict schema (types, ranges).
Verifier enforces types and required keys.
No strong model unless normalization fails.
3) SQL from natural language
Small: compress table/column descriptions to ≤150 tokens.
Mid: draft SQL with safety constraints (no DELETE or UPDATE; LIMIT 50).
Verify: schema references exist; query parses; includes LIMIT.
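Most of that verification is deterministic and free. A sketch using regexes (a production check should use a real SQL parser; the rule names are illustrative):

```python
import re

def sql_safety_check(query: str, known_tables: set) -> list:
    """Verify a drafted query: no mutations, LIMIT present, tables exist.
    Returns a list of failed rules; empty list means pass."""
    failed = []
    if re.search(r"\b(DELETE|UPDATE|DROP|INSERT)\b", query, re.IGNORECASE):
        failed.append("mutation")
    if not re.search(r"\bLIMIT\s+\d+", query, re.IGNORECASE):
        failed.append("missing_limit")
    for table in re.findall(r"\bFROM\s+(\w+)", query, re.IGNORECASE):
        if table not in known_tables:
            failed.append(f"unknown_table:{table}")
    return failed
```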
Cost levers that actually work
Shrink input first: it scales across every call.
Keep outputs short: exact word/line budgets.
Reduce examples: 0–1 high-quality shot only.
Lower top-k in retrieval (typically 3, rarely 5).
Streaming UI: show progress without increasing token budgets.
Batch verifications: one small call validates many drafts.
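Batching means packing several drafts into one verifier prompt instead of paying for one call each. A sketch of the prompt builder (the output format shown is illustrative):

```python
def batch_verify_prompt(drafts: list, rules: list) -> str:
    """Pack many drafts into one verifier prompt so a single cheap call checks them all."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    return (
        f"For each numbered draft, check rules {rules}.\n"
        'Return JSON: {"results": [{"id": 0, "pass": true, "failed": []}, ...]}\n'
        "No prose.\n\n" + numbered
    )
```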
Back-of-envelope
Cost ≈ input_tokens * $/1k_in + output_tokens * $/1k_out
Priority: reduce input tokens; then cap output.
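The formula makes the priority concrete — at typical rates, halving input tokens saves more than trimming output. A sketch (rates are illustrative $/1k-token prices, not any provider's real ones):

```python
def call_cost(tokens_in: int, tokens_out: int,
              rate_in_per_1k: float, rate_out_per_1k: float) -> float:
    """Back-of-envelope cost of one call: in-tokens plus out-tokens at their rates."""
    return tokens_in / 1000 * rate_in_per_1k + tokens_out / 1000 * rate_out_per_1k
```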
Troubleshooting quick fixes
Verbose answers → lower max_tokens, add stop, add “no explanations”.
Hallucinations → require citations or allow INSUFFICIENT_CONTEXT; feed fewer, cleaner snippets.
Format drift → JSON schema + strict verifier; reject and re-prompt.
High spend → move compression to small model; swap few-shot for schema; reduce retrieval top-k.
Pocket checklist
One-line goal, strict schema, tight constraints.
Token caps + “no reasoning text”.
≤3–5 compressed snippets; deduped.
Mid draft → small verify → best only if needed.
Log tokens, pass & escalation rates; prune monthly.
Use this pipeline and you’ll get consistent, high-quality outputs while paying only for the moments that truly demand top-tier reasoning.