Effective Prompt Engineering: Best Results at the Best Cost

The operating principles

  1. Outcome-first: define the artifact (e.g., 120-word email, JSON with fixed fields, SQL query). If you can’t name it, don’t prompt yet.

  2. Spend where it matters: cheap models for classify/extract/condense; strong model only for final synthesis or tricky reasoning.

  3. Token discipline: cap input/output, ban rambling (“return final result only”), and compress context before you call the strong model.

  4. Schemas over examples: zero-shot + strict schema beats long few-shot for cost and stability.

  5. Draft → Verify → Escalate: one mid-tier draft, a tiny verifier, and escalate to the best model only on failure.

  6. Measure → prune: log tokens, dollars, pass rate, and escalation rate; remove low-value instructions monthly.

A lean, repeatable pipeline

1) Frame the task

  • Goal (one line), constraints (length, tone, format), budget (max input/output tokens), checks (3–5 pass/fail rules).

Mini-spec

Goal: Produce a 120-word customer apology email.
Constraints: ≤120 words; warm; include order#, ETA, coupon; no excuses.
Budget: max_output_tokens=160
Checks: [length, tone, order#, ETA, coupon]
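
If the spec lives in code, one object can drive the prompt, the budget, and the verifier. A minimal Python sketch; TaskSpec and its field names are illustrative, not a standard:

from dataclasses import dataclass

@dataclass
class TaskSpec:
    goal: str                # one-line outcome
    constraints: list[str]   # length, tone, format
    max_output_tokens: int   # hard output budget
    checks: list[str]        # 3-5 pass/fail rule names

spec = TaskSpec(
    goal="Produce a 120-word customer apology email.",
    constraints=["<=120 words", "warm", "include order#, ETA, coupon", "no excuses"],
    max_output_tokens=160,
    checks=["length", "tone", "order#", "ETA", "coupon"],
)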

2) Route to the right model

  • Small: classify, extract, compress.

  • Mid: first draft of prose/code.

  • Best: single pass only if verifier fails or complex reasoning is required.
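
Routing rarely needs a model call of its own; a few lines of plain code suffice. A sketch with placeholder tier names (swap in your provider's actual models):

def pick_model(task_type: str, verifier_failed: bool = False) -> str:
    # "small-model" / "mid-model" / "best-model" are placeholders.
    if verifier_failed or task_type == "complex_reasoning":
        return "best-model"
    if task_type in ("classify", "extract", "compress"):
        return "small-model"
    return "mid-model"  # default: first drafts of prose and code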

3) Put hard caps in the prompt

  • Set max_tokens, add a stop sequence where applicable, and use temperature 0–0.3 for deterministic tasks.

  • Add: “Do not explain your reasoning. Return only the final result in the required format.”
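
Here is one way to wire those caps using the OpenAI Python SDK; parameter names vary slightly across providers, and the model name is only an example:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example mid-tier model
    messages=[
        {"role": "system",
         "content": "Do not explain your reasoning. Return only the final result in the required format."},
        {"role": "user", "content": "Draft the apology email per the spec."},
    ],
    max_tokens=160,   # hard output cap from the budget
    temperature=0.2,  # near-deterministic
    stop=["---"],     # optional: cut off at a known delimiter
)
print(resp.choices[0].message.content)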

4) Compress inputs before synthesis

  • Summarize long text to ≤70–120 tokens, preserving numbers, names, decisions.

  • Keep ≤3–5 snippets; dedupe hard (see the sketch after the compressor).

Compressor

Condense to ≤90 tokens as bullets. Preserve facts, numbers, decisions; drop qualifiers.
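
The snippet cap and dedupe are mechanical, so enforce them in code before any model call. A minimal sketch using exact-match dedupe (swap in embedding similarity if near-duplicates matter):

def prep_context(snippets: list[str], max_snippets: int = 4) -> list[str]:
    # Normalize whitespace and case, drop duplicates, keep the first few.
    seen, unique = set(), []
    for s in snippets:
        key = " ".join(s.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique[:max_snippets]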

5) Prefer schema to few-shot

  • Define the output with strict keys, short value budgets, and forbid extras.

Schema guard

Return JSON:
{ "title":"≤60 chars", "summary":"≤120 words", "actions":["...", "...", "..."] }
No other keys. No explanations.
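
A schema this small can be verified locally with the standard library before any verifier tokens are spent. A sketch mirroring the guard above; the rule names are illustrative:

import json

def check_schema(raw: str) -> list[str]:
    # Returns the failed rules; an empty list means pass.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["json_object"]
    failed = []
    if set(data) != {"title", "summary", "actions"}:
        failed.append("exact_keys")
    if len(str(data.get("title", ""))) > 60:
        failed.append("title_chars")
    if len(str(data.get("summary", "")).split()) > 120:
        failed.append("summary_words")
    if not isinstance(data.get("actions"), list):
        failed.append("actions_list")
    return failed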

6) Draft → Verify → Escalate (GSCP-lite)

  • Draft (mid): produce result under schema.

  • Verify (small): check schema validity + 3–5 rules; return {"pass":true|false,"failed":["..."]}.

  • Fix/Escalate: one corrective pass with the mid model; if it still fails, a single pass with the best model.
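
Glued together, the loop stays short. A sketch assuming two wrappers you supply: call(model, prompt) returning text, and verify(draft) returning the list of failed rules:

def run(task_prompt: str, call, verify) -> str:
    draft = call("mid-model", task_prompt)
    failed = verify(draft)
    if not failed:
        return draft
    fix = (f"Draft failed rules: {failed}. Produce a corrected final that "
           f"satisfies all rules. Use only provided inputs.\n\n{task_prompt}")
    retry = call("mid-model", fix)   # one corrective pass on mid
    if not verify(retry):
        return retry
    return call("best-model", fix)   # single escalation to best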

7) Cache & reuse

  • Reuse embeddings, compressor outputs, and common system prompts by ID.

  • Keep a small library of tone/style snippets to avoid repeating them inline.
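
A cache keyed on model plus exact prompt text covers the repeat cases. An in-memory sketch; use disk or Redis for anything persistent:

import hashlib

_cache: dict[str, str] = {}

def cached_call(call, model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call(model, prompt)  # only pay on a miss
    return _cache[key]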

8) Observe and optimize

  • Track per request: input tokens, output tokens, cost, latency, pass rate, escalation rate.

  • Targets: escalation ≤10%, context ≤300–500 tokens for typical knowledge tasks.
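
One JSON line per request is enough to compute every metric above. A minimal sketch; the field names are illustrative:

import json, time

def log_request(path: str, **metrics) -> None:
    metrics["ts"] = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(metrics) + "\n")

log_request("llm_metrics.jsonl",
            input_tokens=412, output_tokens=118, cost_usd=0.0009,
            latency_ms=820, passed=True, escalated=False)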

Ready-to-use templates

A) System (cost guardrails)

You are a cost-aware assistant. Rules:
1) Follow the schema exactly; no extra keys or commentary.
2) Respect max_tokens and stop sequences.
3) If unsure, output "INSUFFICIENT_CONTEXT".
4) Keep answers as short as constraints allow.

B) Task (user)

Task: <one sentence>
Inputs: <compressed bullets only>
Format: <schema>
Constraints: length≤N; tone=<X>; banned=<list>
Budget: max_output_tokens=M
Return only the formatted result.

C) Verifier (small model)

Given draft JSON and rules [r1..r5], return:
{"pass": true|false, "failed": ["rX", ...]}
No prose.

D) Escalation (only if needed)

Draft failed rules: ["..."]. Produce a corrected final that satisfies all rules.
Use only provided inputs. No new facts. No explanations.

Three concrete patterns

1) Short email (best cost)

  • Inputs: order#, delay days, new ETA, coupon.

  • Draft on the mid model under the ≤120-word schema.

  • Verify tone + fields present.

  • Escalate only if fail.

2) JSON extraction

  • Small model extracts to strict schema (types, ranges).

  • Verifier enforces types and required keys.

  • No strong model unless normalization fails.

3) SQL from natural language

  • Small: compress table/column descriptions to ≤150 tokens.

  • Mid: draft SQL with safety constraints (no DELETE or UPDATE; enforce LIMIT 50).

  • Verify: schema references exist; query parses; includes LIMIT.
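
The verify step for this pattern needs no model at all. A sketch of cheap regex checks (table names are assumed inputs; lean on a real SQL parser or the database's EXPLAIN for the "query parses" rule):

import re

FORBIDDEN = re.compile(r"\b(delete|update|insert|drop|alter)\b", re.I)

def sql_guard(query: str, known_tables: set[str]) -> list[str]:
    failed = []
    if FORBIDDEN.search(query):
        failed.append("read_only")
    if not re.search(r"\blimit\s+\d+\b", query, re.I):
        failed.append("has_limit")
    referenced = set(re.findall(r"\b(?:from|join)\s+(\w+)", query, re.I))
    if not referenced <= known_tables:
        failed.append("known_tables")
    return failed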

Cost levers that actually work

  • Shrink inputs first: the savings scale across every call.

  • Keep outputs short: exact word/line budgets.

  • Reduce examples: 0–1 high-quality shot only.

  • Lower top-k in retrieval (typically 3, rarely 5).

  • Streaming UI: show progress without increasing token budgets.

  • Batch verifications: one small call validates many drafts.

Back-of-envelope

Cost ≈ (input_tokens / 1000) × $/1k_in + (output_tokens / 1000) × $/1k_out
Priority: reduce input tokens; then cap output.
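
The same arithmetic in code, with illustrative rates:

def est_cost(input_tokens: int, output_tokens: int,
             usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    return ((input_tokens / 1000) * usd_per_1k_in
            + (output_tokens / 1000) * usd_per_1k_out)

# 500 in / 150 out at $0.15 / $0.60 per 1k tokens (illustrative rates)
print(f"${est_cost(500, 150, 0.15, 0.60):.4f}")  # $0.1650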

Troubleshooting quick fixes

  • Verbose answers → lower max_tokens, add stop, add “no explanations”.

  • Hallucinations → require citations or allow INSUFFICIENT_CONTEXT; feed fewer, cleaner snippets.

  • Format drift → JSON schema + strict verifier; reject and re-prompt.

  • High spend → move compression to small model; swap few-shot for schema; reduce retrieval top-k.

Pocket checklist

  • One-line goal, strict schema, tight constraints.

  • Token caps + “no reasoning text”.

  • ≤3–5 compressed snippets; deduped.

  • Mid draft → small verify → best only if needed.

  • Log tokens, pass & escalation rates; prune monthly.

Use this pipeline and you’ll get consistent, high-quality outputs while paying only for the moments that truly demand top-tier reasoning.