The operating principles
Outcome-first: define the artifact (e.g., 120-word email, JSON with fixed fields, SQL query). If you can’t name it, don’t prompt yet.
Spend where it matters: cheap models for classify/extract/condense; strong model only for final synthesis or tricky reasoning.
Token discipline: cap input/output, ban rambling (“return final result only”), and compress context before you call the strong model.
Schemas over examples: zero-shot + strict schema beats long few-shot for cost and stability.
Draft → Verify → Escalate: one mid-tier draft, a tiny verifier, and escalate to the best model only on failure.
Measure → prune: log tokens, dollars, pass rate, and escalation rate; remove low-value instructions monthly.
A lean, repeatable pipeline
1) Frame the task
Goal (one line), constraints (length, tone, format), budget (max input/output tokens), checks (3–5 pass/fail rules).
Mini-spec
Goal: Produce a 120-word customer apology email.
Constraints: ≤120 words; warm; include order#, ETA, coupon; no excuses.
Budget: max_output_tokens=160
Checks: [length, tone, order#, ETA, coupon]
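Frames like this stay machine-checkable if you capture them as data. A minimal sketch in Python (the field names are illustrative, not a fixed API):

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """One-line goal plus hard constraints, an output budget, and pass/fail checks."""
    goal: str
    constraints: list
    max_output_tokens: int
    checks: list

spec = TaskSpec(
    goal="Produce a 120-word customer apology email.",
    constraints=["<=120 words", "warm tone", "include order#, ETA, coupon", "no excuses"],
    max_output_tokens=160,
    checks=["length", "tone", "order#", "ETA", "coupon"],
)
```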
2) Route to the right model
Small: classify, extract, compress.
Mid: first draft of prose/code.
Best: single pass only if verifier fails or complex reasoning is required.
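Routing can be a plain function rather than anything clever — a sketch, with tier names standing in for your provider's model IDs:

```python
def route(task_type: str, verifier_failed: bool = False) -> str:
    """Pick the cheapest model tier that can handle the task.
    Tier names ("small", "mid", "best") are placeholders for real model IDs."""
    if verifier_failed:
        return "best"          # single escalation pass only
    if task_type in {"classify", "extract", "compress"}:
        return "small"
    return "mid"               # first drafts of prose/code
```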
3) Put hard caps in the prompt
Set max_tokens, add stop sequences if applicable, and use temperature 0–0.3 for deterministic tasks.
Add: “Do not explain your reasoning. Return only the final result in the required format.”
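The caps belong in the request itself, not just in the prose. A sketch of the request parameters, using common chat-API parameter names (adjust for your provider):

```python
def capped_request(prompt: str, max_tokens: int = 160) -> dict:
    """Build request parameters with hard output caps.
    Parameter names follow common chat-API conventions and may differ per provider."""
    return {
        "messages": [
            {"role": "system", "content": "Do not explain your reasoning. "
             "Return only the final result in the required format."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,   # hard output cap
        "stop": ["\n\n---"],        # optional stop sequence
        "temperature": 0.2,         # near-deterministic for repeatable tasks
    }
```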
4) Compress inputs before synthesis
Summarize long text to ≤70–120 tokens, preserving numbers, names, decisions.
Keep ≤3–5 snippets; dedupe hard.
Compressor
Condense to ≤90 tokens as bullets. Preserve facts, numbers, decisions; drop qualifiers.
5) Prefer schema to few-shot
Schema guard
Return JSON:
{ "title":"≤60 chars", "summary":"≤120 words", "actions":["...", "...", "..."] }
No other keys. No explanations.
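A schema guard pairs naturally with a cheap local validator, so malformed drafts never cost a second model call. A sketch, assuming the three-key schema above:

```python
import json

REQUIRED_KEYS = {"title", "summary", "actions"}

def check_schema(raw: str) -> list:
    """Return the list of failed rules (empty list means pass). A minimal sketch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    failed = []
    if set(data) != REQUIRED_KEYS:
        failed.append("keys")
    if len(data.get("title", "")) > 60:
        failed.append("title_length")
    if len(data.get("summary", "").split()) > 120:
        failed.append("summary_length")
    return failed
```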
6) Draft → Verify → Escalate (GSCP-lite)
Draft (mid): produce result under schema.
Verify (small): check schema validity + 3–5 rules; return {"pass": true|false, "failed": ["..."]}.
Fix/Escalate: one corrective pass with mid; otherwise one pass with best.
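The loop itself is short. A sketch where `call_model` and `check` are placeholders for your provider client and verifier:

```python
def draft_verify_escalate(task, call_model, check):
    """Draft on mid, verify cheaply, escalate to best only on repeated failure.
    call_model(tier, prompt) -> str and check(draft) -> list are caller-supplied."""
    draft = call_model("mid", task)
    failed = check(draft)
    if not failed:
        return draft                      # cheapest happy path
    # one corrective pass on mid before paying for the best model
    fix = call_model("mid", f"{task}\nFailed rules: {failed}. "
                            "Correct and return only the result.")
    if not check(fix):
        return fix
    return call_model("best", task)       # single top-tier pass
```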
7) Cache & reuse
Reuse embeddings, compressor outputs, and common system prompts by ID.
Keep a small library of tone/style snippets to avoid repeating them inline.
8) Observe and optimize
Track per request: input tokens, output tokens, cost, latency, pass rate, escalation rate.
Targets: escalation ≤10%, context ≤300–500 tokens for typical knowledge tasks.
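A few lines of logging are enough to compute the escalation rate you are targeting. A sketch:

```python
class RunLog:
    """Track per-request token usage and outcomes to spot waste. A sketch."""
    def __init__(self):
        self.rows = []

    def record(self, tokens_in, tokens_out, cost, passed, escalated):
        self.rows.append(dict(tokens_in=tokens_in, tokens_out=tokens_out,
                              cost=cost, passed=passed, escalated=escalated))

    def escalation_rate(self):
        # fraction of requests that needed the best model; target <= 0.10
        return sum(r["escalated"] for r in self.rows) / max(len(self.rows), 1)
```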
Ready-to-use templates
A) System (cost guardrails)
You are a cost-aware assistant. Rules:
1) Follow the schema exactly; no extra keys or commentary.
2) Respect max_tokens and stop sequences.
3) If unsure, output "INSUFFICIENT_CONTEXT".
4) Keep answers as short as constraints allow.
B) Task (user)
Task: <one sentence>
Inputs: <compressed bullets only>
Format: <schema>
Constraints: length≤N; tone=<X>; banned=<list>
Budget: max_output_tokens=M
Return only the formatted result.
C) Verifier (small model)
Given draft JSON and rules [r1..r5], return:
{"pass": true|false, "failed": ["rX", ...]}
No prose.
D) Escalation (only if needed)
Draft failed rules: ["..."]. Produce a corrected final that satisfies all rules.
Use only provided inputs. No new facts. No explanations.
Three concrete patterns
1) Short email (best cost)
Inputs: order#, delay days, new ETA, coupon.
Draft on mid model with ≤120 words schema.
Verify tone + fields present.
Escalate only if fail.
2) JSON extraction
Small model extracts to strict schema (types, ranges).
Verifier enforces types and required keys.
No strong model unless normalization fails.
3) SQL from natural language
Small: compress table/column descriptions to ≤150 tokens.
Mid: draft SQL with safety constraints (no DELETE or UPDATE; LIMIT 50).
Verify: schema references exist; query parses; includes LIMIT.
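Most of that verification is deterministic and free. A sketch using regexes (a production check should use a real SQL parser; the rule names are illustrative):

```python
import re

def sql_safety_check(query: str, known_tables: set) -> list:
    """Verify a drafted query: no mutations, LIMIT present, tables exist.
    Returns a list of failed rules; empty list means pass."""
    failed = []
    if re.search(r"\b(DELETE|UPDATE|DROP|INSERT)\b", query, re.IGNORECASE):
        failed.append("mutation")
    if not re.search(r"\bLIMIT\s+\d+", query, re.IGNORECASE):
        failed.append("missing_limit")
    for table in re.findall(r"\bFROM\s+(\w+)", query, re.IGNORECASE):
        if table not in known_tables:
            failed.append(f"unknown_table:{table}")
    return failed
```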
Cost levers that actually work
Shrink input first: it scales across every call.
Keep outputs short: exact word/line budgets.
Reduce examples: 0–1 high-quality shot only.
Lower top-k in retrieval (typically 3, rarely 5).
Streaming UI: show progress without increasing token budgets.
Batch verifications: one small call validates many drafts.
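Batching means packing several drafts into one verifier prompt instead of paying for one call each. A sketch of the prompt builder (the output format shown is illustrative):

```python
def batch_verify_prompt(drafts: list, rules: list) -> str:
    """Pack many drafts into one verifier prompt so a single cheap call checks them all."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(drafts))
    return (
        f"For each numbered draft, check rules {rules}.\n"
        'Return JSON: {"results": [{"id": 0, "pass": true, "failed": []}, ...]}\n'
        "No prose.\n\n" + numbered
    )
```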
Back-of-envelope
Cost ≈ input_tokens * $/1k_in + output_tokens * $/1k_out
Priority: reduce input tokens; then cap output.
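The formula makes the priority concrete — at typical rates, halving input tokens saves more than trimming output. A sketch (rates are illustrative $/1k-token prices, not any provider's real ones):

```python
def call_cost(tokens_in: int, tokens_out: int,
              rate_in_per_1k: float, rate_out_per_1k: float) -> float:
    """Back-of-envelope cost of one call: in-tokens plus out-tokens at their rates."""
    return tokens_in / 1000 * rate_in_per_1k + tokens_out / 1000 * rate_out_per_1k
```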
Troubleshooting quick fixes
Verbose answers → lower max_tokens, add stop, add “no explanations”.
Hallucinations → require citations or allow INSUFFICIENT_CONTEXT; feed fewer, cleaner snippets.
Format drift → JSON schema + strict verifier; reject and re-prompt.
High spend → move compression to small model; swap few-shot for schema; reduce retrieval top-k.
Pocket checklist
One-line goal, strict schema, tight constraints.
Token caps + “no reasoning text”.
≤3–5 compressed snippets; deduped.
Mid draft → small verify → best only if needed.
Log tokens, pass & escalation rates; prune monthly.
Use this pipeline and you’ll get consistent, high-quality outputs while paying only for the moments that truly demand top-tier reasoning.