Free tool

What to log in an LLM app

A 200 OK from a model call hides refusals, hallucinations, truncated answers, and broken JSON. This builder generates the exact things to instrument in a production LLM system — the log schema, metrics, alerts, and OpenTelemetry attributes — tailored to your stack.

Pick your stack → copy the schema. Free to use, no signup. Built by a data engineer, not a vendor.

Capture these on every LLM call.

trace_id (required)
string

Correlate every span/log of one user request. The single most useful field when debugging.

timestamp (required)
string (ISO-8601)

When the call happened. Use UTC.

session_id
string

Group calls within one conversation/session to see multi-turn behaviour.

model (required)
string

Exact model id incl. version/date. Silent model swaps are a top cause of 'it got worse overnight'.

provider (required)
string

Which API served the call — you will run more than one.

prompt_version (required)
string

Version every prompt template. You cannot debug a regression you can't pin to a prompt revision.

temperature
float

Sampling params change outputs; capture them to reproduce a bad response.

input_tokens (required)
int

Prompt size. Drives cost and latency; the first thing to check on a cost spike.

output_tokens (required)
int

Completion size. Runaway generations show up here first.

cost_usd (required)
float

Computed cost (tokens × model price). Log it at write time — back-calculating later is painful.

latency_ms (required)
int

End-to-end wall time. Track percentiles, never the average.

finish_reason (required)
string

stop | length | content_filter | tool_calls. 'length' means the output hit the token limit and was cut off, which is a silent quality bug.

status (required)
string

ok | error. The basis of your error rate.

error_type (required)
string | null

rate_limit | timeout | server_error | parse_error. Tells you whose fault it is.

retry_count
int

How many retries this request needed. Rising retries = upstream instability you'd otherwise miss.

refusal
bool

Did the model decline to answer? A refusal-rate spike is an early warning of a prompt or model regression.

user_feedback
enum(up,down,null)

The only ground-truth signal you get for free. Wire a thumbs up/down and log it here.

ttft_ms (required)
int

Time to first token — the latency users actually feel while streaming. Often matters more than total latency.

stream_duration_ms
int

First token → last token. Separates 'slow to start' from 'slow to finish'.

retrieval_latency_ms (required)
int

Time spent in vector search. Isolates retrieval slowness from generation slowness.

num_chunks (required)
int

How many chunks were sent to the model. Sudden changes hint at a retrieval or chunking bug.

top_similarity_score (required)
float

Score of the best-matching chunk. A falling trend means retrieval is degrading (stale index, bad embeddings).

retrieved_chunk_ids
string[]

Which chunks were used. Lets you reproduce exactly what the model saw.

reranker_used
bool

Track whether the rerank path ran, so you can measure its effect.
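
The cost_usd field above is tokens × model price, computed at write time. A minimal sketch of that calculation, assuming a per-million-token price table: the model id, the prices, and the names PRICES_PER_MTOK and compute_cost_usd are all hypothetical placeholders, not real provider pricing.

```python
from decimal import Decimal

# Hypothetical price table in USD per 1M tokens. The model id and the numbers
# are placeholders, not real provider prices; load your own from config.
PRICES_PER_MTOK = {
    "example-model-2026-01-01": {"input": Decimal("1.00"), "output": Decimal("4.00")},
}


def compute_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return tokens x model price for one call, rounded to 6 decimal places."""
    # A KeyError on an unpriced model is deliberate: better to fail than log 0.
    price = PRICES_PER_MTOK[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / Decimal(1_000_000)
    return float(round(cost, 6))


print(compute_cost_usd("example-model-2026-01-01", 812, 143))
```

Decimal is used here so that many small per-call costs don't pick up float rounding noise before they are summed; whether that matters depends on your volumes.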

Sample log line
{
  "trace_id": "req_8f2a…",
  "timestamp": "2026-06-12T14:03:21Z",
  "model": "gpt-4o-mini-2024-07-18",
  "provider": "openai",
  "prompt_version": "support-answer@v7",
  "input_tokens": 812,
  "output_tokens": 143,
  "cost_usd": 0.00021,
  "latency_ms": 1840,
  "finish_reason": "stop",
  "status": "ok",
  "error_type": null,
  "refusal": false,
  "ttft_ms": 410,
  "retrieval_latency_ms": 120,
  "num_chunks": 5,
  "top_similarity_score": 0.81
}
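
One way to produce a line like the sample above is a small helper that refuses to emit a record with required fields missing. This is a sketch using only the standard library; build_log_line and REQUIRED_FIELDS are hypothetical names, and treating the streaming and retrieval fields as optional is an assumption for stacks that don't use them.

```python
import json
from datetime import datetime, timezone

# Core required fields from the schema above. Streaming and retrieval fields
# (ttft_ms, retrieval_latency_ms, ...) are left out on the assumption that
# they only apply to stacks that stream or do retrieval.
REQUIRED_FIELDS = (
    "trace_id", "timestamp", "model", "provider", "prompt_version",
    "input_tokens", "output_tokens", "cost_usd", "latency_ms",
    "finish_reason", "status", "error_type",
)


def build_log_line(**fields) -> str:
    """Serialize one LLM call as a single JSON line, failing loudly on gaps."""
    # Default the timestamp to "now" in UTC; a caller-supplied value wins.
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    record = {"timestamp": now, **fields}
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return json.dumps(record, ensure_ascii=False)


print(build_log_line(
    trace_id="req_demo",  # hypothetical id for illustration
    model="gpt-4o-mini-2024-07-18",
    provider="openai",
    prompt_version="support-answer@v7",
    input_tokens=812,
    output_tokens=143,
    cost_usd=0.00021,
    latency_ms=1840,
    finish_reason="stop",
    status="ok",
    error_type=None,  # required but nullable: pass None explicitly
))
```

Note that error_type must be passed even when it is null, so a forgotten field and a deliberate "no error" stay distinguishable.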

Why LLM observability is its own problem

Surveys keep landing on the same finding: evaluation and reliability are the number-one pain for teams shipping LLMs, and most teams are flying blind. A large share don't monitor their LLM calls at all, and the rest roll their own because generic APM tools don't understand tokens, prompts, or retrieval.

The gap is that a successful HTTP response is not a successful answer. An LLM call can return 200 OK and still refuse, hallucinate, hit the token limit mid-sentence, or emit JSON that won't parse. None of that shows up unless you capture LLM-specific signals on every call. That's what the schema above is for.
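
Two of those failure modes, truncation and unparseable JSON, can be caught mechanically after a 200 response. A minimal sketch, assuming you already have the finish reason and the response text in hand: classify_response and the truncated flag are hypothetical names (truncated is not a schema field), and refusals and hallucinations need separate checks that this does not attempt.

```python
import json


def classify_response(finish_reason: str, text: str, expect_json: bool = False) -> dict:
    """Derive status, error_type and a truncation flag for an HTTP-200 call."""
    result = {
        "status": "ok",
        "error_type": None,
        # 'length' means the output stopped at the token limit.
        "truncated": finish_reason == "length",
    }
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            # The HTTP call succeeded, but the answer is unusable downstream.
            result["status"] = "error"
            result["error_type"] = "parse_error"
    return result


# A response cut off mid-string: HTTP success, but truncated and unparseable.
print(classify_response("length", '{"answer": "The refund pol', expect_json=True))
```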

FAQ

Why is observability for LLM apps different from normal APIs?

A 200 OK from an LLM call tells you almost nothing — the response can be a refusal, a hallucination, truncated at the token limit, or invalid JSON, and your HTTP monitoring will call all of those 'success'. You have to capture LLM-specific signals (tokens, cost, finish reason, refusal, retrieval quality) to see what's actually happening.

What's the single most important field to log?

A trace_id that ties together every span of one request, closely followed by a versioned prompt id. Most 'it got worse overnight' incidents are a silent model or prompt change — you can only see those if every call is pinned to a model version and a prompt version.
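
Once every record carries model and prompt_version, a regression can be pinned to a revision by grouping on those two fields. A sketch under that assumption, using refusal rate as the signal: refusal_rate_by_version is a hypothetical helper, and the records and the v8 prompt version are invented for illustration.

```python
from collections import defaultdict


def refusal_rate_by_version(records):
    """Map (model, prompt_version) -> share of calls where refusal is true."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for rec in records:
        key = (rec["model"], rec["prompt_version"])
        totals[key] += 1
        # A missing refusal field is counted as "not a refusal".
        refusals[key] += bool(rec.get("refusal", False))
    return {key: refusals[key] / totals[key] for key in totals}


# Invented records: same model, two prompt revisions.
MODEL = "gpt-4o-mini-2024-07-18"
records = [
    {"model": MODEL, "prompt_version": "support-answer@v7", "refusal": False},
    {"model": MODEL, "prompt_version": "support-answer@v7", "refusal": False},
    {"model": MODEL, "prompt_version": "support-answer@v7", "refusal": True},
    {"model": MODEL, "prompt_version": "support-answer@v7", "refusal": False},
    {"model": MODEL, "prompt_version": "support-answer@v8", "refusal": True},
    {"model": MODEL, "prompt_version": "support-answer@v8", "refusal": True},
]

for key, rate in refusal_rate_by_version(records).items():
    print(key, rate)
```

The same grouping works for any other per-call signal, such as the share of 'length' finish reasons or thumbs-down feedback.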

Do I need a vendor tool to start?

No. One structured JSON log line per call — grep- and jq-able — beats no observability and is something you can ship today. Add OpenTelemetry or a backend like Grafana or Datadog once the schema is stable. This tool gives you the schema first.
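
To illustrate how far plain JSON lines go before any backend is involved, here is a sketch that reads newline-delimited log records and computes an error rate, a truncation rate, and latency percentiles. The function names and the four sample records are hypothetical, and the percentile uses the simple nearest-rank method, which is only one of several common definitions.

```python
import json
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with >= pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(jsonl_text: str) -> dict:
    """Aggregate one-JSON-object-per-line logs into a few headline metrics."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    latencies = [r["latency_ms"] for r in records]
    errors = sum(1 for r in records if r["status"] == "error")
    truncated = sum(1 for r in records if r.get("finish_reason") == "length")
    return {
        "calls": len(records),
        "error_rate": errors / len(records),
        "truncation_rate": truncated / len(records),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Invented sample data, one record per line.
SAMPLE = """
{"latency_ms": 900, "status": "ok", "finish_reason": "stop"}
{"latency_ms": 1840, "status": "ok", "finish_reason": "stop"}
{"latency_ms": 1200, "status": "error", "finish_reason": null}
{"latency_ms": 7400, "status": "ok", "finish_reason": "length"}
"""

print(summarize(SAMPLE))
```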
