What to log in an LLM app
A 200 OK from a model call hides refusals, hallucinations, truncated answers, and broken JSON. This builder generates the exact things to instrument in a production LLM system — the log schema, metrics, alerts, and OpenTelemetry attributes — tailored to your stack.
Pick your stack → copy the schema. Free to use, no signup. Built by a data engineer, not a vendor.
Capture these on every LLM call.
trace_id (required): Correlate every span/log of one user request. The single most useful field when debugging.
timestamp (required): When the call happened. Use UTC.
session_id: Group calls within one conversation/session to see multi-turn behaviour.
model (required): Exact model id incl. version/date. Silent model swaps are a top cause of 'it got worse overnight'.
provider (required): Which API served the call; you will run more than one.
prompt_version (required): Version every prompt template. You cannot debug a regression you can't pin to a prompt revision.
temperature: Sampling params change outputs; capture them to reproduce a bad response.
input_tokens (required): Prompt size. Drives cost and latency; the first thing to check on a cost spike.
output_tokens (required): Completion size. Runaway generations show up here first.
cost_usd (required): Computed cost (tokens × model price). Log it at write time; back-calculating later is painful.
latency_ms (required): End-to-end wall time. Track percentiles, never the average.
finish_reason (required): stop | length | content_filter | tool_calls. 'length' means you truncated the answer, a silent quality bug.
status (required): ok | error. The basis of your error rate.
error_type (required): rate_limit | timeout | server_error | parse_error. Tells you whose fault it is.
retry_count: How many retries this request needed. Rising retries = upstream instability you'd otherwise miss.
refusal: Did the model decline to answer? A refusal-rate spike is an early warning of a prompt or model regression.
user_feedback: The only ground-truth signal you get for free. Wire a thumbs up/down and log it here.
ttft_ms (required): Time to first token, the latency users actually feel while streaming. Often matters more than total latency.
stream_duration_ms: First token → last token. Separates 'slow to start' from 'slow to finish'.
retrieval_latency_ms (required): Time spent in vector search. Isolates retrieval slowness from generation slowness.
num_chunks (required): How many chunks were sent to the model. Sudden changes hint at a retrieval or chunking bug.
top_similarity_score (required): Score of the best-matching chunk. A falling trend means retrieval is degrading (stale index, bad embeddings).
retrieved_chunk_ids: Which chunks were used. Lets you reproduce exactly what the model saw.
reranker_used: Track whether the rerank path ran, so you can measure its effect.
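Two fields in the list above involve a small calculation: cost_usd is tokens times model price, and latency_ms is meant to be summarized with percentiles rather than an average. The following is a minimal sketch of both, not a definitive implementation. The model name "example-model", its per-million-token prices, and the function names are assumptions for illustration; substitute your provider's current price list.

```python
import math

# Hypothetical prices in USD per 1M tokens (an assumption for illustration).
PRICES_PER_MTOK = {
    "example-model": {"input": 0.15, "output": 0.60},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens x model price, computed at write time."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Nine ordinary calls and one very slow one.
latencies_ms = [400, 420, 450, 480, 500, 510, 530, 560, 600, 9000]
mean_ms = sum(latencies_ms) / len(latencies_ms)  # pulled up by the one outlier
p50_ms = percentile(latencies_ms, 50)            # what a typical call looks like
p95_ms = percentile(latencies_ms, 95)            # what the slow tail looks like
```

In this sample the mean sits far above the typical call and far below the slow one, so it describes neither; the p50/p95 pair keeps both visible.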
{
  "trace_id": "req_8f2a…",
  "timestamp": "2026-06-12T14:03:21Z",
  "model": "gpt-4o-mini-2024-07-18",
  "provider": "openai",
  "prompt_version": "support-answer@v7",
  "input_tokens": 812,
  "output_tokens": 143,
  "cost_usd": 0.00021,
  "latency_ms": 1840,
  "finish_reason": "stop",
  "status": "ok",
  "error_type": null,
  "refusal": false,
  "ttft_ms": 410,
  "retrieval_latency_ms": 120,
  "num_chunks": 5,
  "top_similarity_score": 0.81
}
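A record like the sample above can be written as one JSON line per call. The sketch below is one possible way to do that, using only the Python standard library. The helper names build_record and emit, the REQUIRED tuple, and the "req_example" trace id are assumptions for illustration, not part of any library.

```python
import io
import json
import sys
import uuid
from datetime import datetime, timezone

# Core required fields from the schema; trace_id and timestamp are filled in below.
REQUIRED = (
    "model", "provider", "prompt_version", "input_tokens", "output_tokens",
    "cost_usd", "latency_ms", "finish_reason", "status", "error_type",
)

def build_record(trace_id=None, **fields):
    """Assemble one log record, refusing to build it if a required key is absent."""
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        # UTC timestamp in the same shape as the sample record.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    record.update(fields)
    return record

def emit(record, stream=None):
    """Write the record as a single JSON line (stdout by default)."""
    stream = stream or sys.stdout
    stream.write(json.dumps(record, separators=(",", ":")) + "\n")

# Usage: write one record to an in-memory buffer instead of stdout.
buffer = io.StringIO()
emit(
    build_record(
        trace_id="req_example",
        model="gpt-4o-mini-2024-07-18",
        provider="openai",
        prompt_version="support-answer@v7",
        input_tokens=812,
        output_tokens=143,
        cost_usd=0.00021,
        latency_ms=1840,
        finish_reason="stop",
        status="ok",
        error_type=None,  # required key, null when the call succeeded
        refusal=False,    # optional fields pass straight through
    ),
    buffer,
)
```

Checking for required keys at write time means a missing field fails loudly in development instead of surfacing later as a gap in a dashboard.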
Why LLM observability is its own problem
Surveys keep landing on the same finding: evaluation and reliability are the number-one pain point for teams shipping LLMs, and most teams are flying blind. A large share don't monitor their LLM calls at all, and the rest roll their own because generic APM tools don't understand tokens, prompts, or retrieval.
The gap is that a successful HTTP response is not a successful answer. An LLM call can return 200 OK and still refuse, hallucinate, hit the token limit mid-sentence, or emit JSON that won't parse. None of that shows up unless you capture LLM-specific signals on every call. That's what the schema above is for.
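Some of these failures can be flagged mechanically from fields the schema already captures. The sketch below is a rough illustration under stated assumptions: the function name find_silent_failures and its issue labels are hypothetical, the refusal flag is assumed to come from whatever refusal detection you already run, and hallucinations are out of scope because they cannot be caught by a check this simple.

```python
import json

def find_silent_failures(finish_reason, text, refusal=False, expect_json=False):
    """Return the problems hiding behind an HTTP-level success."""
    issues = []
    if finish_reason == "length":
        issues.append("truncated")      # output hit the token limit
    if finish_reason == "content_filter":
        issues.append("filtered")       # provider filtered the output
    if refusal:
        issues.append("refusal")        # model declined to answer
    if expect_json:
        try:
            json.loads(text)
        except (ValueError, TypeError):  # JSONDecodeError subclasses ValueError
            issues.append("parse_error")
    return issues

# A response that was cut off mid-object fails twice over.
clean = find_silent_failures("stop", '{"answer": 1}', expect_json=True)
cut_off = find_silent_failures("length", '{"answer": ', expect_json=True)
```

An empty list means none of these checks fired, not that the answer is correct.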
FAQ
Why is observability for LLM apps different from normal APIs?
A 200 OK from an LLM call tells you almost nothing — the response can be a refusal, a hallucination, truncated at the token limit, or invalid JSON, and your HTTP monitoring will call all of those 'success'. You have to capture LLM-specific signals (tokens, cost, finish reason, refusal, retrieval quality) to see what's actually happening.
What's the single most important field to log?
A trace_id that ties together every span of one request, closely followed by a versioned prompt id. Most 'it got worse overnight' incidents are a silent model or prompt change — you can only see those if every call is pinned to a model version and a prompt version.
Do I need a vendor tool to start?
No. One structured JSON log line per call — grep- and jq-able — beats no observability and is something you can ship today. Add OpenTelemetry or a backend like Grafana or Datadog once the schema is stable. This tool gives you the schema first.
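As a sketch of what plain JSON lines allow before any backend exists, the snippet below groups records by model and prompt version and computes error and refusal rates. The function name summarize and the sample values ("m-1", "p@v7", "p@v8") are made up for illustration; in practice the lines would come from your own log file.

```python
import json
from collections import defaultdict

def summarize(lines):
    """Per (model, prompt_version): call count, error rate, refusal rate."""
    counts = defaultdict(lambda: {"calls": 0, "errors": 0, "refusals": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        bucket = counts[(record["model"], record["prompt_version"])]
        bucket["calls"] += 1
        bucket["errors"] += record.get("status") == "error"
        bucket["refusals"] += bool(record.get("refusal"))
    return {
        key: {
            "calls": b["calls"],
            "error_rate": b["errors"] / b["calls"],
            "refusal_rate": b["refusals"] / b["calls"],
        }
        for key, b in counts.items()
    }

# Made-up sample: the same model under two prompt revisions.
sample_lines = [
    '{"model":"m-1","prompt_version":"p@v7","status":"ok","refusal":false}',
    '{"model":"m-1","prompt_version":"p@v7","status":"error","refusal":false}',
    '{"model":"m-1","prompt_version":"p@v8","status":"ok","refusal":true}',
    '{"model":"m-1","prompt_version":"p@v8","status":"ok","refusal":true}',
]
report = summarize(sample_lines)
```

Keying the summary on model and prompt version is what makes a silent change visible: a regression shows up as one key's rates diverging from another's.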