How to Monitor LLM Calls in Production: A Complete Setup Guide

Most teams add monitoring as an afterthought — after the first incident that left them debugging in the dark. By then, they've already lost the data they needed.

This post covers the complete monitoring setup for production LLM-based workflows: what to instrument, what metrics to track, how to structure alerts, and which tools to use.

Why Standard Application Monitoring Isn't Enough

Infrastructure monitoring tells you the service is up. It doesn't tell you whether the model is producing correct outputs, whether latency is acceptable at p95, or whether costs are on track.

An LLM-based workflow can be "up" by every infrastructure metric while silently returning hallucinated outputs, exceeding its cost budget, or suffering from prompt drift after a provider-side model update. Standard monitoring doesn't catch any of these.

AI-specific monitoring instruments what the model does, not just whether the service runs.

1. LLM Call Tracing

The foundation of AI observability is a structured log of every LLM call.

What to log per call:

Timestamp (request and response)
Model and model version
Prompt template ID and version
Rendered input (the actual prompt sent, after variable substitution)
Raw output
Latency (total and time-to-first-token if streaming)
Token counts (input and output)
Calculated cost
Pass/fail from any output validation

Logging the rendered prompt — not just the template — is the part most teams skip. When a production incident occurs, you need to know exactly what the model was asked, not what the template looked like at design time.

Prompt versioning: Tag every logged call with the prompt template version. This lets you query "all calls where prompt version was v12" and compare quality before and after changes.
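
A minimal sketch of what one such log record might look like, assuming a plain Python dataclass serialized to JSON. The class name, field names, and the calls_for_prompt_version helper are illustrative, not taken from any particular tracing tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LLMCallRecord:
    """One structured log entry per LLM call (field names are hypothetical)."""
    request_ts: float
    response_ts: float
    model: str
    model_version: str
    prompt_template_id: str
    prompt_version: str
    rendered_prompt: str   # the prompt actually sent, after variable substitution
    raw_output: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    validation_passed: Optional[bool] = None
    time_to_first_token_s: Optional[float] = None  # only set when streaming

    @property
    def latency_s(self) -> float:
        """Total latency, derived from the two timestamps."""
        return self.response_ts - self.request_ts

    def to_json(self) -> str:
        """Serialize the record, including the derived latency, as one JSON line."""
        payload = asdict(self)
        payload["latency_s"] = self.latency_s
        return json.dumps(payload, sort_keys=True)


def calls_for_prompt_version(records, version):
    """Return all logged calls tagged with the given prompt template version."""
    return [r for r in records if r.prompt_version == version]
```

In practice the filter would usually be a query in your log store rather than an in-memory list, but the shape of the record is the part that matters.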

Tools: LangSmith integrates directly with LangChain and LangGraph. Langfuse is open-source and self-hostable. Helicone is proxy-based: you change the base URL on your OpenAI client (and add Helicone's auth header), and every call routed through the proxy is captured automatically.

2. Cost Tracking and Budget Alerts

LLM costs surprise teams in two ways: the per-token cost compounds faster than expected at scale, and individual calls can be far more expensive than estimated when users provide longer inputs or trigger multi-step reasoning chains.

What to track:

Cost per request (token counts × per-token rate)
Cost by workflow or endpoint
Daily and monthly totals
Week-over-week cost trend
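
As a rough sketch, cost per request and cost by workflow can be computed from logged token counts. The rate table below is a made-up placeholder (USD per million tokens), and the call dictionaries and function names are assumptions for illustration; substitute your provider's actual pricing.

```python
from collections import defaultdict

# Hypothetical rates in USD per 1M tokens; not real provider pricing.
RATES_PER_MILLION = {"example-model": {"input": 1.00, "output": 4.00}}


def request_cost(model, input_tokens, output_tokens, rates=RATES_PER_MILLION):
    """Cost of one call: token counts multiplied by the per-token rate."""
    rate = rates[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


def cost_by_workflow(calls, rates=RATES_PER_MILLION):
    """Sum request costs per workflow from a list of logged call dicts."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["workflow"]] += request_cost(
            call["model"], call["input_tokens"], call["output_tokens"], rates
        )
    return dict(totals)
```

Daily, monthly, and week-over-week figures are then aggregations of the same per-request number over different time windows.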

Alerts to configure:

Single request cost exceeds threshold — signals a runaway generation or unexpected input
Daily cost exceeds budget cap — signals sustained overuse
Week-over-week cost increase exceeds 30% — signals a systemic change
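
The three alert rules above could be expressed as one small check. This is a sketch under assumptions: the function name, alert labels, and the per-request and daily thresholds are placeholders, and only the 30% week-over-week figure comes from the text.

```python
def cost_alerts(
    request_cost_usd,
    daily_total_usd,
    this_week_usd,
    last_week_usd,
    *,
    max_request_usd=0.50,     # hypothetical per-request threshold
    daily_budget_usd=200.0,   # hypothetical daily budget cap
    max_wow_increase=0.30,    # 30% week-over-week, as in the list above
):
    """Return the names of all cost alerts that should fire."""
    alerts = []
    if request_cost_usd > max_request_usd:
        alerts.append("single_request_over_threshold")
    if daily_total_usd > daily_budget_usd:
        alerts.append("daily_budget_exceeded")
    # Skip the trend check when there is no prior week to compare against.
    if last_week_usd > 0:
        increase = (this_week_usd - last_week_usd) / last_week_usd
        if increase > max_wow_increase:
            alerts.append("week_over_week_spike")
    return alerts
```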

Setting a hard spending limit on the API account itself (not just an alert) provides a final backstop.

3. Latency at p50 and p95

Average latency is misleading for LLM workloads. The distribution has a long tail: most requests might complete in 1.1 seconds, but 5% might take 8+ seconds — due to output length variation, provider-side load, or retry logic engaging.

For user-facing AI features, p95 latency describes the slowest 1 in 20 requests: 5% of requests take at least that long. That's the metric to set SLA thresholds against, not the average.

What to track:

p50 (baseline indicator)
p95 (primary user experience metric and SLA target)
p99 (tail behavior, for stricter SLA commitments)
Time-to-first-token, separately from total latency (important for streaming interfaces)

Alert: p95 exceeds your defined SLA threshold for more than 5 consecutive minutes.
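
A minimal sketch of the percentile calculation and the sustained-breach alert, using the nearest-rank definition of a percentile. The function names and the assumption that you already aggregate one p95 value per minute are illustrative; most metrics backends compute percentiles for you.

```python
import math


def percentile(latencies_s, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """p50, p95, and p99 for one window of latency samples."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}


def sustained_breach(per_minute_p95_s, threshold_s, max_minutes=5):
    """True when the trailing run of one-minute p95 values above the threshold
    is longer than max_minutes."""
    streak = 0
    for value in reversed(per_minute_p95_s):
        if value <= threshold_s:
            break
        streak += 1
    return streak > max_minutes
```

With 94 requests at 1.1 seconds and 6 at 8 seconds, the mean is about 1.5 seconds while p95 is 8 seconds, which is why the average hides the tail.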

4. Output Quality Sampling

Automated metrics capture performance and cost. They don't tell you whether the model is producing correct, useful outputs. Quality sampling fills this gap.

Take 1-5% of production outputs, route them to a human review queue, and score them against defined quality criteria. This gives you a quality signal that's impossible to get from logs alone.

How to structure it:

Sample randomly, weighted toward low-confidence outputs if your workflow produces confidence scores
Define explicit scoring criteria (factually correct? correct format? appropriate for context?)
Track quality score over time — degradation is a leading indicator of model drift or prompt issues
Feed reviewed outputs into your eval dataset
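
One way to sketch the weighted sampling: give every output a base review probability and raise it as confidence drops. The function names, the 2% base rate, and the linear boost are assumptions for illustration, not a prescribed scheme.

```python
import random


def review_probability(confidence=None, base_rate=0.02, low_confidence_boost=3.0):
    """Probability of routing an output to human review.

    Outputs without a confidence score are sampled at the base rate.
    Lower confidence scales the probability up, to at most
    (1 + low_confidence_boost) times the base rate.
    """
    if confidence is None:
        return base_rate
    confidence = min(1.0, max(0.0, confidence))
    return min(1.0, base_rate * (1.0 + low_confidence_boost * (1.0 - confidence)))


def select_for_review(outputs, rng, **kwargs):
    """Pick outputs (dicts with an optional 'confidence' key) for the review queue."""
    return [
        o for o in outputs
        if rng.random() < review_probability(o.get("confidence"), **kwargs)
    ]


# Usage: pass a seeded generator if you need reproducible samples.
# queue = select_for_review(outputs, random.Random(42))
```

Note that boosting low-confidence outputs pushes the overall sampled share above the base rate, so pick the base rate with that in mind.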

Automated quality signals to complement human review: output length distribution (a sudden shift signals something changed), output format compliance rate, and LLM-as-judge scoring on sampled outputs.
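
Two of these signals are cheap to compute. The sketch below assumes the workflow is expected to return a JSON object with known keys and measures length in characters; both choices, and the function names, are illustrative.

```python
import json
from statistics import median


def format_compliance_rate(outputs, required_keys):
    """Share of outputs that parse as a JSON object containing all required keys."""
    if not outputs:
        return 0.0
    compliant = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except (TypeError, ValueError):
            continue  # not valid JSON, counts as non-compliant
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            compliant += 1
    return compliant / len(outputs)


def median_length_shift(baseline_outputs, current_outputs):
    """Relative change in median output length versus a baseline window."""
    base = median(len(t) for t in baseline_outputs)
    current = median(len(t) for t in current_outputs)
    if base == 0:
        raise ValueError("baseline median length is zero")
    return (current - base) / base
```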

5. Hallucination Rate Tracking

For workflows that generate factual content or cite sources, hallucination monitoring is essential.

The two-layer approach:

Automated consistency checking: For RAG systems, check that cited passages actually exist in the retrieved documents. For extraction workflows, compare extracted values against source data. These checks can be automated and run on 100% of outputs.
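
For the RAG case, a simple version of this check is a normalized substring match between each quoted passage and the retrieved documents. This is a sketch that assumes citations are verbatim quotes; paraphrased citations would need fuzzy or embedding-based matching instead. The function names are hypothetical.

```python
import re


def _normalize(text):
    """Collapse whitespace and lowercase so formatting differences don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_citations(cited_passages, retrieved_documents):
    """Return the cited passages that appear in none of the retrieved documents."""
    haystacks = [_normalize(doc) for doc in retrieved_documents]
    return [
        passage for passage in cited_passages
        if not any(_normalize(passage) in haystack for haystack in haystacks)
    ]
```

An empty result means every citation was found; a non-empty one can be logged against the call record and counted toward the hallucination rate.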

Sampled LLM-as-judge evaluation: Use a separate LLM to evaluate a sample of outputs for factual consistency. Prompt it to act as a skeptical reviewer: "Does this output make any claims not supported by the provided context?" Track the flag rate over time.
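
A sketch of the flag-rate calculation, with the judge model injected as a plain callable so the example stays independent of any provider SDK. The template wording beyond the quoted question, the YES/NO answer convention, and all names here are assumptions.

```python
JUDGE_TEMPLATE = (
    "You are a skeptical reviewer.\n"
    "Context:\n{context}\n\n"
    "Output:\n{output}\n\n"
    "Does this output make any claims not supported by the provided context? "
    "Answer with exactly one word: YES or NO."
)


def build_judge_prompt(context, output):
    """Render the judge prompt for one sampled output."""
    return JUDGE_TEMPLATE.format(context=context, output=output)


def flag_rate(samples, judge):
    """Fraction of sampled outputs the judge flags as unsupported.

    `samples` is a list of dicts with 'context' and 'output' keys.
    `judge` is any callable that takes a prompt string and returns the
    judge model's text reply (hypothetical; wrap your own client here).
    """
    if not samples:
        return 0.0
    flagged = 0
    for sample in samples:
        verdict = judge(build_judge_prompt(sample["context"], sample["output"]))
        if verdict.strip().upper().startswith("YES"):
            flagged += 1
    return flagged / len(samples)
```

Recording the flag rate per day or per prompt version gives you the trend line to watch.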

6. Model Drift Detection

Model behavior changes over time without any action on your end. Providers update underlying models. The distribution of user inputs shifts. Upstream data sources change. Drift is slow and hard to notice until it's significant.

The fixed eval set approach: Run a fixed set of 100-200 representative inputs through your workflow weekly. Compare outputs against expected outputs using LLM-as-judge scoring. Track the score over time. A sustained decline signals drift — before users notice.
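
A minimal sketch of the "sustained decline" test, assuming you store one aggregate judge score per weekly run. The baseline window, the number of recent weeks, and the tolerance are illustrative defaults to tune, not recommendations from the text.

```python
from statistics import mean


def weekly_score(judge_scores):
    """Aggregate per-input judge scores from one eval run into a single number."""
    return mean(judge_scores)


def sustained_decline(weekly_scores, baseline_weeks=4, recent_weeks=3, tolerance=0.05):
    """True when each of the most recent runs scores below the baseline by more
    than the tolerance. A single bad week does not trigger it."""
    if len(weekly_scores) < baseline_weeks + recent_weeks:
        return False  # not enough history to compare yet
    baseline = mean(weekly_scores[:baseline_weeks])
    recent = weekly_scores[-recent_weeks:]
    return all(score < baseline - tolerance for score in recent)
```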

7. User Feedback Integration

Every production AI workflow should have at minimum a thumbs-up/thumbs-down button on outputs. Thumbs-down feedback should flow into three places: a review queue, your eval dataset (as negative examples), and a quality metric (the percentage of outputs flagged).

Watch the negative-feedback rate as a signal. If it's stable at 0.8% and suddenly rises to 3%, something changed. Detect it before a qualitative complaint reaches your inbox.
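
A rough sketch of that check: compare the current window's thumbs-down rate with a stored baseline. The function name, the 2x ratio, and the minimum sample size are assumptions chosen for illustration.

```python
def feedback_spike(baseline_rate, negative_count, total_count, ratio=2.0, min_total=200):
    """True when the current negative-feedback rate is at least `ratio` times the baseline.

    Windows with fewer than `min_total` rated outputs are ignored, since a
    handful of ratings makes the rate too noisy to act on.
    """
    if total_count < min_total:
        return False
    current_rate = negative_count / total_count
    return current_rate >= ratio * baseline_rate
```

With a baseline of 0.8%, a window showing 30 flags out of 1,000 rated outputs (3%) triggers the check, while 9 out of 1,000 (0.9%) does not.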

The Minimum Stack Before Launch

1. LLM call tracing with prompt version tagging — required before launch

2. Cost per request tracking with daily budget alert — required before launch

3. p95 latency alerting — required before launch

4. User feedback capture (two-click rating, review queue) — within first week

5. Sampled human quality review (1-5% of outputs, weekly) — within first two weeks

6. Fixed eval set running on a weekly schedule — within first month

This setup is meant to catch most production issues before they become user-visible. It also maps directly to the 14 controls in the Observability & Monitoring dimension of the production readiness checklist.

Assess your AI workflow to see how it scores across all 9 dimensions.