115 items · 9 dimensions · Grounded in NIST, AWS, Microsoft & Google frameworks

AI Workflow Readiness Checklist

This is the checklist our team runs through before any AI system goes to production. Each item is scored ✅ Done (2 pts) · ⚠️ Partial (1 pt) · ❌ Missing (0 pts); items marked N/A are excluded from the total.

85–100%: Production-ready
65–84%: Fragile, proceed with caution
40–64%: High risk, significant gaps
<40%: Not production-ready
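
The scoring rule and bands above can be made concrete with a short sketch. It is an illustration only, not ShipSmith's implementation: the names `readiness_score` and `readiness_band`, the status labels, and the choice to compare the unrounded percentage against each band's lower bound are all assumptions.

```python
from typing import Iterable, Optional

# Points per status, as stated in the scoring legend. N/A items are excluded.
POINTS = {"done": 2, "partial": 1, "missing": 0}


def readiness_score(statuses: Iterable[str]) -> Optional[float]:
    """Return the percentage of available points earned, ignoring N/A items."""
    scored = [s for s in statuses if s != "n/a"]
    if not scored:
        return None  # nothing applicable, so no score can be computed
    earned = sum(POINTS[s] for s in scored)
    return 100.0 * earned / (2 * len(scored))


def readiness_band(pct: float) -> str:
    """Map a percentage to the band names used in this checklist."""
    if pct >= 85:
        return "Production-ready"
    if pct >= 65:
        return "Fragile"
    if pct >= 40:
        return "High risk"
    return "Not production-ready"


if __name__ == "__main__":
    example = ["done"] * 10 + ["partial"] * 2 + ["missing"] * 2 + ["n/a"]
    pct = readiness_score(example)
    print(f"{pct:.1f}% -> {readiness_band(pct)}")
```

Because N/A items drop out of both the numerator and the denominator, a dimension with few applicable items is not penalised for the ones that do not apply.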

Want ShipSmith to run this automatically?

Connect your repo and we'll score every AI workflow against these 115 controls — no manual work required.

How we conduct the assessment

Pre-work call

30 min

We align on scope, identify which dimensions apply to your situation, and share a preparation checklist.

Document review

2–3 days

You share relevant artefacts. We review asynchronously and pre-score what we can from documentation alone.

Guided review session

2–3 hrs

We walk through each dimension together. You answer questions, we score in real time, flag gaps, and capture context.

Scored report

2 days after

You receive a full scored report with a dimension-by-dimension breakdown and prioritised recommendations ranked by ROI and effort.

What to prepare

Data Foundation

Data source inventory, pipeline docs, data quality reports, sample schema, lineage documentation

Model & Architecture

Model cards, prompt catalog, model registry records, benchmarking results, customisation decision log

Evaluation & QA

Eval dataset samples, automated eval pipeline, fairness test results, red-team exercise report

Observability & Monitoring

Observability dashboards, trace samples, alert configurations, compliance audit trail

Resilience & Production Engineering

Architecture diagram, load test results, rollback runbook, canary deployment config, CT pipeline docs

Security & Compliance

Access control policy, compliance certifications, incident response plan, RAI review records

Cost Management

Cost dashboards, token budget configs, model routing policy, cost model at scale

Adoption & Change Management

Org chart (AI team), runbook, adoption metrics, change management plan, Responsible AI scorecard

AI Governance & Risk Management

Org AI policy document, impact assessment records, named risk owner, vendor AI contracts, incident log

Grounded in established standards

Our checklist is not proprietary — it synthesises six established industry frameworks into a single actionable assessment. Each item is traceable back to a specific control or best practice in at least one of these sources.

NIST AI RMF

2023

Four core functions: GOVERN, MAP, MEASURE, MANAGE. The gold standard for AI risk management as an organisational discipline.

Risk Management

AWS GenAI Lens

2025

Best practices for LLM/RAG/agent systems — excessive agency controls, prompt catalog governance, multi-agent tracing, bias drift monitoring.

GenAI Architecture

AWS ML Lens

2025

Model registry gating, data poisoning protection, environment parity, and feature attribution drift monitoring.

ML Operations

Microsoft RAI

2022

Fairness evaluation (Fairlearn), model cards, error cohort analysis, content safety layers, and Discover-Protect-Govern deployment framework.

Responsible AI

Google MLOps

2021

Three maturity levels (0→2) covering training-serving skew, feature stores, continuous training pipelines, and ML pipeline component testing.

MLOps Maturity

OWASP LLM Top 10

2023

The ten most critical security risks for LLM applications — including prompt injection (#01), excessive agency (#08), and supply chain vulnerabilities (#05).

LLM Security

Data Foundation

Google MLOps Maturity Model · AWS ML Lens (MLOPS06) · NIST AI RMF MAP 2.1

Max 28 pts

01. Data sources are inventoried and documented
02. Data quality has been assessed: error rates, completeness, and freshness are measured
03. A data pipeline exists to feed the AI system (not manual exports)
04. The pipeline handles failures gracefully
05. Data freshness is monitored: stale data is flagged
06. PII and sensitive data are identified and handled appropriately
07. Document ingestion handles real-world messiness (scanned PDFs, mixed languages, unusual encodings)
08. There is a process to handle new data sources as the business evolves
09. Ground-truth labeled data exists to evaluate AI accuracy against
10. Data lineage is tracked: you can trace what data produced a given AI output
11. Training-serving skew is detected: features at inference are verified to be computed identically to training
12. Training data provenance is auditable: chain of custody from source to model is documented
13. Training data has been assessed for demographic underrepresentation and label imbalance
14. Anomaly detection runs on incoming training data to catch distribution shifts or poisoning attempts
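
Items 02 and 05 (completeness and freshness) lend themselves to a small illustration. The sketch below is one possible shape for such checks, not a prescribed method: the function names, the dictionary-based record format, and the two-day staleness threshold in the usage example are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Sequence


def stale_sources(
    last_updated: Dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of sources whose last update is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


def completeness(records: Sequence[dict], required_fields: Sequence[str]) -> float:
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    sources = {
        "crm": datetime(2024, 1, 9, tzinfo=timezone.utc),
        "wiki": datetime(2024, 1, 1, tzinfo=timezone.utc),
    }
    print(stale_sources(sources, timedelta(days=2), now=now))
    print(completeness([{"id": 1, "text": "a"}, {"id": 2, "text": ""}], ["id", "text"]))
```

In practice the flagged source names would feed an alert rather than a print statement; the point is that freshness and completeness become numbers that can be tracked over time.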

Model & Architecture

Google MLOps Maturity Model · AWS GenAI Lens (GENREL04) · Microsoft Responsible AI Standard

Max 30 pts

01. The model was selected by benchmarking candidates against the actual use case
02. The model in production is the right size for the task
03. System prompts are version-controlled and reviewed like code
04. Prompt changes go through a review/approval process before hitting production
05. Retrieval strategy (for RAG) has been benchmarked, not just left at defaults
06. Reranking is used to improve retrieval quality
07. The architecture handles the 'lost in the middle' problem (weaker recall of content placed in the middle of a long context)
08. Context window usage is monitored
09. There is a clear upgrade path documented for when a better model is released
10. Fine-tuning has been considered and either done or consciously ruled out
11. A model card exists per deployed model, covering intended use, limitations, performance, and ethical considerations
12. A prompt catalog exists with versioning, test results per version, and rollback capability
13. A model registry gates deployment: no model reaches production without a registered, approved artifact
14. Vendor/model lock-in risk is assessed: deprecation policy, switching cost, and migration path are documented
15. The customisation decision (prompt vs RAG vs fine-tune) is documented with rationale, not just implicit
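
Item 12 (a prompt catalog with versioning, per-version test results, and rollback) can be pictured with a minimal in-memory sketch. The `PromptCatalog` and `PromptVersion` classes and their fields are hypothetical; a real catalog would persist to version control or a database and tie `eval_score` to an actual eval run.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    eval_score: float  # test result recorded for this version (hypothetical metric)
    approved_by: str   # reviewer who signed off on the change


class PromptCatalog:
    """Keeps every registered prompt version and a history of promotions."""

    def __init__(self) -> None:
        self._versions: List[PromptVersion] = []
        self._promoted: List[int] = []  # version numbers, most recent last

    def register(self, text: str, eval_score: float, approved_by: str) -> PromptVersion:
        entry = PromptVersion(len(self._versions) + 1, text, eval_score, approved_by)
        self._versions.append(entry)
        return entry

    def promote(self, version: int) -> None:
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"unknown prompt version {version}")
        self._promoted.append(version)

    def rollback(self) -> PromptVersion:
        """Drop the current promotion and return to the previously promoted version."""
        if len(self._promoted) < 2:
            raise RuntimeError("no earlier promoted version to roll back to")
        self._promoted.pop()
        return self.active

    @property
    def active(self) -> PromptVersion:
        if not self._promoted:
            raise RuntimeError("no prompt version has been promoted")
        return self._versions[self._promoted[-1] - 1]


if __name__ == "__main__":
    catalog = PromptCatalog()
    catalog.register("You are a support assistant.", 0.82, "alice")
    catalog.register("You are a concise support assistant.", 0.79, "bob")
    catalog.promote(1)
    catalog.promote(2)
    print(catalog.rollback().version)
```

Keeping registration separate from promotion is a deliberate choice in this sketch: a version can exist with its test results on record without ever reaching production.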

Evaluation & QA

Microsoft Responsible AI (Fairlearn) · AWS GenAI Lens (GENOPS01) · NIST AI RMF MEASURE 2.5-2.6

Max 30 pts

01. A curated eval dataset exists (real inputs paired with expected outputs)
02. The eval dataset is large enough to be statistically meaningful (50–100+ examples)
03. Evals run automatically before any deployment
04. Evals block deployment if scores drop below a defined threshold
05. LLM-as-judge scoring is used for open-ended outputs
06. Adversarial inputs are included in the eval set
07. RAG-specific metrics are tracked: faithfulness and context recall
08. The eval dataset is updated regularly with real production queries the system got wrong
09. A/B evaluation is used when testing model upgrades or prompt changes
10. Hallucination rate is measured and tracked over time
11. The eval dataset is stratified by user persona and edge-case cohort, and coverage per stratum is tracked
12. Error rates are analysed across user subgroups to identify which cohorts the model underserves
13. A bias/fairness evaluation gate exists: quantified disparity metrics must pass a threshold before deployment
14. New model versions are tested in shadow mode against real traffic before any user exposure
15. A red-team exercise is conducted by a separate team before first production deployment and after major changes
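
Items 03 and 04 describe an eval that runs before deployment and blocks it below a threshold. A rough sketch of that gate follows. The `predict` callable stands in for whatever system is under test, exact-match comparison is a simplifying assumption (open-ended outputs would need the LLM-as-judge scoring from item 05), and the 0.8 threshold is purely illustrative.

```python
from typing import Callable, Sequence, Tuple

EvalSet = Sequence[Tuple[str, str]]  # (input, expected output) pairs


def pass_rate(eval_set: EvalSet, predict: Callable[[str], str]) -> float:
    """Share of eval examples where the prediction matches the expected output."""
    if not eval_set:
        raise ValueError("eval set is empty")
    passed = sum(
        1
        for question, expected in eval_set
        if predict(question).strip().lower() == expected.strip().lower()
    )
    return passed / len(eval_set)


def deployment_allowed(score: float, threshold: float) -> bool:
    """Gate: deployment proceeds only if the score meets the threshold."""
    return score >= threshold


if __name__ == "__main__":
    # Toy stand-in for the system under test.
    canned = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Saturn"}
    eval_set = [
        ("capital of France?", "paris"),
        ("2 + 2?", "4"),
        ("largest planet?", "Jupiter"),
        ("colour of the sky?", "blue"),
    ]
    score = pass_rate(eval_set, lambda q: canned.get(q, ""))
    print(score, deployment_allowed(score, threshold=0.8))
```

In a CI pipeline, a `False` result would typically translate into a non-zero exit code so the deploy step never runs.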

Observability & Monitoring

AWS GenAI Lens (GENOPS02-GENOPS03) · AWS ML Lens (MLOPS06) · Microsoft Responsible AI (Accountability)

Max 28 pts

01. Every LLM call is logged with input, output, token count, latency, model, and cost
02. Multi-step workflows have span-level tracing
03. Cost per request is tracked and attributed per user/workflow/team
04. Latency is monitored at p50, p90, and p95, not just the average
05. Error rates are tracked, with alerts on anomalies
06. User feedback signals are captured and tied back to specific traces
07. There is a dashboard for business stakeholders (accuracy, usage, cost)
08. There is a separate technical dashboard for engineering (latency, error rates, token costs)
09. Semantic drift is monitored: alerts fire when the production query distribution shifts
10. There is a defined SLA and a process for breach notification
11. Multi-agent traces capture all layers: LLM calls, tool calls, retrieval queries, guardrail events
12. Bias drift is monitored in production: a fairness baseline is set at deploy time, with alerts on threshold breach
13. A compliance audit trail exists: every model/prompt/config change is logged with who authorised it and when
14. Guardrail trigger events are logged and reviewed: content safety activations are tracked as a separate metric
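
Items 01 and 04 can be illustrated together: a per-call log record, and latency percentiles computed from those records. The `LLMCallRecord` fields mirror the list in item 01, but the class itself and the nearest-rank percentile method are assumptions; a monitoring stack would normally compute percentiles for you.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class LLMCallRecord:
    trace_id: str
    model: str
    prompt: str
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


def latency_summary(records: Sequence[LLMCallRecord]) -> Dict[str, float]:
    latencies = [r.latency_ms for r in records]
    return {f"p{p}": percentile(latencies, p) for p in (50, 90, 95)}


if __name__ == "__main__":
    records = [
        LLMCallRecord(f"t{i}", "example-model", "q", "a", 100, 50, 100.0 * i, 0.001)
        for i in range(1, 21)
    ]
    print(latency_summary(records))
```

The tail percentiles matter because a handful of slow calls can leave the average looking healthy while a meaningful share of users wait far longer.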

Resilience & Production Engineering

Google MLOps Level 1-2 · AWS ML Lens (MLREL04) · AWS GenAI Lens (GENOPS05)

Max 26 pts

01. Model fallbacks are implemented: if the primary model fails, a fallback handles the request
02. Retry logic is implemented with exponential backoff for transient errors
03. Retryable errors (429, 503) are distinguished from non-retryable ones (400)
04. Circuit breakers are in place: if the provider error rate exceeds a threshold, traffic shifts to the fallback
05. The system degrades gracefully when the LLM is unavailable
06. Streaming is implemented where applicable
07. Semantic or prompt caching is used
08. Load testing has been done at 5–10x expected peak traffic
09. There is a rollback procedure for prompt/model changes that go wrong
10. Async processing is used for long-running tasks
11. Canary or blue/green deployment is in place for model updates, with automatic rollback triggers
12. A Continuous Training (CT) pipeline exists: retraining is triggered automatically by data drift or performance thresholds
13. ML pipeline stages have unit tests: data validation, feature transforms, and serving code are tested independently of model accuracy
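
Items 01 to 03 (fallback, exponential backoff, and retryable versus non-retryable errors) fit into one short sketch. `ProviderError` is a hypothetical exception standing in for whatever a real client library raises, and the attempt count, base delay, and jitter range are illustrative defaults rather than recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = {429, 503}  # rate limited / service unavailable


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ProviderError as err:
            is_last = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or is_last:
                raise  # e.g. a 400 will not succeed on retry
            sleep(base_delay * 2**attempt + random.uniform(0, base_delay))
    raise AssertionError("unreachable")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T], **retry_kwargs) -> T:
    """Use the fallback only when the primary fails with a provider-side error."""
    try:
        return call_with_retry(primary, **retry_kwargs)
    except ProviderError as err:
        if err.status not in RETRYABLE_STATUS:
            raise  # a malformed request would fail on the fallback too
        return call_with_retry(fallback, **retry_kwargs)
```

Passing `sleep` as a parameter keeps the function testable without real delays. A circuit breaker (item 04) would sit one level above this, tracking the error rate across calls rather than within one.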

Security & Compliance

OWASP LLM Top 10 · AWS GenAI Lens (GENSEC02-GENSEC06) · Microsoft Responsible AI (Privacy & Security)

Max 28 pts

01. API keys are stored in a secrets manager, not in env files or code
02. Prompt injection risks are mitigated: system context is separated from user input
03. User inputs are sanitized before being used in retrieval or tool calls
04. Outputs are validated post-generation for PII leakage and policy violations
05. Agentic tool execution is sandboxed with appropriate permissions
06. Agent tool calls are logged with the reasoning that led to them
07. Data residency requirements are met
08. The system has been reviewed for applicable regulatory compliance (GDPR, HIPAA, SOC 2, etc.)
09. There is a documented incident response plan for AI-specific failures
10. Access control is role-based: not every user can access every capability
11. Agentic systems have explicit permission boundaries: minimum required permissions, max action steps, and human confirmation before irreversible actions (OWASP LLM #08: Excessive Agency)
12. A dedicated content safety layer screens both inputs and outputs as a separate system component, with its own monitoring and logging
13. Training data supply chain is protected: dependency scanning for ML libraries, provenance verification for training data sources
14. A Responsible AI review is conducted before deploying any new AI capability, with cross-functional sign-off including non-engineers
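
Item 11 names three permission boundaries for agents: a minimal tool allowlist, a cap on action steps, and human confirmation before irreversible actions. One way these could be enforced in a single check is sketched below. The `AgentGuard` class, the tool names, and the `confirm` callback are all hypothetical; this is not an API from any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


class PermissionDenied(Exception):
    pass


@dataclass
class AgentGuard:
    allowed_tools: FrozenSet[str]        # minimum set of tools the agent needs
    irreversible_tools: FrozenSet[str]   # tools that require human sign-off
    max_steps: int                       # hard cap on actions per invocation
    confirm: Callable[[str, dict], bool] # asks a human; returns True if approved
    steps_taken: int = 0

    def authorize(self, tool: str, args: dict) -> None:
        """Raise PermissionDenied unless this tool call is within bounds."""
        if self.steps_taken >= self.max_steps:
            raise PermissionDenied("step limit reached")
        if tool not in self.allowed_tools:
            raise PermissionDenied(f"tool '{tool}' is not on the allowlist")
        if tool in self.irreversible_tools and not self.confirm(tool, args):
            raise PermissionDenied(f"human confirmation declined for '{tool}'")
        self.steps_taken += 1


if __name__ == "__main__":
    guard = AgentGuard(
        allowed_tools=frozenset({"search", "delete_record"}),
        irreversible_tools=frozenset({"delete_record"}),
        max_steps=3,
        confirm=lambda tool, args: False,  # stand-in for a real approval prompt
    )
    guard.authorize("search", {"q": "refund policy"})
    try:
        guard.authorize("delete_record", {"id": 42})
    except PermissionDenied as err:
        print(err)
```

The agent loop would call `authorize` before executing each tool call, so a denied call never reaches the tool. Logging each decision alongside the agent's stated reasoning would also cover item 06.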

Cost Management

AWS GenAI Lens (GENOPS05) · Google MLOps Level 1

Max 22 pts

01. Token budgets are defined per request type
02. Monthly AI spend is monitored with budget alerts
03. Cost per unit of business value is calculated and tracked
04. Model routing is used: simple tasks go to cheap models, complex tasks to more expensive ones
05. Batch inference is used for offline/non-latency-sensitive workflows
06. Prompt caching is enabled for stable large contexts
07. Runaway agent loops are prevented: hard step limit and cost cap per invocation
08. The cost impact of a model upgrade is assessed before deploying
09. Context window usage is optimized
10. There is a documented cost model for different usage scales
11. The cost model explicitly includes retraining pipeline compute: retraining frequency and cost per run are estimated
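
Items 04 and 07 (model routing and a per-invocation cost cap) can be sketched in a few lines. Everything named here is made up for illustration: the model names, the per-1K-token prices, and the "simple"/"complex" labels are placeholders, and real prices should come from your provider's current price list.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES_PER_1K: Dict[str, Tuple[float, float]] = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.01, 0.03),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request in USD under the placeholder prices."""
    in_price, out_price = PRICES_PER_1K[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1000


def route(task_complexity: str) -> str:
    """Send simple tasks to the cheaper model, everything else to the larger one."""
    return "small-model" if task_complexity == "simple" else "large-model"


class BudgetExceeded(Exception):
    pass


class InvocationBudget:
    """Cost cap for a single agent invocation."""

    def __init__(self, cost_cap_usd: float) -> None:
        self.cost_cap_usd = cost_cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a spend, refusing it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.cost_cap_usd:
            raise BudgetExceeded(
                f"cap of ${self.cost_cap_usd:.4f} would be exceeded"
            )
        self.spent_usd += cost_usd


if __name__ == "__main__":
    model = route("simple")
    cost = estimate_cost(model, input_tokens=2000, output_tokens=500)
    budget = InvocationBudget(cost_cap_usd=0.01)
    budget.charge(cost)
    print(model, round(cost, 5), round(budget.spent_usd, 5))
```

Checking the cap before recording the spend means a runaway loop stops at the call that would break the budget, not after it. The same `estimate_cost` function can be reused to compare a candidate model upgrade against current spend (item 08).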

Adoption & Change Management

NIST AI RMF GOVERN 5.1 · Microsoft Responsible AI (Transparency)

Max 26 pts

01. There is a named owner responsible for the AI system's business outcomes
02. The specific daily workflows that changed are documented
03. Users who interact with the system were involved in its design
04. Training and onboarding for the system exists and has been completed
05. There is a feedback channel for users to report AI errors
06. Reported errors are acted on: there is a process to review and incorporate feedback
07. The system's adoption rate is measured (daily active users / eligible users)
08. Leadership has visibility into adoption metrics
09. There is a plan for expanding scope or user base once stable
10. The value delivered has been communicated internally
11. Users are informed they are interacting with AI, what it can/cannot do, and how to request human review of a decision
12. An AI-specific incident taxonomy exists (hallucination, bias event, privacy leak, adversarial attack), with severity levels and escalation paths separate from general engineering incidents
13. A Responsible AI scorecard is shared with non-technical stakeholders at least quarterly, covering performance trends, fairness results, incidents, and planned improvements

AI Governance & Risk Management

NIST AI RMF (GOVERN 1.1-6.1) · AWS GenAI Lens (multiple BPs) · Microsoft Responsible AI Standard v2

Max 12 pts

01. A written organisational AI policy exists, covering acceptable use, prohibited use cases, risk tolerance, and review cadence, and is owned by a named executive
02. An AI impact assessment is conducted before deploying any new AI capability, documenting who is affected, what harms could occur, and what mitigations are in place
03. A named AI risk owner exists who is distinct from the product/business owner and has authority to pause deployment
04. The feedback-to-improvement loop is a documented, owned process, specifying who reviews errors, on what cadence, and what triggers a prompt update vs. retraining vs. architecture change
05. A model deprecation policy exists, defining minimum performance floor, sunset timeline, data deletion obligations, and knowledge transfer requirements
06. Third-party AI vendors are assessed: data retention policies, training on your data, SLA guarantees, contractual data processing agreements, and switching cost are documented and reviewed at contract renewal

Skip the manual checklist.
Let ShipSmith score your workflows automatically.

Connect your repo and our AI agent discovers every AI workflow and scores it against these 115 controls — in under 10 minutes. Free for your first workflow.
