AI Deliverables Quality Control: How Agencies Maintain Standards

How agency teams should QA AI-generated deliverables: rubrics, evaluation harnesses, editorial workflows, and the systems that prevent quality drift.

Bilal Azhar

The fastest way to lose a client in 2026 is to ship an AI-generated deliverable that contains a hallucinated statistic, a misattributed quote, or a fabricated case study. The damage is not "your work was bad." The damage is "your work was confidently and professionally bad in a way that would have been caught by a sober human two years ago." Gartner's 2026 survey of marketing services buyers found that 41% had received a deliverable from an agency partner containing AI-generated factual errors in the prior 12 months, and 67% of those buyers said the incident materially damaged their trust in the agency.

This guide is the QA framework for AI-generated agency deliverables. It covers the specific failure modes you should be testing for, the checklists that catch them before delivery, the brand voice and factual accuracy workflows that prevent drift, the client disclosure posture that protects the relationship, and the operational discipline that scales QA as your AI-augmented production volume grows. It assumes you are already producing AI-assisted work and that the question is no longer "should we?" but "how do we not embarrass ourselves?"

Key Takeaways:

  • Run every AI-generated deliverable through a four-pillar QA: hallucination check, source verification, brand voice consistency, and structural / formatting integrity.
  • Maintain a per-client brand voice file (250 to 500 words of explicit do/don't) and use it as the QA reference, not the agency's general taste.
  • Verify every factual claim against a primary source; if you cannot verify, remove or reword. Never "assume" the model is right.
  • Disclose AI assistance proactively in your MSA or onboarding, not reactively when a client asks; the trust gain is asymmetric.
  • Build a 5% to 10% sampling audit on top of 100% editorial review; the audit catches drift after model or prompt updates that editorial review misses.

The Six Failure Modes You Are Testing For

AI-generated deliverables fail in patterns that human-generated deliverables do not. The QA framework needs to be specific to those patterns, not adapted from generic editorial review.

| Failure Mode | What It Looks Like | Why It's Worse Than a Human Error |
| --- | --- | --- |
| Hallucination | Made-up statistics, fake quotes, invented studies, nonexistent sources | Looks plausible, cites a plausible-sounding source, evades casual review |
| Source misattribution | Real statistic credited to wrong author / publication / date | Even worse than fabrication; introduces legal exposure |
| Brand voice drift | Generic "default model" voice instead of client's voice | Reads as a template; clients feel the agency phoned it in |
| Structural / factual inconsistency | Two sections of the same deliverable contradict each other | Suggests no human ever read the full document |
| Stale knowledge | Statistics or claims from before the model's cutoff date presented as current | Particularly common in fast-moving industries |
| Bland generic phrasing | Output reads like every other AI-generated piece on the topic | Erodes the perceived value of the deliverable category overall |

Each of these is catchable with the right checklist applied by a human reviewer. None of them is reliably caught by another LLM scoring the output on its own, though sampled automated scoring is a useful addition. The frame: AI-generated content needs at least one human who has read it end to end, with a specific list of things to check.

The Four-Pillar QA Framework

Run every AI-generated deliverable through these four pillars before client delivery. The framework is designed to fit in 23 to 40 minutes per 2,000-word piece, a QA budget small enough that AI-assisted production stays profitable for most agencies.

Pillar 1: Hallucination Check (10 to 15 minutes)

This is the highest-stakes pillar. The checklist:

  • [ ] Every named person, company, study, or publication is verified to exist.
  • [ ] Every numeric claim (percentage, dollar amount, count) is traceable to a primary source.
  • [ ] Every direct quote is verified to the cited speaker, ideally with the original source link.
  • [ ] Every "X% of companies say Y" claim has a real survey or report behind it, with a publication date.
  • [ ] No claim is supported only by "a recent study showed" without naming the study.

The fastest way to run this: highlight every fact-claim in the draft and look up each one. If a claim cannot be verified in under 2 minutes, it goes. The model is almost never the right source.

Pillar 2: Source Verification (5 to 10 minutes)

Distinct from hallucination check — this is about ensuring the sources you do cite are correct, current, and authoritative.

  • [ ] Every cited URL resolves to the claimed content (link rot is real).
  • [ ] Every cited statistic matches what the source actually says (paraphrasing can drift meaning).
  • [ ] Source publication dates are recent enough to be relevant (2 years or newer for fast-moving topics).
  • [ ] Sources are authoritative for the claim being made (a vendor blog is not authoritative for industry benchmarks).
  • [ ] No source is a content farm, AI-generated site, or aggregator without primary attribution.
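
The first item on this checklist is the easiest to script. The sketch below assumes the draft is plain text or Markdown with inline URLs; `extract_urls` and `check_url` are hypothetical helper names, not part of any tool mentioned here. A passing result only shows the link is alive. Whether the page actually supports the claim is still the editor's call.

```python
import re
import urllib.error
import urllib.request

# Heuristic: a URL runs until whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

def extract_urls(draft: str) -> list[str]:
    """Return unique URLs in order of first appearance, minus trailing punctuation."""
    seen: dict[str, None] = {}
    for match in URL_PATTERN.findall(draft):
        seen.setdefault(match.rstrip(".,;:"), None)
    return list(seen)

def check_url(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> str:
    """Return 'ok' or a short failure reason. `opener` is injectable for offline tests."""
    request = urllib.request.Request(url, headers={"User-Agent": "qa-link-check"})
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        return f"http {exc.code}"
    except (urllib.error.URLError, TimeoutError) as exc:
        return f"unreachable: {exc}"
    return "ok" if 200 <= status < 400 else f"http {status}"

if __name__ == "__main__":
    sample = "Benchmarks are summarized at https://example.com/report."
    for url in extract_urls(sample):
        print(url, check_url(url))
```

Some sites block scripted requests, so treat a failure as "check by hand" rather than "dead link".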

Pillar 3: Brand Voice Consistency (5 to 10 minutes)

Read against the client's brand voice file (see below). The checklist:

  • [ ] Sentence rhythm matches client norm (short and punchy vs. long and analytical).
  • [ ] Vocabulary level matches (technical, professional, conversational).
  • [ ] First/second/third person usage matches client convention.
  • [ ] Phrases on the "do not use" list are absent.
  • [ ] Phrases on the "use sparingly" list are not over-indexed.
  • [ ] Tone matches the client's positioning (authoritative vs. friendly vs. provocative).
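
The two phrase-list checks can be pre-screened before the editor reads for rhythm and tone. This is a rough sketch: `voice_lint` is a hypothetical helper, and the threshold of one use per 1,000 words for "use sparingly" phrases is an assumed default, not a figure from this guide. Tune it per client.

```python
import re

def phrase_counts(text: str, phrases: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-phrase occurrences of each phrase."""
    counts = {}
    for phrase in phrases:
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

def voice_lint(text, do_not_use, use_sparingly, max_per_1000_words=1.0):
    """Return human-readable findings for the editor to review."""
    words = max(len(text.split()), 1)
    findings = []
    # Any hit on the do-not-use list is a finding.
    for phrase, n in phrase_counts(text, do_not_use).items():
        if n:
            findings.append(f"banned phrase '{phrase}' appears {n}x")
    # Use-sparingly phrases are flagged only above the assumed rate threshold.
    for phrase, n in phrase_counts(text, use_sparingly).items():
        rate = n * 1000 / words
        if rate > max_per_1000_words:
            findings.append(f"'{phrase}' over-used: {n}x ({rate:.1f} per 1,000 words)")
    return findings
```

The script only finds exact phrases. Rhythm, vocabulary level, and tone still need a human reading against the brand voice file.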

Pillar 4: Structural and Formatting Integrity (3 to 5 minutes)

  • [ ] Heading hierarchy is correct (H1, H2, H3 used as structure, not styling).
  • [ ] Lists and tables are consistently formatted.
  • [ ] Internal links point to live URLs on the correct domain.
  • [ ] Word count is within 10% of brief.
  • [ ] No section is internally inconsistent with another.
  • [ ] Accessibility basics (alt text, link text, color contrast in visual deliverables).
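
Two of these items, word count and heading hierarchy, are mechanical enough to script. The sketch assumes the deliverable is available as Markdown with `#`-style headings; both function names are illustrative. Lines starting with `#` inside code fences would be misread as headings, so strip code blocks first if the deliverable contains any.

```python
import re

def word_count_ok(text: str, brief_words: int, tolerance: float = 0.10) -> bool:
    """True when the draft length is within +/- tolerance of the briefed length."""
    return abs(len(text.split()) - brief_words) <= tolerance * brief_words

def heading_problems(markdown: str) -> list[str]:
    """Flag headings that skip a level, e.g. an H2 followed directly by an H4."""
    problems = []
    previous = 0  # 0 means no heading seen yet
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if not match:
            continue
        level = len(match.group(1))
        title = match.group(2).strip()
        if level > previous + 1:
            where = f"H{previous}" if previous else "the start of the document"
            problems.append(f"H{level} '{title}' follows {where}")
        previous = level
    return problems
```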

Total time: 23 to 40 minutes per 2,000-word piece. The senior editor running this checklist is paid 65 to 95 USD per hour fully loaded; the QA pass costs roughly 25 to 65 USD per deliverable, which should be 4% to 8% of the deliverable's billable value. If your QA is taking longer than that, you have either an under-trained editor or a producer who needs better prompts.
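
The arithmetic behind that cost range, as a small sketch you can rerun with your own rates. The defaults are the figures from the paragraph above; `qa_share` is a hypothetical helper for checking a QA pass against a deliverable's billable value.

```python
def qa_cost_range(minutes=(23, 40), hourly_usd=(65, 95)):
    """Return the (low, high) cost in USD of one QA pass."""
    low = minutes[0] / 60 * hourly_usd[0]    # fastest pass at the lowest rate
    high = minutes[1] / 60 * hourly_usd[1]   # slowest pass at the highest rate
    return round(low, 2), round(high, 2)

def qa_share(qa_cost_usd, billable_usd):
    """QA cost as a fraction of the deliverable's billable value."""
    return qa_cost_usd / billable_usd

low, high = qa_cost_range()
print(f"QA pass costs {low:.2f} to {high:.2f} USD")  # 24.92 to 63.33
```

If `qa_share` comes out well above 0.08 for a service line, either the QA pass is running long or the deliverable is priced too low to carry it.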

Building the Per-Client Brand Voice File

The brand voice file is the reference document the editor checks against. It is not the same as the agency's general style guide; it is client-specific and lives in the client portal or wherever your team accesses client briefs.

A working brand voice file structure:

  1. Voice descriptors (3 to 5 adjectives): "Authoritative, direct, lightly skeptical, never preachy"
  2. Sentence rhythm: "Mix of short punchy sentences (8 to 12 words) and longer analytical ones (20 to 30 words). Avoid medium-length sentences without rhythm variation."
  3. Person and voice: "Second person ('you') for instructional content; first-person plural ('we') for opinion."
  4. Vocabulary level: "Educated professional. Assume B2B SaaS audience. Avoid jargon-heavy industry terms without unpacking them first."
  5. Three reference passages: 100 to 200 words each of writing the client considers exemplary, ideally from their own published work.
  6. Do-not-use list: Specific phrases, words, or patterns to avoid (e.g., "leveraging," "synergy," "in today's fast-paced world").
  7. Use-sparingly list: Phrases that are fine in moderation but ring as AI-generic when overused (e.g., "it's important to note," "moreover," "in conclusion").
  8. Hard constraints: Compliance, regulatory, or brand-safety rules (e.g., "never make first-person claims of patient outcomes" for healthcare clients).

Build this in the first 14 days of every retainer. Update it quarterly. Without a file like this, the editor is using "agency taste" as the reference, which produces inconsistency across editors and across deliverables.

The Hallucination Hunter's Workflow

Hallucination is the failure mode that produces the worst client incidents. A dedicated workflow:

  1. Extract claims into a spreadsheet. Every fact-statement in the draft, one row each. For a 2,000-word piece, expect 20 to 40 claims.
  2. Categorize by claim type. Statistic, quote, study citation, named entity, date, methodology.
  3. Verify each row against a primary source. Real publication, real study, real quote in real context.
  4. Log the source. Even claims that pass should be logged with their verification source — this builds an institutional reference library and protects you if a claim is challenged.
  5. Reword or remove unverifiable claims. No exceptions. "Probably true" is not a verification.
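
Steps 1 and 2 can be given a head start with a script that pre-fills the spreadsheet. The sketch below is a heuristic, not a fact-checker: the regular expressions and the `extract_claims` name are assumptions for illustration, and it will miss named entities and methodology claims, which the editor still has to add by hand.

```python
import csv
import io
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Heuristic patterns; the first one that matches decides the claim type.
CLAIM_PATTERNS = [
    ("quote", re.compile(r'["“][^"”]{15,}["”]')),
    ("study citation", re.compile(r"\b(study|survey|report|research)\b", re.IGNORECASE)),
    ("statistic", re.compile(r"\d+(\.\d+)?\s?%|\$\s?\d|\b\d{1,3}(,\d{3})+\b")),
    ("date", re.compile(r"\b(19|20)\d{2}\b")),
]

def extract_claims(draft: str) -> list[dict]:
    """Return one row per sentence that looks like a checkable fact-claim."""
    rows = []
    for sentence in SENTENCE_SPLIT.split(draft.strip()):
        for label, pattern in CLAIM_PATTERNS:
            if pattern.search(sentence):
                rows.append({"claim": sentence, "type": label,
                             "source": "", "status": "unverified"})
                break
    return rows

def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text ready to paste or import into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["claim", "type", "source", "status"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The empty `source` column is where step 4 happens: the editor fills it in for every row, including the claims that pass.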

For high-stakes content (legal, medical, financial, regulated industries), every claim must be verified by a domain reviewer in addition to an editor. The cost is 2x the standard QA but it is non-negotiable — the legal exposure on a fabricated medical statistic in a healthcare client's deliverable far exceeds any margin gain from AI-assisted speed.

As for which content categories carry the highest hallucination risk, McKinsey's research on generative AI rollout failures consistently flags three: regulated industries, citation-dense content (whitepapers, reports, long-form research), and any deliverable involving named third parties.

Brand Voice Drift Detection

Brand voice drift is sneaky. The first AI-generated deliverable for a client may match voice perfectly; the tenth may have drifted significantly without anyone noticing because each individual piece felt "fine."

A monthly drift audit:

  1. Pull the last 10 deliverables produced for the client.
  2. Score each on a 1-5 scale against the brand voice file across three dimensions: vocabulary, rhythm, tone.
  3. Plot the trend. If average scores have dropped more than 0.5 points over the period, the prompt template or producer needs recalibration.
  4. Pull 2 to 3 specific examples of drift and update the brand voice file's do-not-use list accordingly.

This audit takes 60 to 90 minutes per client per month and dramatically reduces the slow-creep failure mode that causes clients to send "this doesn't sound like us anymore" emails three months in.
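
The trend check in step 3 can be reduced to a few lines once the scores are recorded. "Dropped over the period" is open to interpretation; this sketch assumes it means the average of the earlier half of the deliverables compared with the average of the later half. `drift_report` is a hypothetical name.

```python
from statistics import mean

DIMENSIONS = ("vocabulary", "rhythm", "tone")

def drift_report(scores: list[dict], threshold: float = 0.5) -> dict:
    """scores: oldest first, one dict per deliverable with a 1-5 score per dimension.

    Assumption: drift = mean of the earlier half minus mean of the later half.
    """
    if len(scores) < 4:
        raise ValueError("need at least 4 scored deliverables")
    per_piece = [mean(s[d] for d in DIMENSIONS) for s in scores]
    half = len(per_piece) // 2
    earlier = mean(per_piece[:half])
    later = mean(per_piece[-half:])
    drop = earlier - later
    return {
        "earlier": round(earlier, 2),
        "later": round(later, 2),
        "drop": round(drop, 2),
        "recalibrate": drop > threshold,
    }
```

Run it per dimension as well if the combined average hides a problem, for example tone slipping while vocabulary holds steady.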

Client Disclosure Posture

Whether to disclose AI use is no longer the question. The question is when and how. The Gartner research above found that 78% of B2B buyers in 2026 assume their agency is using AI in production; the issue is whether the agency is honest about it.

The disclosure posture that builds trust:

| When | What to Disclose | How |
| --- | --- | --- |
| Onboarding / MSA | General use of AI tools in production; QA process; data handling | Written clause in MSA |
| Kickoff | Specific tools used for this engagement; client data posture | 1-page document or kickoff slide |
| Per-deliverable | Not required, but available on request | Internal documentation |
| Incident (error reaches client) | Full transparency: what happened, how QA missed it, what's fixed | Live conversation + written follow-up |

A working MSA clause: "Agency uses generative AI tools as part of its production workflow, including [specific tools]. All AI-generated content is reviewed by a human editor against agency QA standards before client delivery. Client data is handled according to the data processing addendum and is not used to train third-party models. Agency retains full responsibility for deliverable quality regardless of production tool."

The last sentence matters. AI assistance does not transfer accountability. If the work is wrong, the agency owns it.

For more on data privacy and AI in regulated industries, see our agency data privacy compliance guide and our healthcare marketing compliance deep-dive.

Sampled Automated Evaluation

100% editorial review catches individual defects. It misses systemic drift — the kind of degradation that happens after a model update, a prompt change, or a new producer joins the team. The fix is a sampled automated audit running on top of editorial review.

A working sampling protocol:

  • Sample 5% to 10% of weekly production output per service line.
  • Run a structured evaluation prompt through a different model than the one used for production.
  • Score against the same four-pillar rubric: hallucination risk, source quality, voice match, structure.
  • Track scores over time per producer, per client, per prompt template.
  • Alert when rolling 4-week average scores drop more than 0.4 points on any pillar.

The infrastructure for this is lightweight: a weekly script that pulls random deliverables from production, runs an evaluation prompt, and posts scores to a Slack channel or dashboard. Tools like Braintrust, Langfuse, or Helicone formalize this, but a 200-line Python script run weekly is enough for most agencies.
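
A skeleton of the two mechanical parts of that script, sampling and alerting, might look like the following. It is a sketch under assumptions: the evaluation call itself is left out because it depends on your model provider, and "drop" is read here as the latest 4-week average compared with the 4 weeks before it. Function names are illustrative.

```python
import random
from statistics import mean

def sample_deliverables(ids: list[str], rate: float = 0.05, seed=None) -> list[str]:
    """Pick a random sample at the given rate, always at least one deliverable."""
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))
    return random.Random(seed).sample(ids, k)

def drift_alerts(weekly_scores: dict[str, list[float]],
                 window: int = 4, max_drop: float = 0.4) -> list[str]:
    """weekly_scores maps pillar name -> weekly average scores, oldest first.

    Assumption: alert when the latest window's mean is more than max_drop
    below the mean of the window immediately before it.
    """
    alerts = []
    for pillar, scores in weekly_scores.items():
        if len(scores) < 2 * window:
            continue  # not enough history to compare two windows
        previous = mean(scores[-2 * window:-window])
        current = mean(scores[-window:])
        if previous - current > max_drop:
            alerts.append(
                f"{pillar}: rolling average fell {previous - current:.2f} "
                f"({previous:.2f} -> {current:.2f})"
            )
    return alerts
```

Keep one `weekly_scores` series per producer, per client, and per prompt template, so an alert points at a cause instead of a vague overall dip.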

When alerts fire, freeze production briefly and investigate. Usually the cause is a prompt template that aged badly, a model version change that shifted output style, or a new producer whose work needs calibration. Catching it within 2 weeks is the difference between a single edited deliverable and 40 clients receiving subtly-off work.

The Tooling Stack

A practical AI QA tooling stack for a 10 to 30 person agency:

| Layer | Tool Examples | Annual Cost (10-30 person agency) |
| --- | --- | --- |
| Editorial workflow | Notion, Coda, project management | 0 to 2,400 USD |
| Prompt library and version control | Git repo, PromptLayer, internal Notion | 0 to 1,800 USD |
| Sampled automated eval | Braintrust, Langfuse, Helicone, custom scripts | 1,200 to 12,000 USD |
| Fact-checking | Manual + model-assisted claim extraction | Mostly labor |
| Plagiarism / originality | Originality.ai, Copyscape | 600 to 3,600 USD |
| Accessibility | axe, WAVE, Lighthouse | Free |
| Brand voice reference | Stored in client portal per client | Included |

Total incremental tooling cost: 2,000 to 20,000 USD per year. The labor cost of the QA itself (the editorial pass and the spot-check audit) is the larger expense and typically runs 6% to 12% of AI-augmented production hours.

Handling Quality Incidents

When a defective deliverable reaches a client — and it will, eventually — the response framework:

  1. Acknowledge within 24 hours. Do not delay while you investigate.
  2. Investigate the root cause. Was it a prompt, a model update, an editor miss, a workflow gap? Be specific.
  3. Fix the systemic issue. A patched deliverable is not enough; the underlying cause must change.
  4. Communicate to the client. What happened, what was wrong, what is fixed, what you are doing differently going forward.
  5. Document in the QA log. Every incident, every root cause, every fix. Review quarterly for patterns.

The Harvard Business Review's customer retention research consistently shows that B2B clients who receive a transparent, accountable response to a service incident are more loyal afterward than clients who never experienced one. Hiding or minimizing the incident is the failure mode that loses the relationship.

Producer Training and Prompt Hygiene

The producer side of QA matters as much as the editor side. A few disciplines:

  • Prompt templates per client per content type. Documented, version-controlled, reviewed quarterly. No one-off prompts in production work.
  • Required brand voice file reference in every prompt. The producer is responsible for loading the right voice context, not the editor for repairing voice drift.
  • Source-required prompting. "Cite sources for every numeric claim, with a real URL" goes in every prompt template. The model is still capable of hallucinating citations, but the rate drops substantially.
  • Quarterly producer recalibration. Each producer runs through 3 to 5 reference deliverables per quarter and reviews their output against the brand voice file with a senior editor.

For deeper operational practices around AI-augmented production, see our agency automation guide.

Measuring QA Effectiveness

Track these metrics monthly per service line:

  • Average editorial pass rate (first-time approval): Should be 70%+ for mature workflows.
  • Average revision rounds per deliverable: Should be 1.2 to 1.5.
  • Client-reported defects: Tracked separately; investigate every one.
  • QA labor as % of production labor: Should sit at 8% to 15%.
  • Sampled audit score (rolling 4-week average) per pillar: Watch for trends.
  • Incidents (defective deliverable reaches client): Should be near zero; investigate root cause on every one.
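
If the QA log records review rounds and defects per deliverable, most of these metrics fall out of a short script. The sketch makes two assumptions worth checking against your own definitions: a deliverable approved on the first editorial pass counts as one review round, and QA labor share is measured in hours. The `Deliverable` record and `qa_metrics` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Deliverable:
    review_rounds: int                    # assumption: 1 = approved on the first editorial pass
    client_reported_defect: bool = False  # a defect the client found after delivery

def qa_metrics(items: list[Deliverable], qa_hours: float, production_hours: float) -> dict:
    """Summarize one service line for one month."""
    if not items or production_hours <= 0:
        raise ValueError("need at least one deliverable and positive production hours")
    n = len(items)
    return {
        "first_pass_rate": round(sum(d.review_rounds == 1 for d in items) / n, 3),
        "avg_review_rounds": round(sum(d.review_rounds for d in items) / n, 2),
        "client_reported_defects": sum(d.client_reported_defect for d in items),
        "qa_labor_share": round(qa_hours / production_hours, 3),
    }
```

Under that counting, a 70% first-pass rate with the rest approved on the second round gives an average of 1.3 rounds, inside the target range above.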

Rising production volume with falling rubric scores is the warning sign for AI agencies. Catch it early and you protect the brand. Miss it and the next quarterly buyer survey will surface defects you had no idea about.

Frequently Asked Questions

How much QA is enough for AI-generated content?

100% editorial review with the four-pillar checklist (23 to 40 minutes per 2,000-word piece) plus 5% to 10% sampled automated audit. High-stakes regulated content gets an additional domain review on every piece. Below this, you are taking on quality risk that will eventually surface as a client incident.

Can we use AI to QA AI?

Yes, for sampled audits — but not as the primary review. A separate model running the rubric against a sample of output catches systemic drift well. It misses individual hallucinations and brand voice nuances that a trained human editor catches. The two layers complement each other; neither replaces the other.

Who should own QA in an AI-augmented agency?

A senior editorial lead or a head of quality, reporting to the head of delivery or COO. They own the rubric, the brand voice files, the prompt library, the sampled audit infrastructure, and the producer training. In agencies under 20 people, this often sits with the founder or operations lead until volume justifies a dedicated role.

How do we disclose AI use to clients without scaring them?

Front-load disclosure in the MSA and onboarding, frame it as a productivity tool with mature QA wrapped around it, and emphasize that accountability remains with the agency. Most clients in 2026 already assume AI is part of agency production; they reward transparency, not the appearance of AI-free work.

What's the right way to handle a client who explicitly forbids AI use?

Honor it in writing in the SOW, document the workflow as fully human-produced, and price the engagement accordingly (typically 25% to 60% higher than AI-augmented work depending on content type). Some clients are right to want this — regulated industries, sensitive topics, or clients with explicit anti-AI brand positioning. Others will relax the restriction after a quarter of relationship-building; have the conversation again at the first QBR.

QA Is Now a First-Class Agency Capability

AI-assisted production raised the floor on speed and lowered the floor on quality. The agencies that win in 2026 are not the ones with the most aggressive AI adoption; they are the ones with the most disciplined QA wrapped around it. The four-pillar framework, the per-client brand voice file, the sampled audit, and the transparent client disclosure posture are the operational machinery of trustworthy AI-augmented delivery.

Centralize editorial workflows, brand voice files, and QA logs in one platform built for agencies running AI-augmented production at scale. Book a demo.

About the Author

Bilal Azhar, Co-Founder & CEO

Co-Founder & CEO at AgencyPro. Former agency owner writing about the operational lessons learned from running and scaling service businesses.
