The AI integration playbook: from brief to production in 90 days

Not a sales deck with vague phases labeled strategy and execution. The real week-by-week structure with actual deliverables, the failure modes per phase, and the handoffs between them.

Writing · May 29, 2026 · 14 min read

Week 0 is where it dies

Not week 6. Not UAT. Week 0. I have watched three AI pilots collapse in the first two weeks. Not because the model was wrong or the integration was hard. Because the brief was garbage. The client did not actually know what they wanted the AI to do, or they knew abstractly ("AI for collections") but had not gotten specific enough for anyone to build against. Or there was a data problem nobody mentioned because nobody thought to check before kickoff.

Every one of those problems is fatal later and easy to catch early. Clients who push back on a proper discovery phase, "we already know what we want, just start building," consistently end up with a build that takes twice as long and a UAT that surfaces problems that should have died in week 1. Discovery is not delay. It is compression on the back end. This is the playbook we actually run.

Week 0: the brief is a decision document

The brief is not a discovery worksheet. By the end of five days it should answer four questions with enough precision that a developer who has never spoken to the client could build the right thing.

One: what are you actually trying to automate? Not the category, the workflow. Not "AI for collections" but "reduce the time a collections agent spends on initial borrower contact from 4 minutes to under 90 seconds, using an outbound agent that identifies the borrower, reads the account summary aloud, and transfers to the human agent with a live transcript." That sentence has a measurable baseline, a measurable target, a named workflow, and a defined handoff. Every word of the brief should be at this level.

Two: what data exists, and in what form? This is the conversation most clients avoid. We do not. We ask to see a sample export from every system the agent will touch before signing scope. Not a screenshot: an export or schema with row counts and date ranges. On three separate projects the client was confident the data was clean. The reality was, respectively, 40 percent null values in a required field, an Excel file last updated eight months ago, and a legacy database that needed a proprietary driver. Finding that in week 1 costs a conversation. Finding it in week 5 costs a renegotiation.
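The export check is easy to script. A minimal sketch, assuming the export arrives as CSV; the function name `profile_export`, the column names, and the sample rows are all made up for illustration:

```python
import csv
import io

def profile_export(csv_text, required_fields):
    """Return the row count and the null rate of each required field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = len(rows)
    null_rates = {}
    for field_name in required_fields:
        # Missing columns, empty strings, and whitespace all count as null.
        nulls = sum(1 for row in rows if not (row.get(field_name) or "").strip())
        null_rates[field_name] = nulls / total if total else 1.0
    return {"rows": total, "null_rates": null_rates}

# Hypothetical five-row export with two empty phone numbers.
sample = """account_id,borrower_phone,updated_at
A1,09170000001,2025-09-01
A2,,2025-09-01
A3,09170000003,2025-09-02
A4,,2025-09-02
A5,09170000005,2025-09-03
"""

report = profile_export(sample, ["borrower_phone"])
print(report)
```

Run against a real export, the same report also answers the row-count question. A date-range check on a timestamp column would be a natural next addition.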

Three: who are the actual end users and how technically capable are they? Not the IT team. The collections agent, the branch teller, the loan officer. An agent that is brilliant but requires a 40-minute onboarding will get used by 30 percent of the target users. We size the UX for the actual user, not the aspirational one.

Four: what regulatory environment applies? For Philippine financial institutions that means, at minimum, the Bangko Sentral ng Pilipinas (BSP), the National Privacy Commission (NPC), and the Data Privacy Act (RA 10173). The answer changes the audit trail architecture, which changes the agent data model, and that has to be known before ontology design starts.

The signed brief is the client's explicit commitment to a specific outcome, and the document you return to when scope creep starts.

Weeks 1 to 2: ontology and data gap analysis

The ontology is the formal model of every important business object the system operates on. For a collections AI: customer, loan, payment history, collector, branch, contact attempt, escalation rule, regulatory constraint. Every object is defined: what it is, what fields it has, where they come from, what the data quality looks like, and the fallback logic when a field is missing. We draw it as a diagram first, every object a node, every relationship an edge with direction and cardinality. A customer has many loans. A loan has one primary collector and many contact attempts. A contact attempt maps to one regulatory window, because calling a borrower twice in a day in certain contexts violates BSP guidance. The agent is built against the ontology, not the underlying database. If the ontology is wrong, the agent is wrong.
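The relationships in that diagram can be written down as plain types before any agent code exists. A rough sketch, not a definitive model: the class and field names are hypothetical, and the one-contact-per-day limit is a placeholder for whatever regulatory window the brief actually records.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ContactAttempt:
    attempted_on: date
    channel: str  # e.g. "call" or "sms"

@dataclass
class Loan:
    loan_id: str
    primary_collector: str  # a loan has exactly one primary collector
    contact_attempts: list = field(default_factory=list)  # and many attempts

    def can_contact(self, on: date, max_per_day: int = 1) -> bool:
        """Check a proposed contact against the (placeholder) daily window."""
        same_day = sum(1 for a in self.contact_attempts if a.attempted_on == on)
        return same_day < max_per_day

@dataclass
class Customer:
    customer_id: str
    loans: list = field(default_factory=list)  # a customer has many loans

loan = Loan("L-001", primary_collector="collector-7")
customer = Customer("C-001", loans=[loan])
loan.contact_attempts.append(ContactAttempt(date(2026, 5, 29), "call"))

print(loan.can_contact(date(2026, 5, 29)))  # second attempt on the same day
print(loan.can_contact(date(2026, 5, 30)))  # first attempt on the next day
```

The point of the sketch is that the agent asks the ontology whether a contact is allowed and never queries the database for that rule directly.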

The data gap analysis maps every field in every object to its actual source: "Customer.name, source core banking system, completeness 99.8%, format mixed case with occasional all caps, resolution normalize to title case." Every gap has a resolution strategy. If the strategy is "we don't have this data," that is recorded and the design adjusts. When nobody on the client side verifies before kickoff, the "we don't actually have that data" bomb drops in week 2, not week 0. At the week 2 ontology review that is recoverable. During week 4 agent development it is a rebuild.
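One way to keep the gap analysis honest is to hold it as data rather than as a slide. A small sketch under stated assumptions: `FieldGap` and `missing_sources` are hypothetical names, and the `Customer.employer` entry is an invented example of a field with no source.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldGap:
    field_path: str
    source: Optional[str]   # None records "we don't have this data"
    completeness: float     # share of non-null values, 0.0 to 1.0
    resolution: str

def normalize_name(raw: str) -> str:
    """Collapse whitespace and convert mixed or all-caps names to title case."""
    return " ".join(raw.split()).title()

GAPS = [
    FieldGap("Customer.name", "core banking system", 0.998, "normalize to title case"),
    # Invented example of a gap that is recorded instead of silently dropped.
    FieldGap("Customer.employer", None, 0.0, "design adjusts: agent must not rely on it"),
]

def missing_sources(gaps):
    """List every field whose resolution is 'we don't have this data'."""
    return [g.field_path for g in gaps if g.source is None]

print(normalize_name("JUAN  DELA CRUZ"))
print(missing_sources(GAPS))
```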

Weeks 3 to 6: one agent, synthetic then real data

One agent. Not three, not five. Pick the highest-value workflow with the clearest scope and build that one first. Clients push for parallel development: "we have six workflows, shouldn't we build all of them at once?" No. The first agent is where you discover the ontology problems: the field that was not where the schema said, the API returning an unexpected format, the business-rule edge case the process doc never mentioned because everyone in the office already knows it. You want to discover those once, against one agent. If you are building six simultaneously and hit an ontology problem, you fix it six times.

Weeks 3 and 4 are pure development against the ontology, with synthetic test data generated from the object definitions. The agent should run end-to-end on synthetic data by the end of week 4. Week 5 brings in real data for the first time and is usually the most interesting week of the project. Real data surfaces outliers, encoding issues, records predating the current schema, and migration artifacts. We run the agent against 10 to 20 percent of production volume and record every failure or low-confidence case. Week 6 fixes what week 5 surfaced. The goal is a failure rate below the threshold agreed in the brief: typically below 2 percent of cases requiring manual intervention for a routing agent, and below 5 percent extraction errors for a document agent.
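The week 5 pass or fail check is simple arithmetic over the recorded cases. A sketch, assuming each case is logged with a `status` of `ok`, `failed`, or `low_confidence`; those labels and the sample run are illustrative, not real results.

```python
def intervention_rate(outcomes):
    """Share of cases that failed or fell below the confidence bar."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    flagged = [o for o in outcomes if o["status"] in ("failed", "low_confidence")]
    return len(flagged) / len(outcomes)

def passes(outcomes, threshold):
    """True when the flagged share is strictly below the agreed threshold."""
    return intervention_rate(outcomes) < threshold

# Made-up run: 1,000 cases, 10 low-confidence and 5 failed.
run = [{"case_id": i, "status": "ok"} for i in range(985)]
run += [{"case_id": 985 + i, "status": "low_confidence"} for i in range(10)]
run += [{"case_id": 995 + i, "status": "failed"} for i in range(5)]

print(intervention_rate(run))
print(passes(run, threshold=0.02))  # routing-agent threshold from the text
```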

Weeks 6 to 8: integration and real-user UAT

Wiring the agent into existing tools is the fiddliest phase, though not the hardest. Most integrations go in via REST API, some use webhooks, and older systems occasionally need a file-based exchange. The edge cases surprise clients: the API that returns a 503 under production load, the webhook that fires duplicate events when a record is saved twice, the auth token that expires after 24 hours with no refresh because in testing the token was always fresh. We document every integration edge case and verify it before UAT.
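Two of those edge cases have well-known defenses: deduplicate webhook events on a stable key, and refresh the token before it expires. A minimal sketch, assuming each event carries a `record_id` and a `version` (hypothetical field names) and that `fetch_token` is whatever call your auth provider exposes. The in-memory set is for illustration only; a real deployment would need a shared store.

```python
import time

class WebhookDeduplicator:
    """Drop events already seen for the same record and version."""

    def __init__(self):
        self._seen = set()

    def should_process(self, event) -> bool:
        key = (event["record_id"], event["version"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

class TokenCache:
    """Reuse a token until shortly before expiry, then fetch a new one."""

    def __init__(self, fetch_token, ttl_seconds, clock=time.monotonic, margin=60):
        self._fetch = fetch_token
        self._ttl = ttl_seconds
        self._clock = clock
        self._margin = margin  # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token

dedup = WebhookDeduplicator()
event = {"record_id": "R-1", "version": 3}
print(dedup.should_process(event))  # first delivery
print(dedup.should_process(event))  # duplicate delivery of the same save
```

Injecting the clock makes the 24-hour expiry testable in seconds, which is exactly the case that never shows up when the token is always fresh.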

UAT is not IT testing, and most rollout plans get this wrong. IT testing verifies the system does what the spec says. UAT verifies that the actual end users find it usable, accurate, and integrated into their real workflow. Take an illustrative collections scenario (not a signed-customer result): sit three actual collectors with the system for four hours each across two days, and the kind of friction that surfaces is telling. Two of three might hit the same snag: the agent reads the outstanding balance in a format that does not match the paper ledger they also consult, so they double-check every time rather than trusting the output. That friction is in no spec and invisible in IT testing. It can silently kill adoption. A fix like that takes a couple of hours; finding it takes having real users in the room. Budget for change management too: two sessions, one for orientation where complaints come out, one for feedback where trust is built.

Weeks 8 to 10: hardening

Most clients want to skip hardening because it feels like polish. It is not. It is the difference between a system that works in UAT and one that works at 2am when nobody is watching. Error handling means every failure mode has a defined response. API timeout: retry three times with exponential backoff, log, surface to the dashboard. Confidence below threshold: route to manual review. Required field missing: halt, flag the record, notify the handler. Fallback behaviors keep the operation running at degraded but acceptable service: a routing agent that cannot reach its primary LLM falls back to a rule-based table rather than erroring.
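The timeout and fallback rules above translate almost directly into code. A sketch under stated assumptions: `UpstreamTimeout`, `RULE_TABLE`, the bucket names, and the queue names are all hypothetical, and "retry three times" is read as one initial attempt plus three retries.

```python
import time

class UpstreamTimeout(Exception):
    """Raised by the primary call when the upstream API times out."""

def call_with_retry(fn, retries=3, base_delay=0.5, sleep=time.sleep, log=print):
    """Call fn, retrying on timeout with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except UpstreamTimeout:
            if attempt == retries:
                log(f"giving up after {retries} retries")  # surface to dashboard
                raise
            delay = base_delay * 2 ** attempt  # 0.5s, 1s, 2s, ...
            log(f"timeout on attempt {attempt + 1}, retrying in {delay}s")
            sleep(delay)

# Hypothetical rule table used when the primary model is unreachable.
RULE_TABLE = {"under_30_days": "reminder_queue", "over_90_days": "senior_collector"}

def route(account, primary, sleep=time.sleep, log=print):
    """Route with the primary agent, degrading to rules instead of erroring."""
    try:
        return call_with_retry(lambda: primary(account), sleep=sleep, log=log)
    except UpstreamTimeout:
        return RULE_TABLE.get(account["bucket"], "manual_review")

def always_down(account):
    raise UpstreamTimeout()

# sleep is stubbed out so the example finishes instantly.
print(route({"bucket": "over_90_days"}, always_down, sleep=lambda s: None))
```

Unknown buckets fall through to `manual_review`, which mirrors the rule that anything the system cannot decide confidently goes to a human.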

Alert thresholds are defined per agent and per metric. Healthy for a collections routing agent typically means P95 decision latency under 2 seconds, error rate below 1.5 percent, manual review queue below 8 percent, and zero hard failures. Documentation comes last and is the thing developers least want to do: a technical spec for the team that owns the system, with code samples, and a user guide for the operators, with screenshots, covering what the agent does, what it needs, what it produces, and who to contact when something seems wrong.
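Those healthy ranges can live in a small table that the dashboard evaluates on every snapshot. A sketch using the figures from the text; the metric keys, the nearest-rank percentile method, and the sample latencies are assumptions for illustration.

```python
# Healthy means strictly below each limit, plus zero hard failures.
THRESHOLDS = {
    "p95_latency_s": 2.0,
    "error_rate": 0.015,
    "manual_review_rate": 0.08,
}

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]

def breaches(metrics):
    """Return the names of metrics outside the healthy range."""
    out = [name for name, limit in THRESHOLDS.items() if metrics[name] >= limit]
    if metrics["hard_failures"] > 0:
        out.append("hard_failures")
    return out

# Made-up latency sample: mostly fast, a few slow outliers.
latencies = [0.4] * 90 + [1.2] * 8 + [3.5] * 2
snapshot = {
    "p95_latency_s": p95(latencies),
    "error_rate": 0.02,
    "manual_review_rate": 0.05,
    "hard_failures": 0,
}
print(breaches(snapshot))
```

Keeping thresholds per agent means one table like this per agent, not one global table.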

Weeks 10 to 12: phased production ramp

We do not go full-volume on day one. We go live with 20 to 30 percent of production traffic for the first two weeks, selected to be representative of the full range of cases, not just the easy ones. We want to see the hard cases in production before giving the agent full volume. The dashboard runs continuously, watching error rate, manual review rate, confidence distribution, and latency. If they do not match staging, we want to know on day 3 with 20 percent traffic, not day 1 with 100 percent.
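A common way to implement a partial ramp is a stable hash of the case ID, so a given case always lands on the same side of the split. A sketch of that technique, not necessarily how any particular deployment does it; `in_ramp` and the ID format are hypothetical. Hashing alone gives a roughly uniform sample, so whether it is representative of the hard cases still has to be checked separately.

```python
import hashlib

def in_ramp(case_id: str, percent: int) -> bool:
    """Deterministically assign a case to the ramp at the given percentage."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket from 0 to 99
    return bucket < percent

ids = [f"case-{i}" for i in range(10_000)]
share = sum(in_ramp(case_id, 25) for case_id in ids) / len(ids)
print(f"share routed to the agent at 25 percent: {share:.3f}")
```

Because the bucket is fixed per case, raising the percentage only adds cases to the ramp. Nothing that was already going through the agent gets moved back.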

Edge cases appear in production that did not appear in UAT. They always do. Over the two weeks we collect every incorrect, unexpected, or low-confidence case, and these feed the tuning pass at week 11. For most agents this improves performance by 8 to 15 percent on the weakest metrics. It is not a rebuild, it is calibration. At week 12 we move to full volume and produce the first performance report: every metric from the brief with a before-state, a target, and an achieved value. If a metric is not at target, the report explains why and the remediation path. This is the formal close of the 90-day plan.
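The performance report has a fixed shape: one row per metric from the brief. A sketch of that shape, with hypothetical names. The 240-second baseline and 90-second target come from the example brief earlier in this piece; every achieved value and the second metric are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    before: float
    target: float
    achieved: float
    lower_is_better: bool = True

    def met(self) -> bool:
        if self.lower_is_better:
            return self.achieved <= self.target
        return self.achieved >= self.target

def report_lines(results):
    """One line per metric; misses are flagged for an explanation."""
    lines = []
    for r in results:
        status = "at target" if r.met() else "NOT at target: add cause and remediation"
        lines.append(
            f"{r.name}: before={r.before}, target={r.target}, "
            f"achieved={r.achieved} ({status})"
        )
    return lines

# Illustrative values only, not real project results.
results = [
    MetricResult("initial contact time (s)", before=240, target=90, achieved=85),
    MetricResult("manual review rate", before=0.20, target=0.08, achieved=0.11),
]
for line in report_lines(results):
    print(line)
```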

After 90 days: retainer or handoff

Every client who thinks they do not need ongoing maintenance is wrong, without exception. AI systems drift. Business rules change. A new product launches that the ontology does not model. BSP issues a new circular. The underlying models update and prompt behavior shifts slightly even when the prompts have not changed. None of these are catastrophic, but all require someone with intimate knowledge of the system to notice and respond. If nobody is watching, you discover silent degradation three months later when someone asks why collections accuracy dropped from 94 percent to 81 percent and nobody knows when.

Two paths are both legitimate. Hand off to an internal team that was involved in the build, sat in the ontology sessions, and understands the dashboard. Or retain us to monitor, respond to alerts, tune, and update the ontology. Most Philippine clients choose the retainer because their technical teams are already at capacity and the system is not their primary competency. The retainer is not lock-in. It is an acknowledgment that a live AI system is a living thing. The right question is not "do I need maintenance?" (the answer is always yes) but "who should own it, my team or yours?"

The 90 days, compressed

  • Week 0 (days 1 to 5): brief, specific workflow, data evidence, user profile, regulatory scope. A signed document.
  • Weeks 1 to 2: ontology design, object model, data gap analysis, fallback logic. An approved diagram.
  • Weeks 3 to 6: first agent, synthetic testing (weeks 3 to 4), real-data testing (week 5), fixing (week 6). A staged agent with a performance report.
  • Weeks 6 to 8: integration and real-user UAT, API wiring, edge-case handling, change management. An integrated agent, UAT passed.
  • Weeks 8 to 10: hardening, error handling, fallbacks, alerting, dashboard, documentation. A production-ready system.
  • Weeks 10 to 12: production ramp at 20 to 30 percent first, edge-case collection, tuning pass, full volume. The first performance report.
  • Month 3 end: retainer or handoff decision. Someone owns this ongoing.
