White Paper · CETI.AI · 2026

The ROI of Structured AI Training

Why Tool Rollouts Without Curriculum Are Backfiring — and What to Do Instead

An evidence-based briefing for engineering and L&D leaders. Drawing on independent research from McKinsey, BCG, Gartner, IDC, METR, GitHub, Deloitte, and Stack Overflow, this paper diagnoses why most AI tool rollouts stall and presents a 90-day implementation playbook that any moderately resourced organization can execute.

Manu Mulaveesala
Founder, CETI.AI
2026 · Confidential

Executive Summary

In 2025, the question facing enterprise leadership shifted. It is no longer whether to deploy generative AI — that decision has been made, often several times over, across procurement, engineering, and learning budgets. The question is why, after eighteen months of rollouts, the financial return is so uneven.

The data reveals a paradox. According to McKinsey’s State of AI 2025, 78 percent of organizations now use AI in at least one business function, yet only 5.5 percent qualify as “high performers” capturing more than five percent EBIT impact from those investments[6]. Only about 39 percent of enterprises report any measurable EBIT effect at all. Tools have proliferated. Outcomes have not.

The cause is structural, and it is not a tools problem. Boston Consulting Group, after surveying thousands of enterprise AI deployments, distilled the lesson into a single ratio that should anchor every executive AI conversation:

“AI value comes 10 percent from algorithms, 20 percent from data and tech, and 70 percent from people, processes, and culture.”

BCG, 2025 · To Unlock the Full Value of AI, Invest in Your People[10]

This white paper makes four arguments, each grounded in independent third-party research:

  1. The opportunity is real: generative AI produces measurable gains when conditions are right.
  2. Most organizations are not capturing it: adoption has scaled while financial returns have not.
  3. Tool access without structured training carries a hidden cost: under the wrong conditions, experienced engineers get slower.
  4. Structured training is the differentiator that converts tool access into enterprise value.

The implication for engineering, L&D, and finance leaders is direct. The next dollar of AI ROI is unlocked not by another seat license, but by the curriculum, measurement framework, and cultural infrastructure that surrounds it.

The Opportunity Is Real

Begin with the upper bound. The opportunity is not in dispute — its capture is.

McKinsey’s foundational 2023 analysis, reaffirmed in their 2025 reporting, sizes the annual economic potential of generative AI at $2.6 to $4.4 trillion, with approximately 75 percent of that value concentrated in four functions: customer operations, marketing and sales, software engineering, and R&D[1]. For software engineering specifically — the function most directly served by tools like Copilot, Cursor, and Claude — the impact estimate is among the largest of any function studied.

Controlled studies bear this out under favorable conditions. In a randomized experiment conducted by GitHub and Microsoft Research in September 2022, developers using GitHub Copilot completed a benchmark coding task 55.8 percent faster than the control group — 1 hour 11 minutes versus 2 hours 41 minutes — with a 95 percent confidence interval ranging from 21 to 89 percent[2].

That number is real, and it is a ceiling, not a floor. The task was a greenfield JavaScript HTTP server build: small surface area, no legacy context, no production constraints, no review cycle. Enterprise software engineering looks different. The honest reading of the GitHub study is that AI tooling can roughly halve the time required for greenfield, well-bounded coding tasks under controlled conditions — and that the variance is wide enough that organizations should not budget against the midpoint without their own measurement.
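
The arithmetic behind the headline figure can be checked directly. The sketch below recomputes the speedup from the two completion times quoted above and then applies the reported confidence-interval bounds to a task of assumed size. The 10-hour baseline task and the function names are illustrative assumptions, not part of the GitHub study.

```python
def percent_faster(treated_minutes: float, control_minutes: float) -> float:
    """Percent reduction in completion time relative to the control group."""
    return (control_minutes - treated_minutes) / control_minutes * 100


def hours_after_speedup(baseline_hours: float, pct_faster: float) -> float:
    """Task duration once a given percent speedup is applied."""
    return baseline_hours * (1 - pct_faster / 100)


# Completion times quoted in the text, rounded to whole minutes.
copilot_minutes = 1 * 60 + 11   # 71
control_minutes = 2 * 60 + 41   # 161

speedup = percent_faster(copilot_minutes, control_minutes)
print(f"{speedup:.1f}% faster")  # 55.9% with rounded minutes; the study reports 55.8%

# Hypothetical 10-hour task, budgeted at both ends of the 21-89 percent interval.
for bound in (21, 89):
    print(f"{bound}% speedup -> {hours_after_speedup(10, bound):.1f} hours")
```

The spread between the two budgeted durations is the practical reason not to plan against the midpoint without local measurement.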

Enterprise deployments have begun to produce evidence at production scale. In late 2025, HUB International, one of the largest insurance brokerages in North America, announced the rollout of Anthropic’s Claude to more than 20,000 employees[3]. Their reported outcomes: 85 percent productivity gains in targeted use cases and 90 percent user satisfaction.

HUB’s numbers are notable not because they are the largest published, but because they are bounded. “Targeted use cases” is the operative phrase: HUB defined the workflows where AI was deployed, trained employees against those workflows, and measured impact within scope. They did not ship seats and hope. The 85 percent figure is what the upper end of structured deployment looks like inside a real enterprise.

Across these three data points (McKinsey’s $2.6T–$4.4T sizing, GitHub’s 55.8 percent speedup in a greenfield experiment, and HUB’s 85 percent productivity gain in targeted enterprise use), the message is consistent. The gain is real when conditions are right. The remainder of this paper is about what those conditions are, and why most organizations are missing them.

But Most Organizations Aren’t Capturing It

If the upper bound is $4.4 trillion, the median enterprise is nowhere near it.

Gartner’s July 2024 forecast — among the most-cited data points in 2025 board conversations — projects that at least 30 percent of generative AI projects will be abandoned after proof-of-concept by the end of 2025[4]. The reasons cited cluster around poor data quality, inadequate risk controls, escalating costs, and unclear business value. None of those are technical limitations of the underlying models. They are failures of organizational readiness.

The pattern is widening, not narrowing. In June 2025, Gartner extended the forecast to agentic AI specifically: 40 percent or more of agentic AI projects will be canceled by the end of 2027, citing rising costs, unclear business value, and inadequate risk controls[5]. Agentic AI — autonomous, tool-using AI systems — is the category most enterprises are now piloting as their second wave. The forecast suggests the second wave will fail at a higher rate than the first.

McKinsey’s State of AI 2025 survey (published March 2025) puts numbers on the diagnosis. Of the 78 percent of organizations now using AI in at least one function, only 5.5 percent qualify as “high performers,” defined as those reporting greater than five percent EBIT impact attributable to generative AI[6]. Only about 39 percent report any measurable enterprise-level EBIT effect at all. The remainder report adoption without outcomes.

“Among 78 percent of organizations using AI, only 5.5 percent capture more than 5 percent EBIT impact. Tools have scaled. Returns have not.”

McKinsey, State of AI 2025[6]

The gap is not explained by tool quality. The same models, the same vendors, the same licenses are available across the high-performing 5.5 percent and the long tail. What separates them is what surrounds the deployment.

The capability cost of this gap is staggering. IDC’s 2024–25 research estimates the global IT and AI skills shortage may cost organizations $5.5 trillion by 2026 in delayed projects, lost productivity, and forfeited revenue[7]. In their survey, 45 percent of respondents identified AI proficiency as the hardest-to-source skill in their organization. The shortage is not of seats — it is of people who can use them.

A pattern emerges across these data sources:

Source | Finding | Year
Gartner | 30% of GenAI projects abandoned post-PoC | 2025
Gartner | 40+% of agentic AI projects canceled | 2027 (proj.)
McKinsey | Only 5.5% of orgs capture >5% EBIT from AI | 2025
McKinsey | ~39% report any measurable EBIT effect | 2025
IDC | $5.5T global skills-gap cost | 2026 (proj.)
IDC | 45% cite AI proficiency as hardest skill to source | 2024–25

These are not isolated findings from advocacy research. They are convergent measurements from independent firms, using independent methodologies, surveying independent enterprise populations. The gap between adoption and outcome is structural, and the structural cause is people-and-process, not technology.

This is the pivot point of the paper. The next section examines what happens inside the enterprise when tools are deployed without the structural infrastructure to use them.

The Hidden Cost: Trained Engineers Are Getting Slower

Here is the most uncomfortable finding in the 2025 literature, and the one most often omitted from vendor decks. Under the wrong conditions, AI tooling makes experienced engineers measurably slower.

In July 2025, METR — a respected nonprofit research organization focused on AI evaluation — published a controlled study of sixteen experienced open-source developers working on 246 real issues in mature codebases they knew well, using Cursor with Claude 3.5 and 3.7[12]. Each task was randomly assigned to one of two conditions: AI tools allowed, or AI tools disallowed.

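The result, after careful instrumentation: when AI tools were allowed, developers took 19 percent longer to complete their tasks. Before the study they had forecast a 24 percent speedup, and afterward they still believed AI had made them roughly 20 percent faster[12].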

The gap between perception and measurement was nearly 40 percentage points. Developers believed AI made them faster. The clock said otherwise.

The METR finding is regime-specific and must not be over-read. The study covered experienced developers in large, mature repositories they knew deeply, exactly the conditions where AI tools have the most context to load and the most existing structure to respect. The finding does not show AI tools are useless. It suggests that without skill and process, the productivity curve can invert in precisely the high-leverage senior-engineer cohort that enterprise leaders most want to accelerate.

The mechanism is documented in adjacent research. The 2025 Stack Overflow Developer Survey, with more than 49,000 respondents, shows two convergent shifts[13]:

The combination is telling. Adoption is up. Trust is down. Cleanup overhead is the dominant frustration. Developers are using these tools whether or not they have been trained to use them well — and they are paying a tax on the gap.

The cognitive mechanics of the slowdown are now well-documented in field reports:

  1. Context-loading overhead. In a mature codebase, supplying the AI with enough context to produce correct output takes longer than writing the code directly when the engineer already holds the context.
  2. Trust-but-verify cycles. A 70-percent-correct suggestion requires near-100-percent verification. The verification cost frequently exceeds the generation savings.
  3. “Almost-right” cleanup. The hardest bugs to find are the ones in code that looks plausible. AI-generated code raises the rate of plausible-looking errors.
  4. Skill atrophy on fundamentals. Engineers who delegate without internalizing lose the muscle memory required to verify the output.
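
The first two mechanisms above can be expressed as a simple break-even model. The sketch below is an illustration under stated assumptions, not a model taken from METR or Stack Overflow: the `TaskEstimate` fields, the function names, and every minute value are hypothetical, chosen only to show how context-loading and verification costs can outweigh generation savings.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    write_minutes: float    # time to write the code by hand
    context_minutes: float  # time to load context into the assistant and prompt it
    verify_minutes: float   # time to review the suggestion, paid on every attempt
    fix_minutes: float      # time to repair a suggestion that turns out wrong
    p_correct: float        # probability the suggestion is correct as given


def expected_ai_minutes(task: TaskEstimate) -> float:
    """Expected time on the AI-assisted path, including repair of wrong output."""
    repair = (1 - task.p_correct) * task.fix_minutes
    return task.context_minutes + task.verify_minutes + repair


def net_savings_minutes(task: TaskEstimate) -> float:
    """Positive means the AI path is faster; negative means it is slower."""
    return task.write_minutes - expected_ai_minutes(task)


# Hypothetical greenfield task: little context to load, light review.
greenfield = TaskEstimate(write_minutes=60, context_minutes=5,
                          verify_minutes=10, fix_minutes=20, p_correct=0.7)

# Hypothetical mature-codebase task: the engineer already holds the context,
# so hand-writing is quick while context loading and review are expensive.
mature = TaskEstimate(write_minutes=30, context_minutes=15,
                      verify_minutes=20, fix_minutes=25, p_correct=0.7)

for name, task in (("greenfield", greenfield), ("mature", mature)):
    print(f"{name}: net savings {net_savings_minutes(task):+.1f} minutes")
```

With the same 70 percent correctness in both cases, the sign of the result flips purely because of context and verification overhead, which is the pattern the list describes.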

“Trust in AI accuracy fell from 40% to 29% in a single year. 66% of developers spend more time fixing ‘almost-right’ AI output than they would have spent writing it themselves.”

Stack Overflow Developer Survey 2025[13]

None of these failure modes are tool defects. They are skill gaps. They are addressed by training engineers in the patterns of effective AI collaboration: when to invoke, when to verify, how to structure prompts and context, how to recognize the failure modes of the model in the engineer’s specific stack, and how to instrument the work so productivity is measured rather than guessed.

The METR study, the Stack Overflow trust collapse, and Gartner’s 30 and 40 percent abandonment forecasts are three views of the same phenomenon. Tool access without structured training is not neutral. It is negative. And that is the case the next section addresses directly.

The Structural Fix: Training as the Differentiator

If the gap between tool adoption and tool outcome is structural, the structural fix is structured training. The 2025 enterprise AI research converges on this point with unusual clarity.

Begin with the BCG framing, which deserves to be the anchor of any enterprise AI conversation:

“AI value comes 10 percent from algorithms, 20 percent from data and tech, and 70 percent from people, processes, and culture.”

BCG, 2025[10]

Read carefully. BCG is not arguing that algorithms or data are unimportant: together they account for 30 percent of value, which is significant. The argument is about marginal investment. A marginal dollar spent on a better model captures a fraction of the value of a marginal dollar spent on training, workflow redesign, and cultural enablement. For enterprises that have already procured the algorithms and the data infrastructure, the 70 percent multiplier is the only lever left.

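BCG’s 2025 AI at Work survey, published September 2025, quantifies the multiplier. Among employees who received five or more hours of training, 79 percent were regular AI users; among those who received less, 67 percent were.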

A 12-point gap, contingent on five hours of structured training[9]. At enterprise scale, that delta is the difference between a tool that pays back its license and one that does not.

The same BCG report tracks the enterprise outcome. “Future-built” AI leaders — the cohort defined by mature data foundations, structured AI workforce programs, and disciplined value tracking — project 2× revenue growth and 40 percent greater cost reductions than laggards over a three-year horizon[9]. Training is not the only variable in that gap, but BCG’s analysis identifies it as a primary differentiator.

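McKinsey’s Superagency in the Workplace (January 2025) reaches the same conclusion from a different population. Surveying 3,613 employees and 238 C-suite executives, the study found that 48 percent of employees rank training as the #1 factor for AI adoption, and that 46 percent of executives identify the skills gap as the dominant cause of slow progress[8].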

The signal converges across employees, executives, and HR leadership: the bottleneck is recognized. The remediation is uneven.

Deloitte’s State of Generative AI in the Enterprise (Q4 2024 / Q1 2025 wave, with 2026 follow-up data) ranks the AI skills gap as the #1 barrier to enterprise AI integration across the surveyed population[11]. The remediation strategies enterprises report:

The strategies are correct. The execution is uneven, which is why the McKinsey 5.5 percent figure persists.

“Employees rank training as the #1 factor for AI adoption. Executives identify the skills gap as the dominant cause of slow progress. The diagnosis is not in dispute. The execution is.”

Synthesis · McKinsey + BCG + Deloitte, 2025

Synthesizing across BCG, McKinsey, Deloitte, and IDC, four operating principles emerge for enterprise AI training programs:

1. Dose matters, and the threshold is measurable.

BCG’s data shows a discontinuity around the five-hour mark. Below it, AI tool usage stalls. Above it, regular usage rises and persists. Programs that deliver less than five hours per learner are unlikely to produce regular users.

2. Training is the rate-limiting input.

When 48 percent of employees identify it as the #1 adoption factor, the inference is that no other intervention — tool selection, policy, leadership communication — produces equivalent leverage on adoption.

3. The C-suite already knows.

The 46 percent of executives who identify the skills gap as the dominant blocker do not need to be persuaded of the diagnosis. They need an operationalized response.

4. Generic training underdelivers.

The gap between 53 percent of enterprises pursuing “broad workforce education” and 5.5 percent capturing meaningful EBIT suggests that breadth without depth — or training without measurement — does not move the needle. The remediation must be structured, measured, and matched to the specific workflows and stack of the deploying organization.

The structural fix is structured training. The remainder of this paper describes one operational form that fix can take.

The CETI Methodology

CETI.AI’s training program is one response to the structural problem this paper identifies. It is presented here briefly, as a worked example of what an evidence-aligned program looks like — not as a marketing pitch. Pricing, scope details, and engagement options are published openly at cetiai.co/enterprise-training.

The program is built on four design commitments, each tied directly to the research above.

1. Dose targeting above the five-hour threshold.

BCG’s data identifies a discontinuity in regular-usage rates above five hours of training per learner[9]. CETI’s foundation engagement — a focused Boot Camp of two to four days, delivered live and customized to the client’s stack — is designed to clear that threshold by a wide margin while concentrating the learning into a window short enough to be operationally absorbable.

2. Continuous reinforcement against skill decay.

Single-event training erodes. CETI’s 6-month Academy structure provides monthly working sessions, between-session asynchronous support, and a curriculum that evolves with the client’s actual workflows over the engagement window. The structure is informed by adult-learning research showing that skill retention requires spaced practice and applied repetition, not one-time exposure.

3. Customization over generic curriculum.

Deloitte’s data shows 53 percent of enterprises pursue “broad workforce education,” but only a small fraction capture EBIT impact[6][11]. Generic curriculum is a near-universal failure mode. CETI’s curriculum is rebuilt per client: the modules are configured against the client’s actual repositories, frameworks, and ticket categories. Real PRs are written against the client’s production codebase during the engagement, with the client’s reviewers, against the client’s CI.

4. Measurement as a first-class deliverable.

The pattern in the failure literature (Gartner’s 30 and 40 percent abandonment forecasts, McKinsey’s 5.5 percent EBIT capture) is consistent: enterprises cannot defend AI investment they cannot measure. CETI delivers quarterly ROI reports to leadership: baseline measurement before engagement, instrumented measurement during, and structured reporting against agreed-upon metrics (cycle time, PR throughput, ticket close rate, defect density, engineer-self-reported leverage).

The components map to a sequence:

Component | Format | Duration | Purpose
Discovery | Diagnostic + skill survey | 1–2 weeks | Baseline current state, identify highest-leverage workflows
Boot Camp | Live, intensive, hands-on | 2–4 days | Cross five-hour dose threshold; build foundational fluency
Academy | Monthly sessions + async | 6 months | Reinforce, deepen, customize to evolving workflows
Electives | Topic-specific modules | Ongoing | Address specialized needs (security, agentic systems, etc.)
ROI Reporting | Quarterly to leadership | Ongoing | Defend, refine, and expand the investment

The methodology is not unique to CETI in its components. Boot camps, Academies, and elective curricula exist in many forms. What is uncommon — and what the research suggests is the operative variable — is the combination: dose targeting, continuous reinforcement, customization to the client’s actual codebase, and instrumented ROI measurement, delivered as a single integrated program.

The next section provides a vendor-neutral implementation playbook so that the reader of this paper can act on the diagnosis whether or not they engage CETI.

A 90-Day Implementation Playbook

The following playbook is designed to be executable by any organization with a moderately resourced engineering and L&D function, with or without external partners. It is structured in three thirty-day phases.

Days 1–30: Diagnostic

The single most common failure in enterprise AI rollouts is action without baseline. Without a baseline, no intervention can be measured, no investment can be defended, and the organization is likely to remain outside McKinsey’s 5.5 percent of high performers. Spend the first thirty days establishing the measurement floor.

Skill survey. Administer a structured assessment to the engineering population covering: current AI tool usage frequency, self-reported confidence, prompt and context-construction skill, awareness of failure modes, and verification practices. Segment by tenure and team. Output: a skill-distribution map identifying who is at which level, and where the population clusters.

Baseline measurement. Instrument three to five workflow metrics in advance of any training intervention. Recommended starter set: cycle time, PR throughput, ticket close rate, and defect density.

Tool audit. Inventory current AI tool licenses, seat assignments, and active-use rates. The gap between licensed seats and active users is often 50 percent or more, and represents the first dollar of recoverable cost.
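
A tool audit of this kind reduces to counting seats with recent activity. The sketch below is a minimal illustration: the seat names, the 30-day activity window, and the per-seat price are hypothetical assumptions, and real data would come from whatever usage export the vendor's admin console provides.

```python
from datetime import date, timedelta
from typing import Optional


def seat_utilization(last_active: dict[str, Optional[date]],
                     as_of: date,
                     window_days: int = 30) -> dict:
    """Count licensed, active, and idle seats.

    A seat is active if its last recorded use falls within `window_days`
    of `as_of`. A value of None means the seat was never used.
    """
    cutoff = as_of - timedelta(days=window_days)
    licensed = len(last_active)
    active = sum(1 for last in last_active.values()
                 if last is not None and last >= cutoff)
    idle = licensed - active
    idle_share = idle / licensed if licensed else 0.0
    return {"licensed": licensed, "active": active,
            "idle": idle, "idle_share": idle_share}


# Hypothetical seat export: user -> date of last recorded use.
seats = {
    "alice": date(2026, 1, 28),
    "bob": date(2025, 11, 2),
    "carol": None,
    "dev": date(2026, 1, 5),
}

audit = seat_utilization(seats, as_of=date(2026, 1, 31))
monthly_seat_price = 20  # hypothetical price per seat
print(audit)
print("Recoverable monthly cost:", audit["idle"] * monthly_seat_price)
```

Running the same audit monthly turns the licensed-versus-active gap into a tracked figure instead of a one-time estimate.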

Use-case selection. Identify the three to five workflows where AI is most likely to produce measurable lift in your stack. Prioritize: high-frequency tasks, well-bounded scope, existing team consensus on what “good” looks like. Avoid vague aspirations (“make engineering faster”); pursue specific commitments (“reduce median PR cycle time on the platform team by 20 percent”).
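
A specific commitment such as the PR cycle-time example implies a baseline number and a derived target. The sketch below shows one way to compute both from opened and merged timestamps. The timestamps and function names are hypothetical, and median time from open to merge is only one reasonable definition of cycle time.

```python
from datetime import datetime
from statistics import median


def median_cycle_hours(prs: list[tuple[datetime, datetime]]) -> float:
    """Median hours from PR opened to PR merged."""
    durations = [(merged - opened).total_seconds() / 3600
                 for opened, merged in prs]
    return median(durations)


def target_from_baseline(baseline: float, reduction_pct: float) -> float:
    """Value the metric must reach to meet a percent-reduction commitment."""
    return baseline * (1 - reduction_pct / 100)


# Hypothetical (opened, merged) pairs for one team during the baseline window.
baseline_prs = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 6, 5, 0)),    # 20 hours
    (datetime(2026, 1, 7, 9, 0), datetime(2026, 1, 8, 15, 0)),   # 30 hours
    (datetime(2026, 1, 12, 9, 0), datetime(2026, 1, 14, 11, 0)), # 50 hours
]

baseline = median_cycle_hours(baseline_prs)
target = target_from_baseline(baseline, reduction_pct=20)
print(f"Baseline median: {baseline:.1f} h, target: {target:.1f} h")
```

The median is used here rather than the mean so that a single long-running PR does not move the baseline.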

Days 31–60: Foundation

The second thirty days deliver the intensive training event. This is the dose-threshold phase.

Boot Camp or equivalent intensive. Deliver a structured, hands-on, live training program of two to four days, clearing the five-hour BCG threshold by a meaningful margin. The program should be:

Pre- and post-assessment. Re-administer the skill survey from Day 1. The delta is the first quantitative output of the program. Expect meaningful lifts in confidence and tool fluency; expect smaller lifts in verification skill, which requires sustained practice to develop.
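
The pre/post delta can be computed per survey dimension. The sketch below assumes each response is a mapping from dimension name to a 1-to-5 self-rating. The dimension names and scores are hypothetical, loosely following the survey areas listed in the diagnostic phase.

```python
from statistics import mean


def mean_by_dimension(responses: list[dict[str, int]]) -> dict[str, float]:
    """Average score per survey dimension across all respondents."""
    dimensions = responses[0].keys()
    return {d: mean(r[d] for r in responses) for d in dimensions}


def assessment_delta(pre: list[dict[str, int]],
                     post: list[dict[str, int]]) -> dict[str, float]:
    """Post-training mean minus pre-training mean, per dimension."""
    pre_means = mean_by_dimension(pre)
    post_means = mean_by_dimension(post)
    return {d: post_means[d] - pre_means[d] for d in pre_means}


# Hypothetical 1-5 self-ratings from the same two engineers, before and after.
pre_survey = [
    {"confidence": 2, "context_construction": 2, "verification": 3},
    {"confidence": 3, "context_construction": 2, "verification": 2},
]
post_survey = [
    {"confidence": 4, "context_construction": 4, "verification": 3},
    {"confidence": 4, "context_construction": 3, "verification": 3},
]

for dimension, delta in assessment_delta(pre_survey, post_survey).items():
    print(f"{dimension}: {delta:+.1f}")
```

Reporting the delta per dimension, not as one blended score, makes it visible when verification skill lags confidence.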

Cohort structure. Train teams together rather than individuals separately. Adoption is a team-level phenomenon — a single trained engineer in an untrained team often regresses. A trained team reinforces itself.

Days 61–90: Application

The final thirty days move from training to applied work, with the first ROI checkpoint.

Project pairing. Each trained engineer is paired with one or more in-scope projects from the use-case selection in Days 1–30. Work proceeds against real production code, real tickets, real review cycles. The training is not a separate event — it is now the engineer’s daily working pattern.

Active reinforcement. Schedule weekly 30-to-60-minute working sessions where the cohort surfaces friction, shares discovered patterns, and resolves novel failure modes together. This is the structure that converts the Boot Camp dose into durable behavior change.

First ROI checkpoint. At Day 90, re-measure the workflow metrics from Day 1. Report to leadership. Expect:

The Day 90 report is not the end. It is the artifact that earns the budget for months four through twelve, where the larger ROI is captured. The first ninety days establish the measurement floor; the remainder of the year establishes the trajectory.
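
A Day 90 report needs each metric compared against its baseline, with the direction of improvement made explicit, since a drop in cycle time is good while a drop in throughput is not. The sketch below is one possible shape for that comparison. The metric names and all values are hypothetical.

```python
# Metrics where a decrease counts as an improvement (assumed naming).
LOWER_IS_BETTER = {"pr_cycle_hours", "defect_density"}


def day90_report(baseline: dict[str, float],
                 day90: dict[str, float]) -> list[tuple[str, float, bool]]:
    """Return (metric, percent change, improved?) for each baseline metric."""
    rows = []
    for name, base in baseline.items():
        change_pct = (day90[name] - base) / base * 100
        if name in LOWER_IS_BETTER:
            improved = change_pct < 0
        else:
            improved = change_pct > 0
        rows.append((name, round(change_pct, 1), improved))
    return rows


# Hypothetical measurements taken at baseline and again at Day 90.
baseline_metrics = {"pr_cycle_hours": 30.0, "prs_merged_per_week": 40.0,
                    "defect_density": 2.0}
day90_metrics = {"pr_cycle_hours": 24.0, "prs_merged_per_week": 46.0,
                 "defect_density": 2.2}

for metric, change, improved in day90_report(baseline_metrics, day90_metrics):
    status = "improved" if improved else "regressed or flat"
    print(f"{metric}: {change:+.1f}% ({status})")
```

Keeping regressions in the same table as improvements is deliberate: a report that shows faster cycle time alongside rising defect density is more defensible than one that shows only the wins.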

A summary of the playbook:

Phase | Days | Output
Diagnostic | 1–30 | Skill map, workflow baselines, tool audit, use-case shortlist
Foundation | 31–60 | Trained cohort above 5-hour threshold, pre/post assessment
Application | 61–90 | Active project work, weekly reinforcement, first ROI report

This playbook can be executed with internal resources, with an external partner, or with some combination. What it cannot be, if the organization is to join McKinsey’s 5.5 percent of high performers, is skipped.

Conclusion: The 70% Multiplier

Return to the BCG framing.

“AI value comes 10 percent from algorithms, 20 percent from data and tech, and 70 percent from people, processes, and culture.”

BCG, 2025[10]

Two years into the enterprise generative AI cycle, the 10 percent and the 20 percent are commodities. Every enterprise has access to the same models, the same vendors, the same infrastructure stacks. The differentiator is no longer in those two layers.

The 70 percent (people, processes, culture) is where the variance lives. It is also where most of the 78 percent of organizations using AI have under-invested, which is why only 5.5 percent capture more than five percent EBIT impact[6]. The companies winning the 2026 cycle are not the ones with the best tools. They are the ones investing systematically in the workforce, measurement, and operational infrastructure around the tools.

This is the empirical claim of this paper, supported by independent research from McKinsey, BCG, Gartner, IDC, METR, GitHub, Deloitte, and Stack Overflow: structured training is the differentiator that converts tool access into enterprise value. Without it, productivity inverts in the senior cohort, projects stall at PoC, and EBIT impact remains the privilege of the top 5.5 percent.

For the engineering, learning, and finance leader reading this paper, the next decision is operational, not strategic. The strategic case is settled. What remains is whether your organization will be in the cohort that captured the 70 percent multiplier, or in the long tail that funded the seats and missed the return.

To begin a discovery conversation, book a 30-minute consultation at calendly.com/manutej/30min, or visit cetiai.co/enterprise-training for current program details and pricing.

Sources & Citations

  1. McKinsey & Company. (2023, June; reaffirmed 2025). The economic potential of generative AI: The next productivity frontier. mckinsey.com/capabilities/tech-and-ai/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier
  2. Kalliamvakou, E. (2022, September). Research: quantifying GitHub Copilot’s impact on developer productivity and happiness. GitHub Blog / Microsoft Research. github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness
  3. HUB International / Anthropic. (2025). HUB International brings Anthropic’s Claude to 20,000 employees, reports 85% productivity gains and 90% user satisfaction. PR Newswire. prnewswire.com/news-releases/hub-international-brings-anthropics-claude-to-20-000-employees
  4. Gartner. (2024, July 29). Gartner predicts 30% of generative AI projects will be abandoned after proof of concept by end of 2025. gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025
  5. Gartner. (2025, June 25). Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027. gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
  6. McKinsey & Company. (2025, March). The state of AI: How organizations are rewiring to capture value. mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
  7. IDC. (2024–2025). Skills, AI, and the enterprise: Three strategies for the road ahead. idc.com/resource-center/blog/skills-ai-and-the-enterprise-three-strategies-for-the-road-ahead
  8. McKinsey & Company. (2025, January). Superagency in the workplace: Empowering people to unlock AI’s full potential at work. (n=3,613 employees + 238 C-level.) mckinsey.com/capabilities/tech-and-ai/our-insights/superagency-in-the-workplace-empowering-people-to-unlock-ais-full-potential-at-work
  9. Boston Consulting Group. (2025, September 30). AI at Work 2025: AI leaders outpace laggards in revenue growth and cost savings. bcg.com/press/30september2025-ai-leaders-outpace-laggards-revenue-growth-cost-savings
  10. Boston Consulting Group. (2025). To unlock the full value of AI, invest in your people. bcg.com/2025/to-unlock-the-full-value-of-ai-invest-in-your-people
  11. Deloitte. (2024 Q4 / 2025 Q1 / 2026 wave). State of Generative AI in the Enterprise. deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html
  12. METR. (2025, July 10). Measuring the impact of early-2025 AI on experienced open-source developer productivity. (n=16 experienced developers, 246 real issues, Cursor + Claude 3.5 / 3.7.) metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study
  13. Stack Overflow. (2025, December). 2025 Developer Survey: AI section. (n>49,000 respondents.) survey.stackoverflow.co/2025/ai