How to Apply Consistent Agentic AI Workflow to Team Projects

01010011 @01010011@hackers.pub

TL;DR
AI subscriptions don't automatically increase team productivity.
Designing workflows that preserve individual know-how as team assets is the key role of leaders in the AI era.
I propose implementing a consistent workflow at the team level that operates with step-by-step context separation + deterministic verification (CI) + non-deterministic judgment (LLM) + small units of change.

Overview

In today's world where coding agents have become essential development tools, it's actually difficult to find developers who don't use AI. Companies are keeping pace with this trend by subscribing to OpenAI, Claude, and Gemini, encouraging their active use.

However, subscribing to expensive AI services doesn't automatically improve productivity.
According to METR's study, experienced developers using AI coding tools actually took 19% longer to complete their tasks. The developers expected a roughly 20% speedup; the measured effect was the opposite.
On the other hand, a developer known on social media as Programming Zombie reportedly built and monetized 350 apps using AI. A Chinese developer, EastonDev, refactored 10,000 lines of legacy code in just 14 days, improving test coverage, bug counts, and performance metrics alike.

Why such productivity differences when using the same tools? When individuals use AI on their own, results vary greatly depending on their understanding of the tools, accumulated experience, and effective utilization methods. The issue I've experienced while leading development teams is precisely this point – the variance in AI utilization capabilities across developers and development organizations. AI subscriptions raise the upper limit of individual capabilities, but this doesn't guarantee improved productivity for the entire team.

In this article, I'll discuss methods I've considered for converting individual AI utilization capabilities into team-wide capabilities.

Limitations of LLMs and Harnesses


Since the limitations of LLMs are already well-known, I won't cover them in depth. However, let's review the fundamental limitations that must be recognized when assigning tasks to transformer-based AI.

1. Context is finite

No matter how large the context window becomes, there are still limitations for tasks requiring extensive context. It remains difficult to understand an entire large codebase or handle refactoring across dozens of files at once. To address this, we need to devise separate methods for effectively conveying context.
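As a rough illustration of what "conveying context effectively" can mean in practice, here is a minimal sketch that selects only the files most relevant to a task, within a fixed token budget. The keyword-overlap scoring and the 4-characters-per-token estimate are simplifying assumptions, not a recommendation:

import re
from pathlib import Path

def pick_context(task: str, repo_root: str, token_budget: int = 20_000) -> list[str]:
    """Rank source files by keyword overlap with the task description,
    then keep only what fits in the budget (roughly 4 chars per token)."""
    keywords = set(re.findall(r"\w+", task.lower()))
    scored = []
    for path in Path(repo_root).rglob("*.py"):  # assumption: a Python-only repo
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        score = sum(text.count(k) for k in keywords)
        if score:
            scored.append((score, len(text), str(path)))
    picked, used = [], 0
    for score, size, path in sorted(scored, reverse=True):
        if used + size // 4 <= token_budget:
            picked.append(path)
            used += size // 4
    return picked

Real agents use far better retrieval than keyword counting, but the principle holds: hand the model a curated slice, not the whole repository.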

2. Outputs are probabilistic

The same prompt can produce different results each time. This is due to the fundamental generation method of LLMs. While this is an advantage for creative tasks, it becomes a critical disadvantage for tasks requiring consistency.
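You can see this directly by sending the same prompt several times. A minimal sketch using the Anthropic Python SDK (the model ID is a placeholder; any provider's API shows the same behavior):

import anthropic  # assumes the SDK is installed and ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()
prompt = "Suggest a name for a function that deduplicates a list. Name only."

outputs = set()
for _ in range(5):
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}],
    )
    outputs.add(message.content[0].text.strip())

print(outputs)  # typically more than one distinct answer for identical input

Even at temperature 0, exact repeatability generally isn't guaranteed.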

3. Hallucinations are unavoidable

LLMs confidently generate incorrect information. In coding contexts, they call non-existent APIs, suggest deprecated syntax as if it were current, or create code that imports libraries that don't exist. The problem is that these hallucinations often appear plausible. Trusting AI output without verification can lead to major incidents that might only be discovered at runtime, with compile errors being the least of your worries.
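Some of these hallucinations can be caught mechanically before any human (or runtime) sees them. A minimal sketch using only the standard library: parse AI-generated code and flag imported modules that don't resolve in the current environment.

import ast
import importlib.util

def find_phantom_imports(generated_code: str) -> list[str]:
    """Return top-level imports that don't resolve locally:
    a cheap first check against hallucinated libraries."""
    modules = set()
    for node in ast.walk(ast.parse(generated_code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return [m for m in sorted(modules) if importlib.util.find_spec(m) is None]

print(find_phantom_imports("import os\nimport torchify_utils"))  # ['torchify_utils']

This won't catch a hallucinated function on a real library, but it filters out one entire class of errors for free.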

Harness: A Practical Solution

Among the various ways to compensate for these limitations, the Harness (an agent architecture built around tool use) is currently the most production-ready. Coding agents like Claude Code, Cursor, and Windsurf have all adopted this approach.

However, Harnesses have an expiration date.

The insights from the Bitter Lesson and Scaling Laws remain valid. A new model from Google or OpenAI could suddenly emerge, rendering the carefully constructed Harness useless. For example, extracting text from PDFs once required complex pipelines, but now you can simply feed an image to a multimodal model. A PDF parsing Harness that took weeks to build can become legacy overnight.

Why We Should Still Build Harnesses

Despite these risks, the immediate productivity improvements provided by Harnesses cannot be ignored.

The productivity gap between using raw LLMs and working in a well-configured agent environment has already widened significantly. Even if it becomes obsolete in six months, if the productivity gains during those six months exceed the construction costs, it's worth building.

The challenge is making this Harness consistently usable by the entire team, not just individuals.

Individual AI Capability ≠ Team AI Capability

There are many individuals who use AI effectively. However, whether the team that individual belongs to uses AI well is an entirely different matter.

Providing expensive AI subscriptions to team members doesn't automatically increase team productivity. There are structural reasons why individual AI utilization doesn't translate into team-level productivity.

1. Human Intelligence is the Bottleneck

AI's code production speed far exceeds human review speed. According to ITWorld's analysis, this is like "increasing the speed of just one machine on an assembly line while leaving the rest unchanged—the factory doesn't speed up, unprocessed work just piles up." If code is generated 10 times faster but reviews still need to be done manually by humans, human reviewers inevitably become the bottleneck.

2. Lack of Verification Systems

How trustworthy is AI-generated code? According to the CodeRabbit report, AI-generated code produces 1.7 times more issues per PR than human-written code. Without objective verification metrics and automated quality gates, we cannot have confidence in AI outputs.

3. Wide Skill Gaps

Some people produce high-quality results with sophisticated prompts and tuned agent settings, while others struggle with basic usage. Even with the same tools and the same subscription fees, the productivity gap can widen severalfold. Individual effort alone can only narrow this gap so far.

4. Experience and Know-how Evaporate

This is the most serious problem. Team members doing similar work create similar prompts and attempt similar agent configurations. Even when someone discovers an effective method, that knowledge remains with the individual. Tips shared on Slack get buried after a few days, and guides organized in Notion don't get updated. Experience and know-how about using AI effectively don't accumulate within the team but evaporate.


What do these four problems have in common? The absence of a workflow that accumulates and shares AI capabilities.

As long as we rely on individual capabilities, the AI utilization level of the entire team will inevitably be inconsistent. What's needed is a structure where individual experiences accumulate as team assets, and verified workflows are consistently applied to all team members.

What Leaders Should Do in the AI Era

Going forward, every technical leader must design structures in which individual experience accumulates as team assets. This isn't unique to the AI era, but AI has made the problem far more acute.

The leader's role is to actively build Harnesses and integrate them into team workflows. Here are five principles for doing so:

1. Separate Context by Workflow Stage

Don't try to solve everything with one massive prompt. Each stage (planning review, design, implementation, testing, code review) requires a different context. By delivering only the context appropriate to each stage, you use the LLM's finite context window efficiently and improve the quality of the results.
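A minimal sketch of what stage-scoped context might look like (the stage names, file paths, and prompts here are hypothetical):

from pathlib import Path

# Hypothetical stage-to-context mapping: each stage sees only what it needs.
STAGE_CONTEXT = {
    "design":    {"system": "You are reviewing a technical design.",
                  "inputs": ["docs/rfc.md"]},
    "implement": {"system": "You write code that follows the approved design.",
                  "inputs": ["docs/rfc.md", "src/payment/service.py"]},
    "test":      {"system": "You write tests for the diff below.",
                  "inputs": ["diff.patch"]},
    "review":    {"system": "You review the diff for bugs and convention violations.",
                  "inputs": ["diff.patch", "CONVENTIONS.md"]},
}

def build_prompt(stage: str) -> tuple[str, str]:
    """Assemble (system, user) content from only that stage's inputs."""
    ctx = STAGE_CONTEXT[stage]
    body = "\n\n".join(Path(p).read_text() for p in ctx["inputs"])
    return ctx["system"], body

The point is structural: the implementation stage never sees review conventions, and the review stage never sees the whole codebase.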

2. Distinguish Between Deterministic and Non-Deterministic Tasks

Not everything needs to use LLMs.

Deterministic tasks are rule-based tasks that should always produce the same results. Linting, formatting, static analysis, type checking, and security scanning fall into this category. Using LLMs for these tasks only increases unnecessary costs and uncertainty. Traditional tools are faster, more accurate, and more consistent.

Non-deterministic tasks require contextual understanding and judgment. This is where LLMs excel:

  • Tidying: Small organizational tasks like improving variable names and removing unnecessary duplication
  • Reviewing: Detecting potential bugs, pointing out performance issues, finding convention violations
  • Documentation: Writing code comments, READMEs, API documentation, CHANGELOGs
  • Test generation: Writing unit tests, deriving edge cases, expanding test coverage

Delegate deterministic tasks to CI pipelines and focus LLMs on non-deterministic tasks.
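A sketch of that division of labor as a single gate script. The tool choices (ruff, mypy, pytest) are examples, and llm_review is a stand-in for whatever reviewer agent you use:

import subprocess
import sys

DETERMINISTIC_CHECKS = [      # example tools; substitute your own stack
    ["ruff", "check", "."],   # linting
    ["mypy", "."],            # type checking
    ["pytest", "-q"],         # test suite
]

def run_gates(llm_review) -> int:
    """Run cheap, exact checks first; spend LLM calls only if they pass."""
    for cmd in DETERMINISTIC_CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"blocked by deterministic check: {' '.join(cmd)}")
            return 1
    return llm_review()  # non-deterministic judgment comes last

if __name__ == "__main__":
    sys.exit(run_gates(llm_review=lambda: 0))  # stub reviewer that always approves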

3. Keep Change Scope Small

Just because AI can generate thousands of lines at once doesn't mean it should generate thousands of lines every time. Large changes increase cognitive load for reviewers and create bottlenecks. Changes should be broken down into small units that can be adequately verified and easily rolled back if problems occur.

This doesn't mean keeping everything small unconditionally. The key is to find a scope that can be automatically verified without cognitive load.

In Tidy First?, Kent Beck proposes the concept of 'Tidying', which is smaller than refactoring but more meaningful than linting. For example:

  • Unfolding nested conditionals with guard clauses
  • Replacing magic numbers with descriptive variable names
  • Removing dead code
  • Reordering functions

Changes of this scale can be merged without separate review as long as tests pass. With a well-designed workflow, AI can automatically perform, verify, and apply such Tidying tasks.
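For instance, the first tidying above is a change small enough that passing tests is sufficient evidence of correctness (ship and dispatch are stand-in names):

# Before: nested conditional
def ship(order):
    if order is not None:
        if order.paid:
            dispatch(order)

# After: guard clauses. Same behavior, flatter shape, trivial to review.
def ship(order):
    if order is None:
        return
    if not order.paid:
        return
    dispatch(order)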

4. Create Workflows That Minimize Human Intervention

The bottleneck is ultimately human. If human review speed cannot keep up with AI production speed, we need automated workflows that can verify AI outputs with minimal human intervention.

Typically, verification workflows are structured hierarchically:

Level 1: Deterministic Verification (CI Pipeline)

  • Passing linting, formatting, type checking
  • Passing the entire test suite
  • Security scanning, dependency vulnerability checks

Level 2: Non-deterministic Verification (AI Reviewer)

  • Reviewer agent analyzes changes when a PR is created
  • Detection of potential bugs, performance issues, architecture violations
  • Summarization of key points in PR changes and improvement suggestions

Level 3: Scope-based Automatic Approval

  • Small Tidying-level changes + passing Level 1/2 verification → automatic merge
  • Large changes that trigger versioning → human review requested

Here, Conventional Commits rules give agents a useful hint. By enforcing commit message types like feat:, fix:, refactor:, chore:, and docs:, along with the ! breaking-change marker, the AI can determine the nature and scope of a change unambiguously.

chore: remove unused imports      → can be automatically merged
refactor: separate payment logic  → automatic merge after AI review
feat!: change auth API response   → separate review process needed

With this configuration, changes at the chore, style, docs, refactor level that don't trigger versioning can be merged directly by the reviewer agent if they pass Level 1/2 verification. Only breaking changes like feat!, fix! or significant changes like feat need a separate review process.
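As a sketch, that routing rule can be just a few lines. The regex follows the Conventional Commits header format; the two boolean flags would come from the Level 1 and Level 2 checks:

import re

# Conventional Commits header: type(scope)?!?: description
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?:\s")

AUTO_MERGE_TYPES = {"chore", "style", "docs", "refactor"}  # non-versioning types

def merge_decision(commit_header: str, ci_passed: bool, ai_review_passed: bool) -> str:
    m = HEADER.match(commit_header)
    if not m or not ci_passed:
        return "human-review"   # unparseable or failing CI: a human looks
    if m.group("bang"):
        return "human-review"   # breaking change: always escalate
    if m.group("type") in AUTO_MERGE_TYPES and ai_review_passed:
        return "auto-merge"     # small scope, Level 1/2 passed
    return "human-review"

print(merge_decision("chore: remove unused imports", True, True))     # auto-merge
print(merge_decision("feat!: change auth API response", True, True))  # human-review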

If this workflow matures to the point where even relatively large changes can be merged by review agents without human intervention, most changes will eventually be processed automatically, and the team's (or project's) productivity will reach a genuine inflection point.

5. Ensure All Improvements Accumulate as Team Assets

This is the most important point.

AI adoption is not a matter of "just buying good tools." The 2025 DORA Report frames successful AI adoption as a systems problem rather than a tooling problem: AI's value depends on the surrounding technical and cultural environment, not on the tool itself.

If someone discovers an effective prompt, that prompt should be reflected in tools used by the entire team rather than evaporating. If someone creates a workflow that prevents mistakes, that workflow should become a team system rather than an individual habit.

Best practices discovered by individuals → become team standard workflows → are version controlled → and continuously improved.
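Concretely, this can be as simple as keeping the team's agent instructions and prompts in the repository itself, so improvements arrive as reviewable pull requests. One illustrative layout (previewing the Claude Code conventions covered in the next part):

team-repo/
├── CLAUDE.md                  # shared agent instructions, reviewed like code
├── .claude/
│   └── commands/
│       └── tidy.md            # a reusable prompt, available to everyone
└── .github/
    └── workflows/
        └── quality-gates.yml  # the deterministic Level 1 checks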

For this structure to work, organizations must have a clear and shared AI stance (policies/expectations/allowed tools/scope of application). The DORA Report suggests that the positive effects of AI adoption depend on the existence of such a "clear and communicated AI stance," and when this exists, the positive impact on individual effectiveness and organizational performance is amplified.

Enabling this is the core role of leadership in the AI era.


In the next part, we'll look at specific methods for implementing these principles using Claude Code's Skills, Hooks, and Plugins.
