Locastic, Agile, Mar 26, 2026 · 9 min read

Your Agency’s Billing Model Wasn’t Built for AI.

One of our developers came to me last week with a simple question: “How do I log my hours?”

She had a clean Wednesday afternoon with no meetings and no distractions. She set up AI agent sessions for two client projects, prepared the context and prompts, kicked them off in parallel, and then stepped back. Over the next few hours, she checked in periodically to review output, course-correct, and approve results.

Here’s what the agent sessions produced:

Project A, ~3.5h of agent session time:

  • Seasons feature: status enum, query filters, endpoint restructuring, test updates
  • Event listing: fix lazy loading, handle null configs, simplify auth
  • General: restore resources after merge, update gitignore
  • Specification: implementation proposals for upcoming tasks

Project B, ~4h of agent session time:

  • Fleet management: entity modeling, admin CRUD, sync command, serializer, scheduler, DB import

That’s 7.5 hours of agent output. The Dev’s active involvement (upfront planning, reviews, a handful of interventions mid-session) added up to roughly 1.5 hours. The rest of the time, she was “free” as the agents were running. She could have been working on something else, or not working at all. Her question was honest: “There’s really no way to log this even slightly precisely, at least not as we used to. How do I log this?”

She’s right, and if you’re running an agency on time-and-materials contracts, this is about to become your most urgent operational problem.

Why this hits agencies hardest

Product companies can absorb this shift internally, at least partially. If an engineer produces 3x more output, the product ships faster, but the billing relationship with users doesn’t change, at least not as directly as it does for agencies (competition will force the issue eventually).

Agencies don’t have that luxury because the entire business model is built on a chain of assumptions:

Hours logged → ManDays calculated → Client invoiced

Every link in that chain assumes that one hour of developer time equals roughly one hour of developer output. Time and materials pricing, the most common model in agency work, makes this assumption explicit. The client is literally paying for hours (or days).

AI agents break that assumption fundamentally. When a developer can set up two parallel agent sessions, do 1.5 hours of real cognitive work, and produce 9 hours of total output in about 4 hours of wall-clock time, the concept of “an hour of work” stops meaning what it used to mean.

The options (and why none of them fully work)

When we sat down to figure this out, we found five approaches (I’m sure there are more), each of which makes sense from a certain angle but has a real flaw. And one long-term direction that’s probably right, but doesn’t solve next Monday’s timesheet.

Option 1: Log real hours only

The Dev did about 1.5h of active work, so log 1.5 hours and split proportionally across projects based on session weight.

Project      Session time   Ratio   Logged
Project A    3.5h           47%     0.7h
Project B    4h             53%     0.8h
Total                               1.5h

This is honest and simple. The client pays for real human time, and the developer logs exactly what she worked on.

The problem: you’re delivering 9 hours’ worth of output and billing for 1.5. Over time, the agency captures none of the value from adopting AI. Your regular costs stay the same (salaries, tooling, and so on), you add the cost of AI subscriptions and token usage, but your revenue per unit of output drops dramatically. You’ve made your team more productive and made yourself less profitable.
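As a quick illustration, the proportional split behind these numbers can be sketched in a few lines. The function name and shape here are hypothetical, not part of any real time-tracking tool:

```python
# Hypothetical helper illustrating Option 1: split the developer's active
# hours across projects in proportion to each project's agent-session time.
def split_hours(active_hours, session_times):
    """session_times maps a project name to its agent-session hours."""
    total = sum(session_times.values())
    return {
        project: round(active_hours * hours / total, 1)
        for project, hours in session_times.items()
    }

# 1.5h of real work, weighted by 3.5h and 4h of session time
logged = split_hours(1.5, {"Project A": 3.5, "Project B": 4.0})
print(logged)  # {'Project A': 0.7, 'Project B': 0.8}
```

The same split applied to the 4 hours of wall-clock time reproduces the 1.9h/2.1h numbers in Option 3; only the quantity being divided changes, not the logic.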

Option 2: Log session hours

The two agent sessions ran in parallel for about 4 hours of wall-clock time, producing 3.5h and 4h of output, respectively. On top of that, the Dev did 1.5 hours of active cognitive work: the upfront planning, the reviews, the interventions. If you log it all, that’s 3.5 + 4 + 1.5 = 9 hours across the two projects.

This captures the value of the output better, and arguably it still undersells it, since a well-run AI agent session often produces more than a single developer would in the same wall-clock time.

The problem: you’re billing AI compute time at a human rate. When a client looks at the invoice and realizes that 7.5 of those 9 hours were produced by a machine, the natural question is “Why am I paying your developer’s day rate for an agent’s work?”

And that’s a reasonable question.

It also creates an incentive to run more sessions and review less carefully, since more sessions mean more billable hours, and quality becomes the casualty.

Option 3: Log wall-clock time

The Dev was dedicated to these two projects for about 4 hours. Even during the stretches when agents were running autonomously, she was on standby: monitoring output, mentally allocated, ready to intervene. That’s similar to how consultants bill for being in a room even when they’re not the one presenting. Split the 4h proportionally:

Project      Session time   Ratio   Logged
Project A    3.5h           47%     1.9h
Project B    4h             53%     2.1h
Total                               4h

This is defensible (“she was working on your project for 2 hours”), doesn’t blow up into impossible numbers on a full workday, and is easy to track.

The problem: it still undervalues the output compared to what a non-AI developer would have billed for the same scope, and it’s somewhat dishonest in the other direction since part of that “availability” time might have been spent on something else entirely.

Option 4: Equivalent manual hours

Estimate what the delivered work would have taken without AI and log that. If the fleet management feature would have been a full day of manual work (8h), log 8h to Project B. This preserves the client’s expectations from the pre-AI world: they see familiar-looking numbers for familiar-looking deliverables, and the total feels reasonable for the scope delivered.

The problem: it’s subjective and requires the developer to estimate backwards, which introduces guesswork. Two developers might estimate very differently for the same work, and over time this can drift away from reality in either direction.

Option 5: Real hours with a multiplier

Log the 1.5h of real work but apply a transparent AI productivity multiplier (say 2x or 3x), billing 3h or 4.5h. This could be formalized in the rate card: “Our AI-augmented rate is X/hour, reflecting that each hour of senior developer time produces Y hours of equivalent output.” It’s similar to how agencies already charge different rates for junior vs. senior developers, just extended to account for AI tooling.

The problem: finding the right multiplier is tricky since it varies by task, by developer skill, and by project complexity. And clients may push back on the concept: “So you’re charging me triple because a robot is doing the work?”
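The rate-card framing and the multiplied-hours framing are arithmetically the same thing, which a tiny sketch makes obvious. All numbers below are illustrative, not anyone’s actual rates:

```python
# Hypothetical sketch of Option 5's rate-card framing: an "AI-augmented
# rate" reflects that each real hour yields several hours of output.
BASE_RATE = 100   # pre-AI hourly rate (illustrative)
MULTIPLIER = 3    # assumed output per real hour (illustrative)

def invoice(real_hours):
    augmented_rate = BASE_RATE * MULTIPLIER
    return {
        "real_hours": real_hours,
        "equivalent_output_hours": real_hours * MULTIPLIER,
        "amount": real_hours * augmented_rate,
    }

print(invoice(1.5))
# {'real_hours': 1.5, 'equivalent_output_hours': 4.5, 'amount': 450.0}
```

Billing 1.5 real hours at a 3x-augmented rate and billing 4.5 equivalent hours at the base rate come to the same amount; the rate-card version just makes the multiplier explicit to the client instead of hiding it in the hours.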

The long-term direction: redefine the billing unit

Stop selling hours and start selling delivery capacity, sprint outcomes, or value-based packages, essentially redefining what a “ManDay” means in the context of AI-augmented teams.

This is probably the right answer. The client pays for what gets delivered, not for how it gets produced. An AI-augmented ManDay delivers more than a pre-AI ManDay, and the pricing should reflect both the increased value to the client and the investment (tooling, skills, process) the agency made to get there.

But it requires renegotiating contracts, re-educating clients, and figuring out pricing for something nobody has standardized yet. It doesn’t solve next Monday’s timesheet, which is why the options above still matter in the meantime.

The part that looks like “not working” is the hardest part

There’s a temptation to look at The Dev’s Wednesday afternoon and think it was easy. 1.5 hours of real work, most of it planning and reviewing, with long stretches of free time while agents ran in the background. That doesn’t look like a developer hunched over a keyboard writing code for eight hours straight.

But here’s what those 1.5 hours actually contained: understanding the codebase well enough to write effective prompts, breaking complex features into pieces an agent can handle, reviewing generated code for correctness and architectural fit, catching edge cases the agent missed, knowing when to intervene and when to let it run. Every one of those actions requires deep technical judgment.

Steve Yegge captured this perfectly in his recent essay “The AI Vampire”. His argument is that AI has automated the easy parts of software development (the actual writing of code, running tests, scaffolding features) and left humans with only the hard parts: decisions, reviews, quality judgment, architecture calls, and knowing when something is wrong.

Yegge goes further and argues that this kind of concentrated decision-making is only sustainable for about 3 to 4 hours a day. Not because people are lazy, but because it’s genuinely exhausting. His proposed solution is simple: the new AI-augmented workday should be 3 to 4 hours of focused cognitive work. “It may involve 8 hours of hanging out with people. But not doing this crazy vampire thing the whole time. That will kill people.”

The Dev’s 1.5 hours of afternoon focused work fits this pattern precisely, and the question isn’t whether she should have worked harder but what those 1.5 hours are worth.

And how do you even estimate anymore?

This isn’t just a billing problem, it’s an estimation problem too.

When a project manager asks “how long will this feature take?”, the answer used to be grounded in a shared understanding: one developer, writing code, for X hours. You could calibrate estimates against past experience because the unit of work was stable. And even that was famously imprecise.

Now that unit is unstable. The same feature might take 4 hours of wall-clock time, 1.5 hours of developer effort, and produce 9 hours of session output, and it’s genuinely unclear which of those numbers goes into the estimate or gets promised to the client.

If you estimate in developer effort (1.5h), the client sees a suspiciously small number and either questions the quality or wonders why it costs so much per hour. If you estimate in session output (9h), you’re back to the same problem as billing: the numbers don’t map to human time anymore.

And there’s a deeper issue: estimation accuracy depends on predictability, and AI agent sessions are less predictable than manual development. Sometimes a session nails a feature in one pass, and sometimes it takes three rounds of review and correction, so the variance is higher and estimation actually gets harder, not easier.

We’re still working this out, but the direction seems clear: estimates need to shift from “hours of work” to “scope of delivery.” Instead of promising the client 40 hours this sprint, you promise a set of deliverables, and how your team produces them, whether through 40 hours of manual work or 15 hours of AI-augmented work, becomes an internal detail.

The throughput ceiling is human, not technical

In theory, if 1.5 hours of active work produces 9 hours of total output, you could just stack more sessions. Run five projects in parallel instead of two. Push toward 20 hours of output per day.

In practice, this breaks quickly because the bottleneck was never the AI.

On a different day, the same developer worked about 6.5 active hours managing sessions across the same two projects. The sessions produced roughly 9 hours of output. She described her process: “I have a lot of checks, from tests to reviews to fixes and several passes through the entire code. That gives me top results while I work on something else. But it takes time.”

That’s the key phrase: “top results.” The quality that makes agentic development valuable depends entirely on the human maintaining standards. Each agent session needs real review. Run too many sessions in parallel, and you start rubber-stamping instead of reviewing.

There’s a natural throughput ceiling, and it’s set by human cognitive capacity, not by how many agent sessions you can spin up. Push past it and you don’t get more output, you get more rework, more production incidents, and eventually burned-out developers who stop catching the things that matter.

The value question agencies need to answer

This all boils down to one question: who captures the value that AI creates?

Yegge frames this as a spectrum. On one end, the company captures 100% of the value: the developer works at 10x productivity for the same salary, burns out, and gets nothing for it. On the other end, the developer captures 100%: they do 1 hour of real work, match their peers’ output, and coast. Neither extreme is sustainable; the answer has to sit somewhere in the middle.

For agencies billing T&M, the same spectrum applies but with a third party involved: the client. If you bill 1.5 hours for 9 hours of output, the client captures most of the value and your agency’s margins shrink even as your team gets better. If you bill 9 hours, the agency captures the value, but the client is effectively paying pre-AI rates for post-AI work, which holds up only as long as they don’t realize what’s happening.

The client should get more value per ManDay than they did before AI, and the agency should be more profitable per developer than they were before AI, so that both sides benefit and the pricing reflects reality.

The billing model will change.

We haven’t solved this completely, and I don’t think anyone has yet. Every agency using AI for development is going to face this. The T&M model, where hours are the unit of value, doesn’t map cleanly to a world where one developer with good AI workflows can produce what used to take two or three people.

You can keep logging hours and let margins erode, or overbill and hope nobody notices, or you can start having honest conversations with your clients about what you’re actually delivering and what that’s worth.

The agencies that figure this out first will have a real advantage. Not because they’ve gamed the system, but because they’ve built a pricing model that’s sustainable for their team, fair to their clients, and reflects the actual value of AI-augmented delivery.

We’re still figuring it out ourselves. If you’re dealing with the same thing, I’d be glad to compare notes.

