Jan 28, 2026
8 min read

I Built Software Entirely with Claude Code. Here's What Actually Happened.

An honest experiment in AI-assisted development: one day, one tool, and a working (but brittle) result.

I built a working API in one day using Claude Code with zero hand-written code.

The code works. It’s also crude, slow, and brittle. This is an honest assessment, not a hype piece.

The Experiment

I wanted to test how well AI can replicate existing software functionality. Not greenfield development where requirements are fuzzy. Pure replication where success is measurable.

The project is commercetools-promo-preview. An API that previews promotional discounts with time-travel capabilities. You give it a cart and a target date. It tells you what discounts would apply.
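
To make the shape of the thing concrete, here is roughly what a preview request and response look like. The field names below are illustrative, not the project’s exact contract:

// Hypothetical request/response shapes for the promo-preview endpoint.
// Names are illustrative, not the project's actual API contract.
interface PreviewRequest {
  targetDate: string;              // ISO timestamp to "time-travel" to
  cart: {
    currency: string;              // e.g. "EUR"
    lineItems: Array<{
      sku: string;
      quantity: number;
      centAmount: number;          // unit price in minor units
    }>;
  };
}

interface PreviewResponse {
  appliedDiscounts: Array<{
    discountId: string;
    type: 'percentageOff' | 'fixedPrice' | 'giftLineItem' | 'multiBuy' | 'bogo';
    discountedCentAmount: number;  // amount taken off, in minor units
  }>;
}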

Why this project? It’s complex enough to be a real test. Discount logic, predicate parsing, multiple discount types (percentage off, fixed price, gift items, multi-buy, BOGO). But there’s a clear reference implementation. The real commercetools API. Claude could compare its output to known-correct results.

MCP Access Was the Game-Changer

Claude had MCP (Model Context Protocol) access to the actual commercetools API.

This is where being API-first matters. commercetools is built on MACH principles (Microservices, API-first, Cloud-native, Headless). That’s not just an architecture slide for a conference talk. It means the entire platform is accessible through well-documented APIs. commercetools already offers its own MCP servers and AI tooling. The platform was ready for this experiment before I even thought of running it.

That’s the advantage of tech-forward infrastructure. When a new paradigm shows up, API-first platforms don’t need to retrofit. They already speak the language. Claude didn’t need a special adapter or a screen-scraping hack. It connected to the same APIs that power production storefronts.

This meant Claude could read the official documentation, test against the real implementation, compare its output to expected results, and self-correct when things didn’t match.

Here’s the key insight. Duplicating functionality means less planning. Claude could reverse-engineer rather than design from scratch. Every time it wasn’t sure if something was correct, it could just ask the real API.
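
In practice, “asking the real API” looks something like a parity test: compute a result locally, fetch the reference result from the real commercetools project, and compare. The helpers below are hypothetical stand-ins for the project’s actual test setup:

import { describe, it, expect } from 'vitest';

// Hypothetical helpers: one calls the locally built preview service, the other
// fetches the reference result from the real commercetools project.
import { previewLocally, previewViaCommercetools } from './helpers';

describe('parity with the reference implementation', () => {
  it('matches commercetools for a simple two-item cart', async () => {
    const cart = {
      currency: 'EUR',
      lineItems: [{ sku: 'SKU-1', quantity: 2, centAmount: 10000 }],
    };

    const expected = await previewViaCommercetools(cart); // reference result
    const actual = await previewLocally(cart);            // locally built API

    // Any mismatch is a signal to go back and adjust the local implementation.
    expect(actual.appliedDiscounts).toEqual(expected.appliedDiscounts);
  });
});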

Not Quite Autonomous

There’s a lot of hype right now about fully autonomous AI coding. The Ralph technique is getting mainstream attention for doing exactly that. The pitch is simple: a bash loop feeds Claude’s output back into itself.

while :; do cat PROMPT.md | claude-code ; done

Brute force meets persistence. Clone commercial software for $10 an hour. Y Combinator startups are reportedly adopting the technique.

It’s a compelling narrative. It’s also not the full story.

I chose direct supervision instead of letting Claude run autonomously. Partly because MCP gave Claude access to external systems. Real API calls. Real data. I prefer more control when AI has real-world access.

But also because I wanted to see what actually happens when you try this. The “$10 to clone software” framing omits a critical detail. You can clone functionality. You can’t clone quality. My experiment produced working code in a day. It also produced code that’s brittle, slow, and architecturally crude. The cost of making it production-ready would dwarf the cost of generating it.

The Workflow: Guidance Every 10 Minutes

Claude built its own plans and PRDs based on documentation. It wrote tests before implementation. It ran those tests and fixed failures.

But it required intervention roughly every 10 minutes.

Types of guidance needed:

  • Encouragement to continue when uncertain
  • Redirects when going down wrong paths
  • Clarification on edge cases
  • “Yes, that’s right, keep going”

Not fully autonomous. But dramatically less effort than writing it myself. I spent my time steering, not typing.

What Got Built

The numbers tell part of the story:

  • 527 tests across 22 suites
  • Tech stack: Hono 4.x, TypeScript 5.9, Zod, Vitest
  • Features: percentage discounts, fixed price, gift items, multi-buy, BOGO
  • Full predicate DSL parsing for commercetools discount conditions

The API handles complex discount scenarios. Stacking rules. Cart-level vs line-item discounts. Custom predicate evaluation.
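
To give a feel for what “predicate DSL parsing” means here: commercetools discounts are gated by predicates, small expressions that a cart or line item either satisfies or doesn’t. A heavily simplified evaluator for a toy subset of that syntax might look like the sketch below; the real grammar is much richer, with grouping, functions, and money literals:

// Toy evaluator for a tiny subset of predicate syntax: clauses joined by "and",
// each of the form `<field> <op> <value>`. Illustrative only.
type LineItem = { sku: string; quantity: number; centAmount: number };

function evaluatePredicate(predicate: string, item: LineItem): boolean {
  return predicate.split(/\s+and\s+/i).every((clause) => {
    const match = clause.trim().match(/^(\w+)\s*(=|!=|>=|<=|>|<)\s*(.+)$/);
    if (!match) throw new Error(`Unsupported clause: ${clause}`);
    const [, field, op, rawValue] = match;

    if (!(field in item)) throw new Error(`Unknown field: ${field}`);
    const actual = item[field as keyof LineItem];
    const value = rawValue.startsWith('"')
      ? rawValue.slice(1, -1)   // quoted string, e.g. "SKU-1"
      : Number(rawValue);       // bare number, e.g. 2

    switch (op) {
      case '=':  return actual === value;
      case '!=': return actual !== value;
      case '>':  return Number(actual) > Number(value);
      case '>=': return Number(actual) >= Number(value);
      case '<':  return Number(actual) < Number(value);
      case '<=': return Number(actual) <= Number(value);
      default:   throw new Error(`Unsupported operator: ${op}`);
    }
  });
}

// evaluatePredicate('sku = "SKU-1" and quantity >= 2',
//   { sku: 'SKU-1', quantity: 2, centAmount: 10000 }) // => true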

Here’s what a typical test looks like:

import { describe, it, expect } from 'vitest';

describe('PercentageOffCalculator', () => {
  it('should calculate percentage discount correctly', () => {
    const calculator = new PercentageOffCalculator();
    const lineItem = createLineItem({
      price: 10000, // unit price in minor units
      quantity: 2
    });

    const result = calculator.calculate(lineItem, {
      permyriad: 1500 // 15% off (permyriad = parts per ten thousand)
    });

    // 15% of 10000 × 2 = 3000
    expect(result.discountedAmount).toBe(3000);
  });
});

Claude wrote hundreds of these. Each one testing specific discount behavior against the reference implementation.

The Results: It Works, But…

It works. The API correctly calculates promotional discounts with time-travel. You can ask “what discounts would apply to this cart on Black Friday?” and get accurate results.

It’s brittle. Any change in the commercetools API will break this implementation. There’s no abstraction layer to absorb upstream changes. It’s a snapshot of current behavior, frozen in code.
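
To illustrate what’s missing: the kind of seam a developer would normally add is a small adapter interface that the discount logic depends on, so an upstream change only touches one module. This is my sketch of the gap, not code from the generated project:

// Hypothetical adapter seam. The discount engine would depend on this
// interface rather than on raw commercetools response shapes, so an upstream
// API change only forces a change in the adapter implementation.
interface NormalizedDiscount {
  id: string;
  type: 'percentageOff' | 'fixedPrice' | 'giftLineItem' | 'multiBuy' | 'bogo';
  predicate: string;   // raw predicate string, parsed elsewhere
  value: number;       // permyriad or cent amount, depending on type
}

interface DiscountSource {
  fetchActiveDiscounts(targetDate: Date): Promise<NormalizedDiscount[]>;
}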

It’s crude. The architecture is functional but unrefined. Meets requirements but wouldn’t scale in complexity. Adding new discount types would require touching too many files.

It’s slow. Slower than the actual commercetools API, even running on localhost. No caching. No optimization. It works, but it wouldn’t survive production traffic.
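
Caching here could be as simple as memoizing the active-discount lookup per target date. A minimal sketch, illustrative only; a real version would need TTLs and invalidation:

// Cache the active-discount lookup per calendar day so repeated previews for
// the same target date don't refetch. Hypothetical; not in the generated code.
const discountCache = new Map<string, Promise<unknown[]>>();

function getActiveDiscounts(
  fetchActiveDiscounts: (targetDate: Date) => Promise<unknown[]>,
  targetDate: Date,
): Promise<unknown[]> {
  const key = targetDate.toISOString().slice(0, 10); // one entry per day
  let cached = discountCache.get(key);
  if (!cached) {
    cached = fetchActiveDiscounts(targetDate);
    discountCache.set(key, cached);
  }
  return cached;
}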

In short: the code hits the requirements, but it would struggle to scale beyond them.

What This Tells Us About AI Coding Tools

Good at:

  • Replication and pattern-following
  • Test-driven development
  • Reading and implementing from documentation
  • Grinding through repetitive logic

Needs help with:

  • Elegant architecture
  • Performance optimization
  • Anticipating future requirements
  • Knowing when to stop and refactor

The sweet spot is well-defined tasks with clear success criteria. Tests passing. Matching a reference implementation. “Make this output equal that output.”

This connects to something I wrote about previously. AI is becoming an abstraction layer in programming. It handles the boilerplate. It grinds through the tedious parts. But humans are still needed for judgment calls.

When should we optimize? What’s the right abstraction? How will this need to change? These questions require context that AI doesn’t have.

AI Is an Accelerator, Not a Replacement

A great software developer who knows how to build a feature can build it faster with AI. The quality stays similar because the developer is still making the architectural decisions. AI handles the typing. The human handles the thinking.

But AI also compounds bad decisions. Every shortcut, every ignored abstraction, every “just make it work” gets baked deeper into the codebase. AI won’t push back. It won’t say “this approach will hurt us in three months.” It will happily dig a tech debt hole as deep as you let it.

This is the gap that trips people up. Without real development experience, the basics are easy. Hello world works on the first try. A landing page comes together in minutes. It feels like anything is possible. Then complexity shows up and everything changes.

This is why you see vibe coders posting about how “AI has gotten worse” or “they nerfed Model XYZ.” The model didn’t get worse. Their project got more complex. On day one, AI can build features fast because the codebase is simple. By day thirty, the codebase is a mess of compounding decisions nobody consciously made. AI struggles not because it’s less capable, but because the project has become harder to work in.

They don’t see the difference in complexity. They just see AI being slower at adding features compared to day one and blame the model. The real problem is their own project. The tech debt. The missing abstractions. The architecture that was never intentionally designed because AI doesn’t do that on its own.

You need actual skills to keep a project going past the prototype phase. AI can’t do it alone. My experiment proved that in a single day. The code works. Scaling it would require a developer who understands why it’s brittle and knows how to fix that. AI built the house. A human still needs to make sure it won’t fall down.

Would I Do This Again?

For an experiment or prototype? Yes, absolutely. One day of steering beats weeks of typing. The result was good enough to validate the concept.

For production software? Only with significant refactoring afterward. The code needs architectural review. Performance work. Proper error handling. All the things that separate “it works” from “it’s ready.”

The honest take: AI didn’t replace me. It was a force multiplier that still needed steering. Like having a very fast junior developer who never gets tired but also never questions whether the approach makes sense.

The question isn’t whether AI can build software. It’s whether you’re comfortable with the tradeoffs. Speed vs elegance. Working vs maintainable. Done vs done right.

For this experiment, I was comfortable with those tradeoffs. The result is a working API that proves the concept. It’s also a codebase I’d be nervous to maintain long-term.

That’s the honest answer. AI-assisted development is real and useful. It’s also not magic. You still need to know what good software looks like. Because AI will happily build mediocre software all day long if you let it.