Building an AI Native ERP With Claude Code: Spec First Methodology
How a 33 section spec, 18 constitutional financial laws, and Claude Code shipped a 48 module open source ERP that a 200 person team would take a decade to build.
The first artifact of ERPClaw was not code. It was a 9,766 line specification, 33 sections long, written in plain English. Every table (191 of them at v1, 789 today), every action (1,095 at v1, 3,148 today), every naming convention, every validation rule, every test scenario was defined before a single line of Python existed.
That document is the reason an AI native ERP exists today as a 48 module open source system instead of as a half finished demo. The lesson I want to share in this post is not “Claude Code is amazing” (it is, but that is not the interesting part). The lesson is that AI native is not AI decorated. Decoration is bolting a chat sidebar onto a SaaS product you already shipped. AI native is changing how the software gets built, tested, and priced from the first commit.
If you are a CTO, an engineering manager, or a founder thinking about how to compete with vendors that have a ten year head start, the rest of this post is the playbook I used. It is opinionated, it is reproducible, and it is what made the build economics work.
My background, and why it matters here
I spent eleven years rolling out SAP and other enterprise systems at Accenture and as an independent architect. I have sat in the rooms where a Fortune 500 retailer paid $50 million for an ERP rollout that ran 18 months late, and I have written the requirements documents that 200 person delivery teams then took 18 months to translate into something resembling working software.
That experience shaped two convictions. First, the bottleneck in enterprise software is almost never coding speed; it is requirements clarity, cross team coordination, and the cost of fixing things that should have been specified up front. Second, mid market companies (the 40 to 500 employee shops) are systematically underserved by the SAPs and Oracles of the world, because the per seat economics do not work below a certain scale.
When Claude Code matured into a tool I could trust for production work in late 2025, both convictions became actionable. The coding bottleneck collapses. The cost of building equivalent scope drops by an order of magnitude. The mid market vacuum is suddenly addressable by a single architect with a good spec and a $20 a month server. That is what ERPClaw is. The point of this post is the methodology, not the product.
Spec first development, in concrete terms
Most teams treat a specification as a starting point that gets revised heavily during implementation. In spec first development, the specification is the contract. Code is generated from it. When the spec changes, the code is regenerated. When the code drifts from the spec, the code is wrong, not the spec.
For ERPClaw, the spec lives in three layers:
- The master plan. The original 9,766 line document covering data model, action catalog, GL semantics, naming conventions, test scenarios, and module boundaries. Nothing in the code exists that is not described here.
- Per module SKILL.md files. Each of the 48 modules has a YAML fronted markdown file under 300 lines that lists every action, its parameters, its return shape, and its tier (basic, intermediate, advanced). This is what Claude Code reads when it generates new actions or fixes existing ones.
- The Constitution. 18 machine readable financial laws (described below) that any module must satisfy, regardless of who or what wrote it.
A SKILL.md entry for a single action looks roughly like this:
```yaml
- name: submit-sales-invoice
  tier: intermediate
  description: Submit a draft sales invoice, posting GL and updating SLE atomically.
  args:
    invoice_id: { type: string, required: true, format: uuid4 }
    posting_date: { type: string, required: false, format: date }
  returns:
    journal_entry_id: string
    gl_balanced: boolean
  invariants: [gl_debits_equal_credits, ar_subledger_matches_control]
```
That spec block is the source of truth for four audiences at once: Claude Code (which generates the implementation), the test suite (which generates contract tests from the schema), the web dashboard (which renders forms from it), and the human reading the docs. A traditional ERP frontend needs about 150 lines of form code per action. Across 3,148 actions that is roughly 470,000 lines of UI code that simply does not need to exist when the spec drives the surfaces.
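To make the single source of truth tangible, here is a minimal sketch of how an `args` block like the one above can drive runtime validation; the function and its names are illustrative, not ERPClaw's actual code.

```python
# Minimal sketch: one args schema, parsed once, validating inputs for
# every surface. Illustrative only; not ERPClaw's actual implementation.
import uuid
from datetime import date

ARGS_SPEC = {
    "invoice_id":   {"type": "string", "required": True,  "format": "uuid4"},
    "posting_date": {"type": "string", "required": False, "format": "date"},
}

def validate_args(payload: dict, spec: dict = ARGS_SPEC) -> list[str]:
    errors = []
    for name, rule in spec.items():
        if name not in payload:
            if rule["required"]:
                errors.append(f"missing required arg: {name}")
            continue
        value = payload[name]
        if rule["type"] == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
            continue
        if rule.get("format") == "uuid4":
            try:
                if uuid.UUID(value).version != 4:
                    errors.append(f"{name} is not a uuid4")
            except ValueError:
                errors.append(f"{name} is not a valid uuid")
        elif rule.get("format") == "date":
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{name} is not a valid ISO date")
    return errors
```

The same dictionary feeds the generated contract tests (an invalid uuid must be rejected) and the rendered form field, which is the point: the validation logic is never written by hand, in any of the four places.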
The discipline this enforces is what makes the AI assisted coding work. A well specified action produces working code on the first generation about 90 percent of the time. An underspecified action produces plausible looking code that fails on edge cases about 90 percent of the time. The 20 percent of time spent on the spec saves 80 percent of the debugging.
The Constitution: 18 financial laws, auto validated
The riskiest thing about using AI to build accounting software is that the AI will happily write a general ledger posting function that uses floating point arithmetic. Your trial balance will be off by a penny after a thousand transactions, and nobody will notice until the auditor does.
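The failure mode takes a few lines of Python to demonstrate:

```python
from decimal import Decimal

# Ten dimes should make exactly one dollar.
float_total = sum(0.1 for _ in range(10))
decimal_total = sum(Decimal("0.1") for _ in range(10))

print(float_total == 1.0)             # False: accumulates to 0.9999999999999999
print(decimal_total == Decimal("1"))  # True: exact decimal arithmetic
```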
The fix is not “tell the AI not to do that.” The fix is a constitutional rules engine that rejects any code (human or AI written) that violates a financial law. ERPClaw has 18 of these laws, each expressed as an executable assertion, each enforced at test time across the entire codebase. A short selection:
- Article I: No floats for money. Every monetary column is TEXT, every Python value is a `Decimal`. The validator scans schema definitions and source code; any `REAL` or `float` near a money name fails the build.
- Article III: GL is immutable. The `gl_entry` table has no `updated_at` column. Cancelling a posting creates a mirror reversal entry, never an update. The validator confirms no UPDATE statements target the GL tables.
- Article V: Atomic submission. Every submit action wraps its writes in a single SQLite transaction. The validator parses every submit handler and confirms a `BEGIN ... COMMIT` boundary surrounds the cross table writes.
- Article IX: Twelve step GL validation. Every posting passes through a 12 check pipeline: balanced, no nulls, party set on AR/AP, fiscal year open, account active, currency consistent, and so on. The validator confirms the pipeline is invoked.
- Article XII: Trial balance integrity. After every test run that touches the GL, total debits must equal total credits across the entire database. If the global invariant fails, every GL touching test in the run fails.
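To make Article XII concrete, here is a minimal sketch of a whole database invariant as executable code, assuming a `gl_entry` table with TEXT debit and credit columns; the schema names are illustrative, not ERPClaw's actual ones.

```python
# Minimal sketch of an Article XII style check. Table and column names
# are illustrative, not ERPClaw's actual schema.
import sqlite3
from decimal import Decimal

def assert_trial_balance(conn: sqlite3.Connection) -> None:
    rows = conn.execute("SELECT debit, credit FROM gl_entry").fetchall()
    # Convert TEXT to Decimal in Python; never let SQLite sum as float.
    total_debit = sum((Decimal(d) for d, _ in rows), Decimal("0"))
    total_credit = sum((Decimal(c) for _, c in rows), Decimal("0"))
    assert total_debit == total_credit, (
        f"trial balance broken: debits {total_debit} != credits {total_credit}"
    )
```

Run after every GL touching test, a check like this turns a bad posting function into a red build instead of a penny discrepancy discovered at audit time.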
If any single article is violated, the offending module cannot ship. There is no human override. This is the regression proof bit: it is impossible to accidentally break double entry bookkeeping and have green tests, because the invariant engine runs over the whole database after every relevant test, not over isolated unit assertions.
This is what makes ERPClaw safe to extend at AI speed. The AI does not need to be perfect. The Constitution will catch it when it is wrong.
The 80/15/5 theory
Building ERPClaw taught me that ERP module development decomposes into three layers with very different automation profiles:
- 80 percent mechanical. Schema creation, CRUD action scaffolding, naming conventions, audit logging, parameter validation, list and get endpoints. This work is repetitive, well specified, and identical across modules. AI does it perfectly when given the spec.
- 15 percent pattern matching. GL posting patterns, cross module integration glue, report templates, common workflows like draft to submit lifecycle. AI does this well when given the right examples and a checklist of invariants. Human review catches the cross module surprises.
- 5 percent human judgment. Domain edge cases (a partial payment against a multi line invoice with a discount), business specific UX decisions, regulatory subtleties, the questions the AI does not know to ask. This is irreducibly human.
The mistake most teams make with AI coding is treating it as a uniform 100 percent. Either they trust it for everything (and ship floats in money columns) or they distrust it for everything (and waste a decade of free productivity). The right model is to automate the 80, assist the 15, and reserve human attention for the 5 that matters.
This is also what destroys the cost structure of vendors who are not AI native. A 200 person engineering team that spends 80 percent of its hours on mechanical work cannot compete on price with a small team whose mechanical work is automated. The Kodak parallel is exact: it is not that the new product is better, it is that the cost base of the incumbent is no longer defensible.
What Claude Code does well, and where I override it
After fifteen months of using Claude Code as the primary implementation surface for an open source ERP, here is the honest assessment.
It excels at translating well specified business rules into working code. Give it a SKILL.md entry, the relevant table schema, and the Constitution, and it produces a passing implementation on the first try the vast majority of the time. It does not get bored on action 800. It maintains naming consistency across 48 modules over weeks of sessions, as long as the spec stays consistent. It is also excellent at test scaffolding, schema migrations, and the unglamorous refactoring tasks (renaming a column across 48 modules, updating a shared library signature) that consume disproportionate human time.
The failure modes matter more than the wins, because they tell you where to spend human attention. Cross module dependencies break first; intercompany invoicing required heavy manual correction because the AI optimised each module locally and missed the global invariants. Edge cases not covered in the spec are a guaranteed regression source: GL reversals with partial payments, garnishment priority ordering, multi currency revaluation when the rate changes mid period. The fix is to add them to the spec the moment you discover them.
Security awareness is approximately zero by default. Claude Code will happily ship your home directory path in an error message or a real Indian taxpayer ID in seed data. I caught 21 such issues in a single audit pass across 220 files; every one was functionally correct and contextually careless. The fix is a security audit of the output, every time, treating AI generated code as if it came from a brilliant but careless junior engineer.
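Part of that audit can be mechanised. Here is a minimal sketch of the kind of scan that catches leaked paths and identifier shaped strings; the patterns are illustrative, and a real pass needs a much wider net.

```python
# Minimal sketch of a leakage scan over generated source files.
import re
from pathlib import Path

LEAK_PATTERNS = [
    re.compile(r"/home/\w+"),               # hardcoded Linux home paths
    re.compile(r"/Users/\w+"),              # macOS equivalent
    re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),  # PAN shaped taxpayer IDs in seed data
]

def scan(root: str) -> list[tuple[str, int, str]]:
    findings = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            for pattern in LEAK_PATTERNS:
                if pattern.search(line):
                    findings.append((str(path), lineno, line.strip()))
    return findings
```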
How I keep the AI honest
Building an open source ERP the size of ERPClaw means generating a lot of code, often unsupervised. The trust model that makes that safe rests on five layers of automated checking, each of which can fail a release independently.
L0 constitutional tests (270 tests). The 18 articles, plus completeness checks (every Python action documented in SKILL.md; sketched below), plus structural checks (every module has the required files in the required places).
L2 contract tests (3,088 tests). Generated from the SKILL.md specs. Every action is invoked with valid and invalid inputs and the response shape is checked against the schema. This catches drift between spec and implementation immediately.
L3 smoke tests (248 tests). End to end scenarios that exercise full workflows: quote to cash, procure to pay, hire to retire, manufacturing run with WIP accounting.
Invariant engine (23 checks). Runs after every test that touches financial data. Trial balance balanced, balance sheet equation holds, GL chain hash sequential, no NaN in any financial column, every cancellation has a matching reversal. If any invariant fails, every test in the run fails.
Six gate session pipeline. Local validation, server deploy, vertical install, natural language smoke test, GL integrity, CI status. The session gate is the last line of defence before code reaches a user. The full gate description is on the quality page.
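To give one of these layers a concrete shape, here is a minimal sketch of the L0 completeness check, assuming SKILL.md entries of the form shown earlier; the helper names and the registry argument are illustrative.

```python
# Minimal sketch of an L0 completeness check: every registered action
# must appear in its module's SKILL.md. Names are illustrative.
import re
from pathlib import Path

def spec_action_names(skill_md: Path) -> set[str]:
    # Matches entries like "- name: submit-sales-invoice".
    return set(re.findall(r"^- name: (\S+)", skill_md.read_text(), re.MULTILINE))

def test_every_action_is_documented(registered_actions: set[str], skill_md: Path):
    undocumented = registered_actions - spec_action_names(skill_md)
    assert not undocumented, f"actions missing from SKILL.md: {sorted(undocumented)}"
```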
The cumulative effect is a regression prevention checklist that runs in seconds, scales with the codebase, and does not depend on a human remembering to run it. It is one thing to ship 48 modules in a sprint. It is another to keep them shipping correct GL postings six months later, after another 500 actions have been added.
The spec is the source of truth, the code is regenerable
The deepest implication of spec first development for an AI native codebase is that the code is no longer the asset. The spec is the asset. The code is a derived artifact.
This sounds esoteric until you watch it play out. When I added a new region (UK PAYE and NI), I did not write a regional payroll module from scratch; I added the regional rules to the spec, regenerated the affected actions, ran the constitutional and contract tests, and shipped. The thinking was already done. When I migrated from one library structure to another, I did not refactor 48 modules by hand; I updated the spec, regenerated the import patterns, and let the test suite tell me what was wrong. A migration that would have been a multi week project on a hand written codebase was a long afternoon.
This regenerability changes the economics. Your competitor’s codebase is the thing they cannot afford to throw away, because they paid millions to write it. Your codebase is something you can rebuild from the spec in a weekend if you find a better architecture.
What this means for non AI native competitors
A vendor who built a comparable ERP between 2010 and 2025 is now sitting on a code asset with three properties: it cost a great deal to build, it is expensive to maintain, and it is locked into the architectural choices of its era. Their per seat pricing is a function of all three.
An AI native ERP built on the spec first methodology has none of those properties. The build cost was an order of magnitude lower. The maintenance cost is bounded by the spec and the test suite. The architecture can be regenerated when the underlying tools improve.
The implication for pricing is the part most incumbents have not yet absorbed. ERP, CRM, project management, HR, invoicing: these are commodity problems with public business rules. The code was the moat, and the moat was the cost of writing it. That cost is now collapsing. Open source AI native systems will eat commodity SaaS the same way Linux ate proprietary Unix. The vendors who survive will sell domain expertise, regulatory compliance, distribution, and trust. Not code.
What two weeks of AI assisted building looks like
The original ERPClaw sprint was 14 days, working solo, with Claude Code as the primary implementation surface. Day one was the spec, in full, with no code written. Days two through eleven were two modules per day on average: GL and journals first, then supply chain, operations, payroll and HR, then intelligence and compliance. Days twelve through fourteen were the testing overhaul, the clean install gate, and the security audit that caught 21 findings I should have caught the first time.
The point is not the timeline. The point is the answer to the question every CTO needs to answer: what does your 200 person engineering team do for 18 months that a single architect with a good spec and Claude Code cannot do in two weeks? The answer, mostly, is coordination. AI eliminates the coding bottleneck. Small teams eliminate the coordination bottleneck. Together, that is the order of magnitude.
FAQ
Is Claude Code production ready for building real software?
Yes, with the right scaffolding. Claude Code on its own is a brilliant junior engineer with no instinct for what should not ship. Wrap it in a constitutional rules engine, a contract test layer, and a session gate, and it becomes a production capable implementation surface. We have shipped 3,148 actions across 48 modules with this setup.
What is the difference between AI native and AI decorated?
AI decorated is bolting a chat sidebar onto a product architected in 2015. AI native is treating AI as the primary implementation layer from day one, which changes the data model (metadata driven), the test strategy (invariant engines), the pricing model (no per seat economics), and the team shape (3 to 5 people, not 200).
Why spec first instead of just prompting harder?
Because prompts do not version, do not test, and do not survive across sessions. A spec is a versioned, testable, reviewable artifact that drives multiple surfaces (AI, API, UI, docs) from one source. Prompt engineering is a tactic; spec first is an architecture.
How is the Constitution different from a linter?
A linter checks syntactic patterns. The Constitution checks semantic invariants across the entire system after the code runs. Article XII (trial balance integrity) cannot be enforced by a linter, because it requires running the test suite and inspecting the resulting database. The Constitution is closer to a property based test framework specialised for financial software.
Can I see the spec and the code?
Yes. ERPClaw is MIT licensed and the entire codebase, SKILL.md files, Constitution, and test suite are public on GitHub. The original HackerNoon piece covers the broader story; this post is the engineering view.
Where to go next
If you are a developer or technical buyer evaluating whether AI assisted coding holds up under the demands of real financial software, the best entry point is the developers page. It walks through the SKILL.md format, the Constitution, and the contract test layer with code samples. The quality page covers the test pyramid and the session gate in detail. The features overview is the product surface; the ERPClaw OS page covers the self extending architecture that grew out of the spec first methodology.
If you want to use the system, `clawhub install erpclaw` is the one command install. The pricing page is short, because the answer is zero for the open source edition. Cloud managed comes later in 2026 for teams that want a hosted instance.
If you are a CTO evaluating what AI changes about the way you ship software, I would rather hear your skepticism than your applause. The methodology in this post is testable. The codebase is open. Reproduce it, break it, or improve it. That is what an AI native open source ERP is for.