TL;DR — Two months ago I embodied Karpathy’s autoresearch as a “let paulkuo.tw mutate its own surface files automatically” system. Seven weeks later it had silently failed. A three-way deliberation across four sessions revealed that a personal IP site shouldn’t pursue fully autonomous self-optimization. Autoresearch’s right embodiment for a personal IP context is distributed autoresearch via Chat-Cowork-Codex-Code-Paul five-party deliberation — I had been running it implicitly all along, just hadn’t named it. This very piece of writing is one of those cases.

In early March 2026, Karpathy released autoresearch: a tiny autonomous research loop that lets an AI agent experiment on a single-GPU nanochat training setup. The mechanism is plain — the agent edits the training script, runs a short training pass, checks whether the result got better, keeps it if so, discards it if not. It’s not a “chatbot demo” but a prototype that compresses the research process into an automatic loop.

I had been implicitly running something parallel for a month already. Not nanochat — paulkuo.tw’s “site surface auto-mutate”: let an LLM agent run a loop reading five whitelisted files (llms.txt / siteSchema.ts / mcp.json / agent-card.json / robots.txt), propose a mutation, let Cloudflare Pages rebuild, let an eval-worker score AI Comprehension, keep the change if the score improved, revert if not. I thought I understood his vision. Turns out I only understood half.
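The keep-or-revert loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code that ran on paulkuo.tw: `propose` and `score` are hypothetical stand-ins for the LLM agent and the eval-worker, and the dictionary of file contents is an invented representation of the whitelisted files.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]  # hypothetical: file path -> file content


def keep_if_better(
    state: State,
    propose: Callable[[State], State],
    score: Callable[[State], float],
    rounds: int,
) -> Tuple[State, List[dict]]:
    """Greedy loop: keep a mutation only when it strictly improves the score."""
    best_score = score(state)
    log: List[dict] = []
    for i in range(rounds):
        # The agent mutates a copy, so a rejected candidate never touches the kept state.
        candidate = propose(dict(state))
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        if kept:
            state, best_score = candidate, candidate_score
        # Every round is logged, including rejected ones.
        log.append({"round": i, "score": candidate_score, "kept": kept})
    return state, log
```

The loop is only as good as `score`: if the score is sparse, delayed, or unreachable, every round is rejected and nothing in the loop itself complains.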

The sketch I had on the whiteboard during that time was straightforward: four scoring layers (llms.txt / JSON-LD / MCP+A2A / AI Comprehension) + a reward function (climb from 85 to 95) + a mutation pool (let the agent pull a mutation from the pool each day). Looking at that sketch I felt this was Karpathy’s v1, embodied for my personal IP scenario — he runs model training, I run website optimization, same underlying spirit.

I remember the feeling vividly. That kind of “finally found the right metaphor” certainty, paired with the “automation will free me” thrill. I gave the system a name — AI-Ready Continuous Optimization System — and wrote an article saying paulkuo.tw was becoming a self-evolving site. Published on 3-22. The system was still running then — though I didn’t know it, the previous day had been its last successful mutation.

Seven weeks later I discovered experiments.json had been 0 bytes all along

On 3-23, commit bfacd167 truncated experiments.json to 0 bytes. The last successful mutation was on 3-21. After that, every run ended in mutation_failed: prompts ran over 200k tokens, and the agents couldn’t even get the reward signal.

But the system never errored. No alert. No cron fail. It just stopped producing new experiments and sat there quietly. I thought it was running.

By the time I looked back on 5-12, it had been silently dead for seven weeks.

The most dangerous gap in governance isn’t the spectacular failure. It’s this kind of silent death — a system you believe is operating, that has long ago stopped operating, and you don’t even know it has stopped.
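A silent death like this is cheap to detect after the fact. The sketch below is one possible liveness check, assuming only that the experiment log is a JSON file that should be non-empty and should change regularly. The function name and the `max_age_days` threshold are hypothetical, and nothing here reflects the real schema of experiments.json.

```python
import json
import time
from pathlib import Path
from typing import List, Optional


def liveness_problems(
    log_path: Path, max_age_days: float, now: Optional[float] = None
) -> List[str]:
    """Return human-readable problems; an empty list means the log looks alive."""
    now = time.time() if now is None else now
    if not log_path.is_file():
        return [f"{log_path} is missing"]
    stat = log_path.stat()
    if stat.st_size == 0:
        # The failure mode from this story: a truncated, empty log.
        return [f"{log_path} is 0 bytes"]
    problems: List[str] = []
    age_days = (now - stat.st_mtime) / 86400
    if age_days > max_age_days:
        problems.append(f"{log_path} last changed {age_days:.1f} days ago")
    try:
        json.loads(log_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        problems.append(f"{log_path} is not valid JSON")
    return problems
```

A scheduled job could call this and exit non-zero whenever the list is non-empty, so the scheduler itself surfaces the stall. One caveat: modification time only means something where the file persists between runs. In a fresh git checkout it reflects checkout time, so the age check would need a timestamp stored inside the log instead.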

But the seven-week stall was just the symptom. When I handed this question to Chat / Cowork / Codex for three independent investigations, what they revealed was something more upstream — three blind spots: autonomy is overrated; the ceiling for machine-side optimization is low; “infrastructure for dialogue” ≠ “optimizing the website.”

No top-tier personal IP site is running a mutation engine

I had thought a mutation engine was autoresearch’s right embodiment for a personal IP scenario. Checking that belief against industry data, I realized I had confused two paradigms.

ALLMO saw something striking — among the AI-cited sources it analyzed, the share of sites carrying llms.txt was vanishingly small. SE Ranking ran the same analysis across nearly 300,000 domains and arrived at the same place: there’s currently no visible correlation between llms.txt and AI citation frequency. This doesn’t mean llms.txt has no value. It means it’s not yet a growth lever your reward function can chase directly.

📊 Key Data

  • ALLMO analysis: of 120 AI-cited websites, less than 1% carried llms.txt
  • SE Ranking analysis: nearly 300,000 domains; no statistically significant difference in AI citation rate between sites with/without llms.txt
  • paulkuo.tw mutation engine experiment: 5 whitelisted files; 2 of 5 paths were wrong (public/llms.txt doesn’t exist; public/mcp.json is actually at .well-known/)
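The wrong-path finding in the last bullet is the kind of error a pre-flight check catches before any mutation runs. A minimal sketch, assuming the whitelist is a list of paths relative to the repository root (the function name is hypothetical):

```python
from pathlib import Path
from typing import Iterable, List


def missing_whitelist_paths(repo_root: Path, whitelist: Iterable[str]) -> List[str]:
    """Return every whitelisted path that is not an existing file under repo_root."""
    return [rel for rel in whitelist if not (repo_root / rel).is_file()]
```

Refusing to start the loop when this list is non-empty turns a silently ignored file into a visible failure on day one.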

My read: ALLMO and SE Ranking are saying the same thing — llms.txt is hygiene, not moat. It might be useful for some future AI crawler, but today’s systems aren’t relying on it to decide whether to cite you. Google’s John Mueller has directly compared it to “the keywords meta tag” — at least at the time, no major AI service had committed to using it, and server logs don’t show consistent crawling.

Even more telling: after Mueller’s keywords-meta-tag analogy, Google’s own Chrome for Developers / Lighthouse documentation folded llms.txt into its agentic browsing audit, calling it an emerging convention for LLMs and AI agents. The same document explicitly notes that not providing llms.txt is currently N/A, not an error. That means it looks more like “potentially important hygiene” than an already-verifiable moat. For a personal IP site, putting it into the reward function means mistaking an industry convention that hasn’t yet converged for an optimizable core signal.

That’s the real lesson from the ALLMO report: the ceiling for machine-side optimization is low, and it’s too far from the reward. You spend a month building a 5-file mutation engine, and the < 1% in the ALLMO data won’t change because of you.

Vercel v0 isn’t a counterexample. The paradigm mismatch is the real issue

Someone might say: isn’t Vercel v0 exactly LLM-auto-mutating-websites? The product is successful. Why can’t a personal IP site adopt the same pattern?

v0 isn’t a counterexample. It’s a product. In August 2025, Vercel turned v0.dev into v0.app, repositioning it from “generate UI” to “use agentic intelligence to research, plan, debug, and build a full app.” Its reward signal comes from whether users actually ship the app, deploy it, and keep paying for it — those signals are dense, immediate, and they close the loop.

But a product and a personal IP site aren’t the same paradigm.

A personal IP site’s reward is “invitation quality” and “density of collaborators reaching out unprompted” — sparse, delayed, often only computable month over month. To transplant v0’s mutation pattern onto a personal IP site is to apply a product paradigm to a hygiene scenario where the reward signal isn’t dense enough to close the loop.

What I should have been benchmarking against is the next-level direction Karpathy added on X shortly after releasing v1. Autoresearch needs to go multi-agent, asynchronous, massively collaborative, SETI@home-style, with many different explorations happening at once:

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it’s to emulate a research community of them.

In other words, the point isn’t to emulate one lonely PhD student — it’s to emulate a research community: multiple agents exploring different hypotheses on different threads, with the effective ones promoted upward.

What I was running was the v1 single-thread paradigm. Karpathy pointed at where v2 should go shortly after v1’s release — but I didn’t read that tweet, ran my own loop for a month, and let it die silently for seven weeks.

I had been implicitly running v2 all along

Looking back at the reframing itself, I realized something: the conclusion of this article wasn’t reached by one person.

On 5-12 I threw “should the AI-Ready site mutation system be retired?” to three sessions in parallel:

  • Chat session entered from a philosophical angle — proposed a three-identity framing (personal IP / tool sites / writing site with different reward structures), cited ALLMO and Mueller, wrote a conceptual reframing report.
  • Codex session entered from an engineering audit angle — listed 13 cautionary tales + 7 deprecated paradigms previously chased, provided an Autonomy A0-A4 framework.
  • Cowork session entered from a local investigation angle: it ran 11 reconnaissance items (A1-A11), caught that 2 of the 5 whitelist paths in the mutation system were wrong, and discovered that the transport: client-side entry in mcp.json is declarative, not executable.

Three parties, independent, no cross-talk, each reported back. Then I converged the reports as PM and made the final call. The whole process, from issue raised to ADR Accepted, took four hours.
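The shape of that process (independent fan-out, then a human decision) can be sketched as follows. This is an illustration only: the real sessions were separate AI tools, not Python callables, and `deliberate`, `Investigator`, and `decide` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

Investigator = Callable[[str], str]  # takes the question, returns a report


def deliberate(
    question: str,
    investigators: Dict[str, Investigator],
    decide: Callable[[str, Dict[str, str]], str],
) -> str:
    """Fan the question out to independent investigators, then let `decide` converge."""
    if not investigators:
        raise ValueError("at least one investigator is required")
    with ThreadPoolExecutor(max_workers=len(investigators)) as pool:
        # Each investigator sees only the question, never another party's report.
        futures = {name: pool.submit(fn, question) for name, fn in investigators.items()}
        reports = {name: future.result() for name, future in futures.items()}
    # Convergence and the final call stay outside the parallel step, with the human.
    return decide(question, reports)
```

The design choice that matters is the isolation: reports are only combined inside `decide`, so one party’s framing cannot leak into another’s investigation.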

This is distributed autoresearch via five-party deliberation. The community here isn’t 1000 external agents; it’s the five-party deliberation table internal to paulkuo.tw — Chat, Cowork, Codex, Code, Paul.

Each party has its own strengths and its own blind spots. Chat is good at conceptual reframing and external research; Cowork is good at local reconnaissance and file-level facts; Codex is good at engineering audits and failure-mode cataloging; Code is the verifiable state; Paul does judgment, trade-offs, and final calls. The value isn’t “who’s smarter” — it’s that they don’t share the same blind spot. Cowork on its own would never catch the ALLMO research that Chat raised; Chat on its own would never catch the specific local fact that Cowork grep’d out (“2 of 5 whitelist paths wrong”).

I had been running it. Just hadn’t named it. Karpathy wrote v1 for the community; I’m writing v2 for the five-party deliberation table.

The deeper insight: the three-identity framing is the structural cause of the seven-week stall. My personal IP site, tool sites (Agora Plaza / Builder’s Scorecard / claude-usage-nyan), and writing site (84 articles, four languages) — these three identities have fundamentally different reward structures, yet they were forcibly bundled under “let paulkuo.tw self-evolve” inside the same mutation engine. Category confusion.

Tool sites have product metrics built in, so they fit autonomous loops. The writing site can be split: Schema.org coverage is hygiene that can be automated, while quality audit requires a human in the loop. A personal IP site’s reward is too sparse for autonomous loops. Bundling three different reward structures under one reward function produced seven weeks of silent failure, with me believing it was running.

Looking back, the Builder’s Scorecard case makes a sharp contrast. Builder’s Scorecard is a tool site with product metrics built in (monthly evaluation completion rate, score distribution, retention). Running an autonomous loop is natural there: the agent sees a dimension’s score stuck at a plateau and can propose its own mutation experiment. But when the same paradigm is applied to the paulkuo.tw personal IP site, the reward signal is too thin for the agent to pick up. “Did this article get a collaborator reaching out within three months” is the kind of signal the agent never receives, so the loop never closes. From SDTI (Saint Dominic Trade Institute) doing circular economy, to CircleFlow doing EPR compliance systems, to paulkuo.tw writing essays, each identity has a different reward structure. My theological training taught me to see “the essence of things,” one strand of which is “don’t smush different paradigms into the same vocabulary.” I broke my own commandment.

The seven-week silent stall was the embodiment of that structural error.

Karpathy wrote v1 for the community; I’m writing v2 for the five-party deliberation table

Retiring the mutation engine isn’t retiring the autoresearch inspiration. It’s a carrier migration.

Karpathy’s autoresearch v1 for the community: single agent, single thread, single domain, open-source on GitHub for the world.

My autoresearch v2 for the five-party deliberation table: multi-agent, multi-thread, multi-domain, Chat-Cowork-Codex-Code-Paul collaboration.

I wrote this reframing into an ADR, used git mv and git rm to retire ai-ready-opt/ and .github/workflows/ai-ready-opt.yml, wrote a retired-notice with an ADR link, and kept the git history so anyone wanting to re-enter has the full path. The whole Phase 1 was one atomic commit: 55 files, +405 / -225.

The previous article from 2026-03-22 had the thesis “sustainable optimization = distinguishing effective signal from noise.” I haven’t negated that piece — it still stands. But what stands is the range it covers: “given you’ve chosen the right paradigm, optimize the method.” This article is saying something one level up: if the paradigm itself is wrong, no amount of method optimization will converge.

The piece on AI Agents vs Agentic AI made the point that the evolution from task-tool to agentic-partner is in the framing, not the tech. Looking back now, five-party deliberation is the embodiment of agentic AI in a personal IP scenario. The piece on developing a realtime translator with one human plus AI recorded a collaboration experience that was an early prototype of the same paradigm — at the time I didn’t realize I was running distributed autoresearch; I just felt “working with AI goes more smoothly.”

Autonomy isn’t a mutation engine. Autonomy is discernment.

The seven-week stall wasn’t just an incident — it was a governance signal: the most dangerous system isn’t the one that fails spectacularly, it’s the one that silently fails while you believe it’s still running.

It reminded me: I thought I was running autoresearch. What I actually needed wasn’t an agent that automatically mutates the website — it was a deliberation table where different agents correct each other, surface each other’s blind spots, and let a human make the final judgment.