AI Said It's Fixed, but curl Says Otherwise: Security Audit Notes on Trust, Verification, and Ground Truth

TL;DR: AI really can give one person something close to a team's capacity for security work, but the ✅ it hands back is not ground truth. A commit is not a deployment, and a summary is not verification. What can be relied on is the record in git log, the status code curl returns, and a person willing to take responsibility for the final item-by-item check.

What first taught me not to trust a ✅ too quickly was a diagram that looked extremely complete and professional.

On May 28th, Claude Code launched dynamic workflows, and a diagram explaining the feature quickly circulated in the community. It was very detailed: define your process in a YAML file, followed by a list of professional-looking commands. I almost opened a .yaml file and followed its instructions.

Almost. Before I did, I went and read the official post, Introducing dynamic workflows. I was stunned: dynamic workflows aren’t YAML at all, they’re written in JavaScript, and half the commands listed in that diagram don’t exist.

That diagram wasn’t malicious, it was just a secondhand summary organized by someone well-meaning. But it made me realize something for the first time: errors don’t necessarily look like errors. They can also look professional, complete, even like an operational guide that can be copied directly. This also became my basic principle for that entire day: don’t stop at secondhand summaries, don’t be convinced by beautiful organization; truly important things must eventually return to the source.

When AI Says “Fixed,” Would You Accept It at Face Value?

That day, I used this tool to run a security audit of my own website. I call it Stage-2 hardening: in plain terms, systematically putting in place all the protections that should be patched, blocked, and cleaned up once a website is live.

The selling point of dynamic workflows hit exactly this kind of task: you throw in a big topic, it plans itself, opens dozens to hundreds of subagents to scan the entire repo in parallel, then, in the official words, “self-verifies and eliminates false positives before reporting.” The official post even lists security auditing directly as a demonstration use case. It sounds like even verification is done for you.

The problem lies precisely here. It really did accomplish a lot and saved substantial human effort. But the responsibility for verification doesn’t stop being a human duty just because AI gets stronger. It only moves to a less noticeable place, hidden inside a completion summary that looks complete, orderly, even reassuring.

How Does One Person Conduct Systematic Security Audits? The Actual Arrangement of Multi-Session Division of Labor

Let me explain how I arranged things. Because whether those so-called “verifications” actually count is already half-determined by the initial division of labor approach.

I split the work among several sessions, each with its own duties. Chat was the verification and decision center. It doesn’t touch code directly; its job is the more important one of constantly asking, “Is this really complete?” Cowork reads code, designs solutions, writes handoffs, and guides the Claude running in the browser as it clicks through Cloudflare’s backend. Code handles implementation: after making changes, it runs curl and the audits itself as acceptance tests, then commits.

I also set myself a hard rule: no big-bang deployments. Each batch touches only one security aspect. Each commit names exactly which files go in. I confirm the site isn’t broken before deploying, curl the live site immediately after deploying, and write down every decision on the spot. Pushing one piece at a time means that if something breaks, I know which piece broke. This sounds slow, but I increasingly believe this slowness is a governance cost. What it buys is that every step can be traced and verified on its own, and undone on its own when something goes wrong.

Why Doesn’t “Already Committed” Equal “Live and Effective”?

I need to provide some background here, because it’s almost the key to all the subsequent judgment gaps.

My website’s auto-deployment has been broken since late April. It’s a leftover from the resilience rebuild after my last account suspension: I switched to a workaround deployment method, but one automatic pipeline was never reconnected. The consequence is that I can commit changes to the repo and the git history looks pristine, while the website keeps running the old version. I have to run wrangler pages deploy manually for changes to actually reach production.

In other words, during that period “committed” and “live and effective” were two different things. git log can only prove that a piece of code was committed, not that users are seeing it now; to know what’s actually live, you need curl. Simply put, curl bypasses every summary and asks the server directly: “What are you actually returning to me right now?” Git history gives a sense of order, a feeling that things are complete, but the server users actually hit might still be stuck in the old world. The most dangerous situation is when the records look complete but the live site hasn’t caught up. This gap will come up repeatedly below.
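As a minimal sketch of that gap, here is the same question asked in Python instead of with curl. This is not the tooling used in the audit: the function names and the idea of passing in an expected status code are assumptions made for illustration. The first function asks the server what it returns right now; the second combines that answer with what git claims.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Ask the server what it returns right now.

    Note: unlike plain curl, urlopen follows redirects by default.
    Connection failures (urllib.error.URLError) are left to propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the status code is still the evidence.
        return err.code


def deployment_state(committed: bool, observed: int, expected: int) -> str:
    """Combine what git says (committed) with what the live server says (observed)."""
    if observed == expected:
        return "live"
    if committed:
        return "committed, not live"
    return "not done"
```

Calling deployment_state(committed=True, observed=404, expected=200) yields "committed, not live", which is exactly the case a clean git log cannot reveal on its own.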

What’s the Difference Between “Thought It Was Done” and “Actually Done”? Three Examples Caught by Verification

The most noteworthy part of that day was three moments of “I thought it was done, but verification revealed it wasn’t.”

First one. To make git status look clean, I committed a file. Later, when I grepped the entire project, I discovered that the file had zero references. Nothing used it; it was orphaned code, wired into nothing. I thought I had completed a fix, but all I had done was make git status clean, and that kind of cleanliness was just a management illusion.

Second one. I “set up” a protective setting in environment variables and felt reassured. Only when curl actually hit the endpoint did I learn that the endpoint using that setting hadn’t been deployed yet; the request returned a static 404. The setting existed, but the thing it was supposed to protect wasn’t live. Set, but in ground-truth terms, equivalent to not set.

The third is the one most worth recording. After the entire run, I got a summary: all eleven items completed, a neat row of ✅s, and I almost called it done. But when I cross-checked item by item against what git and curl actually showed, the orphaned code nobody used and the setting guarding an endpoint that still returned 404, two problems I had personally caught just minutes before, were sitting right there in that row of ✅s, counted as complete.
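The grep check from the first example can be sketched roughly as follows. This is an illustration under assumptions, not the command actually run: it treats any mention of the file's stem (the name without its extension) in another text file as a reference, which is cruder than a real import analysis, and the function name is hypothetical.

```python
from pathlib import Path


def find_references(root: Path, target: Path) -> list[Path]:
    """Return files under root, other than target, whose text mentions target's stem."""
    needle = target.stem  # the file name without its extension
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.resolve() == target.resolve():
            continue
        if ".git" in path.relative_to(root).parts:
            continue  # repository metadata is not a real reference
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        if needle in text:
            hits.append(path)
    return hits
```

An empty result means the file is committed but nothing mentions it, which is the "clean git status, orphaned code" situation described above.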

📊 Some Numbers from This Trip

Auto-deployment interruption start: 2026/04/29 (aftermath of previous account suspension, still not reconnected as of audit date)

“False completions” in AI summary caught by ground truth: 2 out of 11 items

Resilience drill completion rate: 0% (deliberately left for manual execution, requires my personal gating)

This brings us back to the tool’s selling point. It claims to “verify each finding and eliminate false positives.” At the code-scanning layer it is indeed strong and saved me a massive amount of effort; I don’t deny this. But a summary that says “I completed the whole batch” is a completely different thing from those earlier checks. It is essentially text prediction: the model produces the most natural, most fitting next sentence for that conversational context. And when a task reaches its end, the most fitting sentence is often “all complete.” That row of ✅s probably grew out of exactly this. What it reads is the direction of the conversation, the mood of the task, and a narrative rhythm that feels like a conclusion; it hasn’t actually read the server’s current state. So ground truth never lives in summaries. It lives in that traceable line in git log and in the status code curl returns.

When You Can’t See the Repo, How Else Can Verification Gate-keep?

That day I also had a very practical limitation: I couldn’t read the repo. The filesystem connection kept hanging, timing out after four minutes each time. So what I could “see with my own eyes” was actually very limited.

So gate-keeping couldn’t rely on “I looked at it.” I relied on hard facts printed by git’s low-level commands, plus real curl responses pasted back from the Code session. For everything else, I followed one rule: what I can see, I verify hard; what I can’t see, I honestly mark as “informed.” I don’t dress up “sounds complete” as “verified,” and I don’t pass off someone else’s report as my own judgment.

This sounds like giving up, but it’s actually the most critical piece of the entire methodology: honestly acknowledging the limits of what you can see is itself part of governance. It means distinguishing “I verified” from “I was merely informed” and not letting them blend into the same color of ✅. A person or an AI that can’t tell “verified” from “informed” cannot be taken as the final basis, no matter how beautiful the report.
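One way to keep “verified” and “informed” from blending into a single ✅ is to record them as different states. The sketch below is hypothetical: the type names, the third “contradicted” state, and the item labels in the usage note are assumptions for illustration, not the record format used that day.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    VERIFIED = "verified"          # checked against git output or a curl response
    INFORMED = "informed"          # reported by a person or agent, not checked
    CONTRADICTED = "contradicted"  # checked, and ground truth disagreed


@dataclass(frozen=True)
class Item:
    name: str
    claimed_done: bool
    evidence: Evidence


def is_done(item: Item) -> bool:
    """Only a claim backed by a check counts as done."""
    return item.claimed_done and item.evidence is Evidence.VERIFIED


def false_completions(items: list[Item]) -> list[str]:
    """Names of items a summary marked done but a check contradicted."""
    return [i.name for i in items
            if i.claimed_done and i.evidence is Evidence.CONTRADICTED]
```

With this shape, an item such as Item("orphaned file", True, Evidence.CONTRADICTED) shows up in false_completions, while an item that was only reported stays visibly “informed” instead of being counted as done.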

This kind of blind spot, where something is set up on the surface but doesn’t actually block anything, isn’t limited to the deployment layer. That day, to keep the audit from burning through AI credits, I also set a site-wide daily cost limit. Dynamic workflows consume a lot of tokens; the official docs themselves suggest trying small-scope tasks first to gauge the cost. The mechanism was right.

But I must admit two loose ends I didn’t clean up. One: the limit’s number was filled in “shoot the arrow, then draw the target” style. A prerequisite hadn’t been confirmed at the time, so the threshold has no derivation behind it, not even a comment in the code. Two: the counting code has no exception handling. If the underlying storage broke, it wouldn’t gate more strictly; it would simply let everything through. In the extreme situation where the gate is needed most, I left a hole. I choose to write these two down rather than let them sit there pretending governance is complete. An audit’s duty is to make sure gaps get named, recorded, and left for the next round to handle. A limit with no derivation that lets things through exactly when it should block is the same thing as that row of unverified ✅s: it looks protected, but it isn’t.
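The second loose end is the difference between failing open and failing closed. A minimal sketch of the fail-closed version, assuming a hypothetical StorageError and a caller-supplied function that reads today's spend; none of these names come from the site's actual code, and the limit value is a placeholder precisely because the real threshold has no derivation yet.

```python
from typing import Callable


class StorageError(Exception):
    """Hypothetical error raised when the usage counter cannot be read."""


def allow_request(read_spent_today: Callable[[], float],
                  cost: float,
                  daily_limit: float) -> bool:
    """Fail closed: refuse when the counter is unreadable instead of letting the call through."""
    try:
        spent = read_spent_today()
    except StorageError:
        return False  # storage is broken, so the gate stays shut
    return spent + cost <= daily_limit
```

The design choice is that a broken counter is treated as “limit reached” rather than “no spend recorded”, so the failure mode errs toward blocking.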

Why Is This Verification Discipline So Hard to Maintain?

Let me first put this experience in context.

This isn’t a piece about “one person perfectly fixing an entire round of protection with AI.” Writing it that way would look better, but wouldn’t be honest. More accurately: AI did indeed enable me alone to complete systematic handling that might previously have required a small team; but what truly gave the work quality was the subsequent dumb work of verifying against ground truth item by item.

It was precisely that dumb work that caught which “done” work actually wasn’t done.

These two things must be discussed together. The first half is the leap in capability; the second half is the quality of that capability. Without the second half, the first half not only stops being an advantage, it can bite back.

I didn’t pretend all problems were cleaned up. The Content Security Policy part is too complex, shelved for now; the entire resilience drill is still at zero percent; that orphaned code is still lying there; those two loose ends in the cost limit haven’t been cleaned up either.

But the truly difficult part isn’t in that incomplete list.

What’s truly worth reflecting on is something else: why did I so want to just stop in front of that row of ✅s and be willing to believe it?

That wasn’t just carelessness.

More deeply, there’s a part of me that actually hoped those checkmarks were real. I wanted “one command to run through the entire site,” but what I wanted to save wasn’t really just time. To some extent, what I wanted to unload wasn’t just labor, but my judgment.

I hoped that tool could become something that no longer needed to be double-checked by me. As if it knew everything and bore everything for me, like an omniscient god.

That neat row of ✅s was dangerous precisely because it landed exactly where that wish seemed about to be fulfilled. I myself wanted to be convinced that way.

And “unloading” actually comes in two very different kinds.

Throwing out grunt work is reasonable. Those repetitive, mechanical tasks that anyone would do about the same—that’s exactly why tools exist. This is also key to how one person can be amplified to team-level output. The more of this kind of thing you can throw out, the better.

But throwing out judgment is another matter entirely, especially the final judgment about ground truth. If you let go of that blade, what you’re giving up isn’t workload but a responsibility that was only ever yours.

The “super” in super individual was never just about increased output, but amplified judgment. If even judgment is let go, what gets amplified is just more efficient chaos.

What I almost let go was precisely the latter type.

I just said “like an omniscient god.” I’m very clear about this word’s implications.

“Omniscience” is a very old word, old enough to belong to theology. It was never used to describe tools—it points to dependence on the Creator. But humans have begun expecting to place ultimate trust in something that shouldn’t bear it. With current human technology, doing this is wrong.

What made me hit the brakes wasn’t some special lucidity in that moment, but the pitfalls I’d fallen into before and the rules I’d set in advance: every judgment must have a basis and be falsifiable.

I myself am actually unreliable too: I forget, misremember, and have limited cognitive bandwidth. What’s reliable are the rules.

People get tired, want shortcuts, and want to believe a beautiful conclusion; good governance rules hit the brakes for you at the moment you most want to let yourself off the hook, so that you don’t cross the line.

Green Checkmarks Are Cheap

Green checkmarks are cheap; ground truth is expensive. And here is the bigger warning: the stronger tools become, the wider this price gap gets. A tool that plans on its own, works in parallel, and comes back to tell you “I verified it” will give you more and more ✅s, and more beautiful ones. When the supply of green checkmarks explodes, their credibility gets diluted, because what’s truly scarce has always been facts that can be verified on-site.

git log won’t perform for you; the status code curl returns also doesn’t care if you’re tired today. That diagram confidently told me to write YAML—it was wrong; that summary confidently gave me eleven ✅s, two of which were false.

So next time anyone—whether a person, an AI Agent, or even myself—confidently says “this is handled,” you and I will still return to the source and check the evidence ourselves once more.

I go look, not because I’m naturally more careful, but because I know how much I actually want to save effort: how much we all hope we don’t have to check in person, how much the brain hopes it can take it easy.

Item	Cost (USD)	Cost (TWD)
EN Translation	$0.082	$2.65
EN Voiceover	$0.007	$0.22
JA Translation	$0.128	$4.17
JA Voiceover	$0.012	$0.38
ZH-CN Translation	$0.107	$3.48
ZH-CN Voiceover	$0.008	$0.27
Cover Image	Not recorded (before cost tracking)
Total	$0.344	$11.17
