Abstract (TL;DR) — Man-days measure attendance, not value creation. This article proposes the AI Collaboration Portfolio five-dimensional framework (Command / Delivery / Leverage / Quality / Influence), paired with a three-tier anti-fraud evidence architecture, providing enterprises with the first verifiable AI talent evaluation method. Complete case validation included. → Free assessment of your AI collaboration level
40 Minutes to Complete 15 Man-Days of Work: What Can Man-Days Still Measure?
If 40 minutes of deep thinking can create output equivalent to 15 traditional man-days, are our methods of measuring work already obsolete?
Three AIs running simultaneously. One analyzes market data and produces entry strategies, another restructures backend architecture based on the strategy and deploys prototypes, and a third integrates the outputs of the first two into an executable operational plan. The operator spent approximately 40 minutes: defining problems, allocating tasks, making real-time directional corrections, and ensuring final quality. When finished, there was a fully functional product prototype, a market entry strategy analysis, and an automated operational plan on the desk.
In the past, this would have required a cross-departmental team of five collaborating for three to four days. 15 to 20 man-days.
This isn’t the norm—it’s a glimpse of what happens when boundaries are pushed to extremes. But the trend it foreshadows is structural: when AI dramatically lowers the execution threshold for standardized tasks, output is no longer determined solely by time invested, but by how the operator defines problems, allocates tasks, and ensures quality. This isn’t about making humans disappear; it’s about elevating human roles from “operators” to “curators.”
So what exactly does “man-days” measure here? It measures time, not value. When measurement methods lag behind reality, we’re not measuring productivity—just whether people show up (attendance).
Anthropic’s Economic Index, released in January 2026, provides an empirical foundation. They analyzed over one million Claude conversations and found that instruction quality correlates with output quality at 0.92. The significance behind this number deserves reflection: in the AI era, “problem definition ability” directly equals “final performance.” Tasks requiring college-level comprehension see 12x AI acceleration; high-school-level tasks see only 9x.
AI doesn’t uniformly strengthen everyone—it exponentially amplifies capability gaps.
Two people’s resumes both say “familiar with AI tools,” but the same sentence can correspond to two completely different output capabilities. One person can ship a complete full-stack application over a weekend using AI, while another might only engage in multiple rounds of questioning, generate numerous summaries and NotebookLM notes, appear very busy, yet ultimately produce no truly usable results. The problem isn’t just the capability gap itself, but that we still have no systematic way to recognize, measure, or even name this gap.
When Hours No Longer Equal Output, How Do Enterprises Rebuild Performance Measurement Frameworks?
In 1975, computer scientist Fred Brooks published The Mythical Man-Month. His most enduring insight was debunking the “man-month” myth: people and time were never interchangeable variables. Each added person not only increases capacity but also increases communication, coordination, and integration costs, costs large enough to consume the imagined efficiency gains. So when a project is already behind schedule, adding more people usually isn’t a remedy; it makes things slower.
Brooks was talking about human teams. But the AI era validates his argument from the opposite direction: AI dramatically lowers coordination thresholds for standardized tasks. No onboarding needed, no alignment meetings, 24-hour continuous work, nearly zero switching costs. When coordination costs approach zero, parallel execution becomes truly efficient for the first time—but measurement systems built on the assumption that “coordination is expensive” simultaneously lose their meaning.
Some attempt to replace man-days with “attention duration.” This concept divides human cognitive investment in AI collaboration into four categories: initiation attention (translating vague requirements into precise instructions), supervision attention (checking AI output and making real-time corrections), integration attention (coordinating distributed outputs from multiple AIs), and incubation attention (letting unconscious background thinking settle and catalyze insights).
Such categorization is helpful, but “attention duration” is essentially still an input-side indicator. It tells us how much cognitive resource was invested but doesn’t tell us what was actually produced, the quality, or how much value was created compared to traditional methods.
The current problem: on the input side we have “attention duration”; at the macro level we have the Anthropic Economic Index estimating AI’s contribution to US labor productivity growth at roughly 1.0 percentage point per year. But at the level of individual output, the level where hiring decisions happen and freelancers are evaluated, measurement is almost completely blank.
The measurement vacuum isn’t at the margins—it’s at the center.
Everyone’s Testing AI Literacy—Why Isn’t Anyone Testing Individual Output?
This vacuum isn’t because no one cares. Over the past two years, governments, international organizations, and academia have produced numerous AI capability frameworks. But they’re all solving the same problem: do you know how to use AI? None are asking: what did you actually create with AI?
Anthropic’s collaboration with Rick Dakan and Joseph Feller produced the AI Fluency Framework, defining four core capabilities: Delegation, Description, Discernment, and Diligence. This is probably the closest thing we have to answering “what counts as being good with AI.” But it describes what good AI collaboration looks like—it doesn’t quantify what it produces.
Anthropic’s own Economic Index approaches from another angle, analyzing millions of real conversations to estimate AI’s impact on labor productivity. The data is extremely valuable, but it operates at the level of national economics, not at the level of your company’s next hiring decision.
The US Department of Labor released an AI Literacy Framework, the UK’s Alan Turing Institute launched the third version of its “AI Skills for Business Competency Framework,” UNESCO has AI competency frameworks for students and teachers, and academia has proposed scales such as Collaborative AI Literacy and Collaborative AI Metacognition.
Many frameworks exist, but they all answer the same type of question: do you understand AI, can you use it reasonably, do you know its limitations? In other words, they test literacy—whether your AI knowledge is sufficient. But what enterprises really need to know is performance: what did you actually create with AI, what’s the quality of results, and how much faster were you really compared to traditional methods?
Plot the existing frameworks on a 2×2 matrix (horizontal axis: literacy vs. performance; vertical axis: individual vs. macro), and you’ll see the left side crowded with frameworks (UNESCO, DOL, Turing, various academic scales), while the upper-right corner holds the Anthropic Economic Index. The lower-right corner, the individual-level, performance-based, verifiable quadrant, is empty.
Everyone is testing AI literacy. No one is testing individual AI productivity. This article proposes a framework to fill this gap. This continues the same thread as my previous piece on AI-era capability gaps: that piece discussed the existence of the gap, this one discusses how to quantify it.
AI Collaboration Portfolio Five-Dimensional Model: What Kind of People Does Your Team Really Need?
The AI Collaboration Portfolio is a five-dimensional model measuring what individuals actually produce through AI collaboration. The design principle: each dimension must have a theoretical foundation, quantifiable indicators, and verifiable third-party data sources.
But before looking at indicators, imagine you’re conducting an interview. The candidate sitting across from you says “I’m really good with AI”—how do you determine if this statement is true or false?
To distinguish whether someone “really knows how to use AI,” enterprises actually only need to look at five things: can they command AI correctly (Command), what substantial results have they actually delivered (Delivery), how much capacity can they amplify in the same time (Leverage), does the quality of output hold up (Quality), and have their methods been adopted by others (Influence).
Let me elaborate on each.
Command (25%) | Solving “People Who Can’t Ask Questions”
Do you have this type of person on your team: tools installed, accounts opened, but output still relies on manual work? The problem isn’t the tools—it’s command. Command measures: can you make AI do the right things?
Anthropic’s research shows extremely high correlation between instruction quality and output quality. In other words, people who know how to ask questions get better results from AI. High-command individuals don’t just write prompts—they establish reusable workflows, design automation pipelines, and translate vague requirements into precise tasks AI can execute.
Verification methods: number of reusable workflows and system instructions, automation pipelines, AI tool integration breadth, complexity of multi-step task decomposition.
Delivery (25%) | Solving “People Who Only Talk About AI”
Great at discussing AI trends in meetings, but when quarter-end review comes, what actually shipped? Delivery measures not what you tried, but what you launched.
Verification methods: frequency and volume of version records, number of deployed services or tools, published content count, complete zero-to-launch projects, code quality metrics.
Leverage (20%) | Solving “Adding People Doesn’t Help” Bottlenecks
This is the core of “40 minutes vs. 15 man-days.” Leverage directly answers enterprises’ biggest concern: when this person joins, what’s the team’s capacity multiplier? How many times did the same cognitive investment get amplified through AI collaboration?
Verification methods: ratio of actual cognitive investment vs traditional estimates, number of parallel projects, AI tool orchestration quantity, automation coverage rate.
Quality (15%) | Solving “High Volume but No Quality” Risk
In an era when AI can generate content en masse, quality gatekeepers are unprecedentedly important. Anthropic’s research found that when task success rates are factored in, productivity improvement estimates shrink by about one-third. In other words, one-third of “acceleration” is actually quality discounting. Quality measures: does your output withstand scrutiny?
Verification methods: user or traffic numbers, system stability, number of quality control mechanisms (automated testing, review processes, audit systems), external citations or shares.
Influence (15%) | Solving “Strong Performers Can’t Be Replicated” Dilemmas
No matter how strong an individual is, if their methods only exist in their own head, organizational value has a ceiling. Influence measures: have your methods been learned or adopted by others? This is the transition from individual contributor to organizational enabler, and the highest-leverage form of value creation.
Verification methods: open source project interactions (stars, forks, contributors), workflow template adoption rates, educational content reach, external citations of methodologies.
Relationships Between Dimensions
To be honest: these five dimensions aren’t completely independent. High Quality is often a precursor to Influence. A powerful automation pipeline (Leverage) is itself a form of Delivery. These five capabilities don’t exist in isolation—they’re mutually reinforcing value loops.
This is a design feature, not a flaw. Real value creation is inherently multi-dimensional and interwoven. Revenue, gross profit, and net profit in financial statements also have causal relationships, but we don’t only look at one number because each number reveals different aspects. The five dimensions are five observation lenses, not five independent variables. The goal is richer observation, not perfect decomposition.
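To make the weighting mechanics concrete, here is a minimal sketch in Python of how the five dimension scores and the percentage weights stated above could roll up into one composite number. The function and field names are illustrative assumptions, not the assessment tool’s actual implementation, and a single composite is only one way to read the five lenses together.

```python
# Minimal sketch (not the official implementation) of how the five
# dimension scores and the weights stated above could combine into one
# composite number. All names and sample scores are illustrative.

WEIGHTS = {
    "command": 0.25,
    "delivery": 0.25,
    "leverage": 0.20,
    "quality": 0.15,
    "influence": 0.15,
}

def portfolio_score(scores: dict) -> float:
    """Combine per-dimension scores (0-100) into a weighted composite."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: strong Command/Delivery/Leverage, weak Influence.
print(portfolio_score({
    "command": 85, "delivery": 80, "leverage": 90,
    "quality": 70, "influence": 40,
}))  # -> 75.75
```

Whether a single composite is shown at all, or only the five per-dimension scores, is a presentation choice; the point is that the weights are explicit and auditable rather than implicit in an interviewer’s impression.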
How to Prevent Portfolio from Becoming a Gaming System?
The most common criticism of any capability framework is that it will devolve into a system to be gamed.
This concern is completely valid. Goodhart’s Law long ago pointed out: when a measure becomes a target, it ceases to be a good measure. No quantification system is immune from this rule. The real question isn’t whether it can be gamed, but whether the cost of gaming is high enough to make manipulation unprofitable.
AI Collaboration Portfolio responds with a three-tier evidence architecture:
Tier 1: Automated extraction. After user authorization, the system directly pulls data from third-party platforms, including version records, repository structures, automation process files, website analytics, package downloads, community metrics. This data is recorded by independent platforms and cannot be tampered with by users. However many commits you have on GitHub is how many you have. This tier is the “hard evidence” baseline for Portfolio.
Tier 2: Structured self-assessment. Parts that cannot be automatically extracted (number of parallel projects, traditional man-day estimates, AI tool integration methods) use structured forms for guided completion. Each self-assessment field has an adjacent “evidence link” field. Those with attached links are marked as Evidenced, those without are marked as Self-reported. Anyone viewing your Portfolio can see these markings and judge credibility themselves.
Tier 3: AI verification. After user authorization, AI analyzes repository structure and code patterns to independently suggest scores for each dimension. AI suggested scores are displayed alongside user self-assessments. If you self-assess 90 points but AI suggests 50, this gap gets visually highlighted.
The three tiers work together: automated data can’t be faked, self-assessments have transparent evidence marking, AI provides independent reference. This doesn’t eliminate gaming—it raises gaming costs to unprofitable levels.
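As an illustration of how the three tiers could sit together in a single record, here is a minimal sketch with hypothetical field names: one claim carries its evidence marking (Automated / Evidenced / Self-reported) plus the AI-suggested score used to flag large gaps. This is a sketch of the idea, not the actual system’s data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class EvidenceTier(Enum):
    AUTOMATED = "automated"          # Tier 1: pulled from a third-party platform
    EVIDENCED = "evidenced"          # Tier 2: self-reported, with an evidence link
    SELF_REPORTED = "self-reported"  # Tier 2: self-reported, no link attached

@dataclass
class Claim:
    dimension: str                              # e.g. "delivery"
    description: str                            # e.g. "shipped three production services"
    self_score: float                           # 0-100 self-assessment
    evidence_url: Optional[str] = None          # Tier 2 marking depends on this
    ai_suggested_score: Optional[float] = None  # Tier 3: AI's independent estimate
    auto_extracted: bool = False                # Tier 1: set by the platform connector

    @property
    def tier(self) -> EvidenceTier:
        if self.auto_extracted:
            return EvidenceTier.AUTOMATED
        return EvidenceTier.EVIDENCED if self.evidence_url else EvidenceTier.SELF_REPORTED

    @property
    def flagged(self) -> bool:
        """Highlight large gaps between self-assessment and the AI suggestion."""
        return (self.ai_suggested_score is not None
                and abs(self.self_score - self.ai_suggested_score) >= 20)
```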
More crucially, this framework’s ultimate defense isn’t internal verification but external reality. Quality and Influence dimensions require evidence from outside user control: user adoption rates, community interactions, customer feedback, market results. Teams might internally collude to manipulate vote counts, but can’t force market acceptance or fake real adoption by open source communities.
This system isn’t a machine for automatic judgment—it’s more like “forensic accounting” for talent assessment: instead of just listening to how compelling a candidate’s interview stories are, we audit their digital evidence chain to see how they define problems, allocate tasks, correct course, and achieve delivery in AI collaboration.
Three Sharpest Criticisms and Our Responses
We subjected this framework to multi-model adversarial debate, running three rounds of stress testing. Here are the three most powerful attacks and our responses.
Q1: AI can help people forge entire evidence chains.
AI can forge maps but can’t forge memories of walking those roads. In framework-guided deep interviews, interviewers don’t ask “what did you do” but “how did you make decisions”: why abandon that method? How did you weigh costs? How did you handle AI errors? People who actually did the work can answer three layers of follow-up questions. Those acting from scripts collapse at the third layer.
Q2: Leverage’s baseline (man-days) is something you yourself said is obsolete—using it as denominator is circular reasoning.
Leverage shouldn’t be interpreted as a static multiplier against a fixed baseline. Its core value is the rate of change over time: your speed of improvement. An engineer who achieved 10x leverage with GPT-4 in 2024 and still only 10x with stronger models in 2026 exposes stagnating adaptability. AI tools evolve and baselines shift; what’s truly predictive is how fast you adapt to each paradigm shift, not the absolute value at any given moment.
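A minimal sketch of that reading, with purely hypothetical numbers (including the 8-hour man-day assumption): leverage is the traditional estimate divided by the actual cognitive investment, and what matters over time is whether the ratio keeps improving as the tools improve.

```python
def leverage(traditional_man_days: float, actual_hours: float,
             hours_per_man_day: float = 8.0) -> float:
    """Capacity multiplier: traditional estimate vs. actual cognitive investment."""
    return (traditional_man_days * hours_per_man_day) / actual_hours

# Hypothetical snapshots of the same engineer across tool generations:
lev_2024 = leverage(traditional_man_days=15, actual_hours=12)  # 10.0x with GPT-4-era tools
lev_2026 = leverage(traditional_man_days=15, actual_hours=12)  # still 10.0x with stronger models
print(lev_2026 / lev_2024)  # 1.0 -> no improvement: adaptability has stalled
```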
Q3: Framework focuses on individuals, but highest-value contributions in the AI era are often shared cognitive assets.
This is a real limitation of the current version. When someone builds a prompt instruction library the whole department uses, or designs a workflow that doubles everyone’s efficiency, that network effect cannot be fully captured by individual dimensions. The Influence dimension partially addresses it, but insufficiently. A team version of the framework, measuring how one person amplifies the entire system’s capacity rather than just their own, is the most important evolution direction for the next phase.
💡 More Criticisms and Responses: For three extended questions, “why not add adaptability and ethics dimensions,” “causal attribution between dimensions,” and “whether the author using himself as a case study represents survivor bias,” see the appendix at the end of this article.
Case Validation: How Does One Person Run Eight Projects Simultaneously?
To verify the framework’s practical operability, I ran my own data through it. All following data is publicly verifiable.
Rather than listing scores item by item, let me show three contrasts that best demonstrate the framework’s diagnostic power:
Contrast 1: Four-language content system. My personal website paulkuo.tw has accumulated 80+ articles, each available in Traditional Chinese, English, Japanese, and Simplified Chinese, generated by automated translation pipelines triggered by single commits. This reflects not just Delivery (how much content delivered) but Command (can you design workflows that make AI do this for you). Traditional approach? Four translators, one editor, one project manager.
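The actual pipeline isn’t published here; as a sketch of the pattern only, with a placeholder translate() call standing in for whatever model or API the real setup uses, a commit-triggered fan-out to multiple target languages might look like this:

```python
from pathlib import Path

TARGET_LANGS = ["en", "ja", "zh-Hans"]  # source articles written in Traditional Chinese

def translate(text: str, target_lang: str) -> str:
    """Placeholder for whichever LLM or translation API the real pipeline calls."""
    raise NotImplementedError

def fan_out(article_path: Path) -> None:
    """For one committed source article, emit one translated copy per target language."""
    source = article_path.read_text(encoding="utf-8")
    for lang in TARGET_LANGS:
        out_path = article_path.with_name(f"{article_path.stem}.{lang}{article_path.suffix}")
        out_path.write_text(translate(source, lang), encoding="utf-8")

# In practice this would run inside a CI job (e.g. a GitHub Actions workflow)
# triggered by the commit that adds the source-language article.
```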
Contrast 2: Cross-domain simultaneous operations. I simultaneously maintain eight or more active projects spanning completely different domains: software development, content publishing, e-commerce operations, and enterprise consulting. Timing App’s time-tracking data feeds into a real-time dashboard on the website, providing the actual cognitive hours invested. This directly embodies Leverage (capacity multiplier): traditionally, this would require a team of five to eight people.
Contrast 3: What the weakest dimension reveals. My Influence score is the lowest of all the dimensions: open source projects exist but community interaction is limited, social media is active but not yet at scale, and public talks and external citations are close to zero. The framework’s diagnostic value is most obvious here: it doesn’t just validate strengths, it tells you with uncomfortable precision what to do next.
The more crucial observation isn’t the scores themselves but the nature of evidence. Every claim above can be independently verified through public URLs, GitHub repos, or website analytics APIs. None rely solely on self-declaration. This is the fundamental difference between this framework and all questionnaire-based assessments: evidence exists outside the system.
A framework designer using himself as a case study obviously has selection bias. But this case’s value isn’t statistical—it’s engineering: a complete, inspectable, replicable blueprint anyone can run with their own data. Your team can too. Think of it as open source release, not clinical trial.
The Cost of Not Measuring AI Productivity: Capability Gaps Are Invisibly Expanding
Why does this matter? Because the cost of not measuring is greater than measuring wrong.
The Anthropic Economic Index documents a deskilling effect: AI preferentially takes over high-skill components of work, leaving low-skill parts behind. If we can’t identify who truly excels at AI collaboration, high-capability individuals will be systematically undervalued while those who interview well but produce poorly will be overvalued.
Productivity gaps are real and expanding. College-level tasks get 12x AI acceleration, high school level only 9x. This isn’t gradual small differences accumulating over time—it’s structural divergence. And it’s currently completely invisible to labor markets because no measurement system can reveal it.
This framework also has boundaries. The three most important limitations deserve honest acknowledgment:
Individual vs System Value. The framework measures individual output, but truly high-value contributions in the AI era are often shared cognitive assets: a prompt library the whole department uses, an automation pipeline that accelerates everyone. How to measure one person’s amplification effect on system capacity, not just their own output, is the most important extension direction for the next phase.
Privacy Boundaries of Digital Traces. The first tier of the three-tier evidence architecture is automated extraction, but the line between “automated” and “surveillance” depends on who controls the data. If it’s employee voluntary authorization with Portfolio owned by individuals, it’s an empowerment tool; if it’s enterprise mandatory deployment with managers unilaterally accessing, it becomes digital surveillance. Technical architecture must lock data ownership to individuals from the start, or Portfolio becomes a new-era time clock.
B2B Pricing Commercial Reality. The framework can change how enterprises evaluate talent internally, but it can’t change how markets buy. Most client procurement processes remain locked to man-day price quotes. You can use Portfolio to prove you completed 15 man-days of work in 40 minutes, but client systems have no field for this non-linear output. Internal enterprise talent assessment can lead the way, but for this framework to truly replace man-day pricing, the pricing conventions of the entire business ecosystem have to change. That isn’t a problem one tool can solve; it’s an industry consensus that takes time to form.
💡 More Limitation Discussion: For issues regarding how adaptability is captured, cross-organizational standardization difficulty, and fairness and accessibility, see the appendix at the end of this article.
These aren’t flaws to hide but research directions for next iteration rounds.
But don’t forget: the choice we face isn’t “this framework” vs. “a more perfect framework” but “having a framework” vs. “having nothing.” What does a world without frameworks look like? Exactly like now: enterprises screening people by resume keywords and making decisions based on interview impressions, yet completely unable to distinguish “knows how to use AI” from “can use AI to single-handedly carry an entire team,” despite the massive gap between the two.
What gets quantified gets valued. What doesn’t becomes invisible.
AI Collaboration Portfolio is an attempt to make the invisible visible. It’s an imperfect map—but the alternative isn’t a better map, it’s running blindfolded in new territory.
The current version targets individual users. An enterprise version supporting team dashboards, cross-departmental benchmarking, and hiring process integration is the next development phase. If your organization is thinking about how to evaluate AI talent, you’re welcome to start with the individual version to experience the framework’s logic.
→ Free assessment of your AI collaboration level: paulkuo.tw/tools/ai-collab-portfolio/
Appendix: More Limitation Discussion
How to see adaptability? Adaptability isn’t a sixth dimension—it’s the rate of improvement across the five dimensions over time. When someone’s Command, Delivery, and Leverage scores continuously improve between AI tool iterations, they have high adaptability. The framework indirectly captures this through score trajectories over time; future versions can incorporate self-iteration speed as a first-class signal.
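A minimal sketch of that trajectory reading, with hypothetical quarterly snapshots: adaptability shows up as the per-period slope of each dimension’s score, not as a sixth static number.

```python
def slope(scores: list) -> float:
    """Average change per period across consecutive snapshots."""
    return (scores[-1] - scores[0]) / (len(scores) - 1)

# Hypothetical quarterly snapshots for one person (0-100 per dimension).
history = {
    "command":  [60, 68, 75, 82],
    "delivery": [55, 60, 70, 78],
    "leverage": [50, 52, 53, 54],  # barely moving between tool generations
}
adaptability = {dim: slope(vals) for dim, vals in history.items()}
# command +7.3/quarter, delivery +7.7/quarter, leverage +1.3/quarter:
# the trajectory, not any single snapshot, is the adaptability signal.
```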
Cross-organizational standardization difficulty. Command in software engineers can be verified through version records, but how to measure it in marketing directors or designers? Different industries, functions, and organizational sizes may have vastly different dimension definitions. The five dimensions provide observational structure, but specific quantification indicators and weight configurations inevitably need localized calibration by industry and function.
Fairness and accessibility. Those with access to latest AI tools naturally have advantages in this framework. If the measurement system itself amplifies digital divides rather than revealing true capabilities, it fails.
Appendix: More Criticisms and Responses
Q4: Why not add “Adaptability” and “Ethics” dimensions?
Adaptability isn’t an independent dimension; it’s the rate of improvement across the five dimensions over time. When someone’s Command, Delivery, and Leverage scores continuously improve between AI tool iterations, they have high adaptability. The same logic applies to “reflection loops,” the meta-skill of using AI to analyze and optimize one’s own work patterns: it shows up as upward trajectories across the five dimensions over time, not as a static score in a sixth column. Ethics is a baseline constraint, not a performance indicator; ethics violations should result in direct disqualification, not a 15-point deduction.
Q5: Dimensions have causal relationships and cannot be precisely attributed.
Acknowledged. The framework pursues richer observation, not mathematically precise segmentation. Where attribution is unclear, the three-tier evidence architecture provides raw data for evaluators to judge themselves. This is exactly the framework’s design intent: it doesn’t replace human judgment—it gives human judgment better materials.
Q6: Author using himself as case study is survivor bias.
Methodologically, this criticism is completely valid. A framework designer naturally picks evidence favorable to themselves. The response isn’t to deny the bias but to make it structurally unimportant: every claim in the case study can be independently verified through public URLs, GitHub repos, or third-party analytics APIs. This case’s value isn’t statistical but engineering: it provides a complete, inspectable, replicable blueprint anyone can run with their own data. Initial versions inevitably reflect their creator’s context, but their value depends on whether the community finds them worth forking, attacking, and improving.
References
- Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.
- Dakan, R. & Feller, J. (2025). “Framework for AI Fluency.” Ringling College of Art and Design / University College Cork. Version 1.5.
- Anthropic. (2026, January). “Anthropic Economic Index report: Economic primitives.” anthropic.com/research.
- Anthropic. (2026, March). “Anthropic Economic Index report: Learning curves.” anthropic.com/research.
- Anthropic. (2026, March). “Estimating AI productivity gains from Claude conversations.” anthropic.com/research.
- Anthropic. (2026, March). “Labor market impacts of AI: A new measure and early evidence.” anthropic.com/research.
- US Department of Labor. (2026, February). “AI Literacy Framework.”
- Alan Turing Institute / UK DSIT. (2025). “AI Skills for Business Competency Framework.” Version 3.
- UNESCO. (2026). “AI competency framework for students.” unesdoc.unesco.org.
- Schleiger, E. et al. (2025). “Generative AI in Human-AI Collaboration: Validation of the Collaborative AI Literacy and Collaborative AI Metacognition Scales.” Interacting with Computers. Taylor & Francis.
- Chee, K.N. et al. (2025). “A Competency Framework for AI Literacy.” British Journal of Educational Technology. Wiley.