Overview

MAKER (Massively decomposed Agentic processes with K-voting Error Reduction) achieves high reliability by sampling a worker agent multiple times and using “first-to-ahead-by-k” voting to select the consensus response. This pattern trades compute for accuracy, enabling cheap models to achieve reliability suitable for million-step tasks.
Based on “Solving a Million-Step LLM Task with Zero Errors” (arXiv:2511.09030)
Credit: Lucid Programmer (PR author)

Key Features

  • Statistical Consensus: Multiple samples voted to find agreement
  • First-to-ahead-by-k: Winner needs k-vote margin over alternatives
  • Red-Flagging: Discard suspicious responses before voting
  • Provable Bounds: Mathematical error guarantees based on per-step success rate
  • Cost-Effective: Cheap models with voting can replace expensive models

When to Use MAKER

Ideal Use Cases

Long chains of simple steps where rare errors compound:
  • ETL Pipelines: 1000s of row transformations - one bad parse = corrupted data
  • Code Migration: 1000s of file changes - one syntax error = build fails
  • Document Processing: 1000s of pages - one missed field = compliance failure
  • Data Validation: Millions of records - one wrong validation = bad data in prod
  • Automated Testing: 1000s of assertions - one false positive = wasted debugging
  • Cost Optimization: Cheap model + voting replaces expensive model
When NOT to use MAKER:
  • Single classifications (just use a good model - 95% accuracy is fine)
  • Creative/open-ended tasks (no “correct” answer to vote on)
  • Complex reasoning (need smarter model, not more samples)
  • Tasks where occasional errors are acceptable

The Math Behind MAKER

95% per-step accuracy over 100 steps:
  0.95^100 ≈ 0.6% overall success ❌

99.9% per-step accuracy (with MAKER) over 100 steps:
  0.999^100 ≈ 90% overall success ✅

For million-step tasks:
  Even 99% per-step accuracy gives essentially zero overall success (0.99^1,000,000 ≈ 0)
  MAKER pushes effective per-step reliability high enough (99.9999%+) for the chain to survive
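The compounding arithmetic above is easy to verify directly:

```python
# Sanity-check the compounding-error arithmetic above.

def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds independently."""
    return per_step ** steps

print(f"{chain_success(0.95, 100):.4f}")   # 0.0059 -> ~0.6%
print(f"{chain_success(0.999, 100):.4f}")  # 0.9048 -> ~90%
print(chain_success(0.99, 1_000_000))      # underflows to 0.0

# Per-step accuracy needed for even a 50/50 shot at a million clean steps:
print(f"{0.5 ** (1 / 1_000_000):.8f}")     # 0.99999931
```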

Basic Usage

import asyncio
from fast_agent import FastAgent

fast = FastAgent("MAKER Example")

# Define a classifier using a cheap model (Haiku)
@fast.agent(
    name="classifier",
    model="claude-3-haiku-20240307",
    instruction="""You are a customer support intent classifier.
Classify the customer message into exactly one of: COMPLAINT, QUESTION, REQUEST, FEEDBACK.
Respond with ONLY the single word classification, nothing else.

Examples:
- "This product is broken!" → COMPLAINT
- "How do I reset my password?" → QUESTION
- "Please cancel my subscription" → REQUEST
- "Just wanted to say I love the new feature" → FEEDBACK""",
)

# Wrap with MAKER for reliable, consistent classification
@fast.maker(
    name="reliable_classifier",
    worker="classifier",
    k=3,  # Require 3-vote margin for consensus
    max_samples=10,  # Max attempts before falling back to plurality
    match_strategy="normalized",  # Ignore case/whitespace differences
    red_flag_max_length=20,  # Discard verbose responses (should be one word)
)
async def main():
    async with fast.run() as agent:
        # Classify ambiguous customer messages
        result = await agent.reliable_classifier.send("I've been waiting for 3 days now.")
        print(result)

if __name__ == "__main__":
    asyncio.run(main())

Configuration Parameters

name (string, required)
  Name of the MAKER workflow

worker (string, required)
  Name of the worker agent to sample from

k (int, default: 3)
  Voting margin required (first-to-ahead-by-k). Higher k = more reliable but more samples needed. The paper recommends k ≥ 3 for high reliability.

max_samples (int, default: 50)
  Maximum samples before falling back to a plurality vote

match_strategy (MatchStrategy, default: "exact")
  How to compare responses for voting:
  • exact: Character-for-character match
  • normalized: Ignore case/whitespace
  • structured: Parse and compare JSON

match_fn (Callable[[str], str], optional)
  Custom normalization function (overrides match_strategy)

red_flag_max_length (int, optional)
  Discard responses longer than this many characters. Per the paper, overly long responses correlate with errors.

red_flag_validator (Callable[[str], bool], optional)
  Custom validator function. Return False to red-flag (discard) the response.

How First-to-Ahead-by-k Works

Example with k=3:
Sample 1: "COMPLAINT"     Votes: {COMPLAINT: 1}
Sample 2: "COMPLAINT"     Votes: {COMPLAINT: 2}
Sample 3: "QUESTION"      Votes: {COMPLAINT: 2, QUESTION: 1}
Sample 4: "COMPLAINT"     Votes: {COMPLAINT: 3, QUESTION: 1}
Sample 5: "COMPLAINT"     Votes: {COMPLAINT: 4, QUESTION: 1}
                           Leader margin: 4 - 1 = 3 ✅
                           Winner: "COMPLAINT" (converged)
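The trace above can be sketched as a short loop. This is an illustrative reimplementation of the voting rule, not fast-agent's internals; `sample` stands in for one call to the worker agent:

```python
from collections import Counter
from typing import Callable

def first_to_ahead_by_k(
    sample: Callable[[], str],
    k: int = 3,
    max_samples: int = 50,
) -> str:
    """Sample until one answer leads all others by k votes,
    falling back to a plurality vote at max_samples."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        votes[sample()] += 1
        ranked = votes.most_common(2)
        leader, leader_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if leader_count - runner_up >= k:
            return leader  # converged
    return votes.most_common(1)[0][0]  # plurality fallback

# Deterministic demo: replay the sample sequence from the table above.
responses = iter(["COMPLAINT", "COMPLAINT", "QUESTION", "COMPLAINT", "COMPLAINT"])
print(first_to_ahead_by_k(lambda: next(responses), k=3))  # COMPLAINT
```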

Match Strategies

Exact Match

match_strategy="exact"  # "Hello" ≠ "hello"

Normalized Match

match_strategy="normalized"  # "Hello World" = "hello  world" = "HELLO WORLD"

Structured Match

match_strategy="structured"  # {"a": 1, "b": 2} = {"b": 2, "a": 1}
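One plausible way to realize these three strategies is as normalization functions applied before votes are counted (a sketch; the actual comparison logic lives inside the workflow):

```python
import json

def normalize_exact(r: str) -> str:
    # Character-for-character: no transformation.
    return r

def normalize_whitespace_case(r: str) -> str:
    # Lowercase and collapse runs of whitespace.
    return " ".join(r.lower().split())

def normalize_structured(r: str) -> str:
    # Parse JSON and re-serialize with sorted keys so that
    # key order and spacing no longer matter.
    return json.dumps(json.loads(r), sort_keys=True, separators=(",", ":"))

print(normalize_whitespace_case("Hello  World") == normalize_whitespace_case("HELLO WORLD"))  # True
print(normalize_structured('{"a": 1, "b": 2}') == normalize_structured('{"b": 2, "a": 1}'))   # True
```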

Custom Match Function

def custom_normalizer(response: str) -> str:
    # Extract only digits
    return "".join(c for c in response if c.isdigit())

@fast.maker(
    name="number_extractor",
    worker="extractor",
    k=3,
    match_fn=custom_normalizer,  # Overrides match_strategy
)

Red-Flagging

Red-flagging improves effective success rate by discarding confused responses:

Length-Based Red-Flagging

@fast.maker(
    name="concise_classifier",
    worker="classifier",
    k=3,
    red_flag_max_length=20,  # Expect one-word answers
)

Custom Validation

def validate_classification(response: str) -> bool:
    valid_classes = {"COMPLAINT", "QUESTION", "REQUEST", "FEEDBACK"}
    return response.strip().upper() in valid_classes

@fast.maker(
    name="validated_classifier",
    worker="classifier",
    k=3,
    red_flag_validator=validate_classification,
)
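The gain from red-flagging follows from conditional probability. Under an idealized model where the worker is correct with probability p and a filter discards a fraction f of the incorrect responses (and never discards a correct one), accuracy among the accepted samples rises to p / (p + (1 - p)(1 - f)):

```python
def effective_accuracy(p: float, f: float) -> float:
    """Accuracy among accepted samples when a red-flag filter catches
    a fraction f of incorrect responses (idealized: no correct
    responses are discarded)."""
    return p / (p + (1 - p) * (1 - f))

print(f"{effective_accuracy(0.95, 0.0):.3f}")  # 0.950 (no filtering)
print(f"{effective_accuracy(0.95, 0.5):.3f}")  # 0.974
print(f"{effective_accuracy(0.95, 0.9):.3f}")  # 0.995
```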

Accessing Voting Results

result = await agent.reliable_classifier.send("Message to classify")

# Access detailed voting statistics
stats = agent.reliable_classifier.last_result
print(f"Winner: {stats.winner}")
print(f"Votes: {stats.votes}")  # e.g., {'COMPLAINT': 4, 'QUESTION': 1}
print(f"Total samples: {stats.total_samples}")
print(f"Discarded samples: {stats.discarded_samples}")
print(f"Winning margin: {stats.margin}")
print(f"Converged: {stats.converged}")  # True if k-margin achieved

Advanced Examples

Data Validation Pipeline

@fast.agent(
    "email_validator",
    model="haiku",
    instruction="""Validate if the input is a properly formatted email address.
Respond with only: VALID or INVALID""",
)
@fast.maker(
    name="reliable_email_validator",
    worker="email_validator",
    k=5,  # Very high reliability for data validation
    max_samples=15,
    match_strategy="normalized",
    red_flag_max_length=10,
)

async def validate_million_emails(emails: list[str]) -> list[bool]:
    results = []
    async with fast.run() as agent:
        for email in emails:
            result = await agent.reliable_email_validator.send(email)
            results.append(result == "VALID")
    return results

Code Syntax Checker

@fast.agent(
    "syntax_checker",
    model="gpt-3.5-turbo",  # Cheap model
    instruction="""Check if the Python code has syntax errors.
Respond with only: VALID or ERROR""",
)
@fast.maker(
    name="reliable_syntax_checker",
    worker="syntax_checker",
    k=3,
    max_samples=12,
    match_strategy="normalized",
)

async def check_codebase(files: list[str]) -> dict[str, bool]:
    results = {}
    async with fast.run() as agent:
        for file_path in files:
            with open(file_path) as f:
                code = f.read()
            result = await agent.reliable_syntax_checker.send(
                f"Check syntax: {code}"
            )
            results[file_path] = result == "VALID"
    return results

Structured Data Extraction

@fast.agent(
    "data_extractor",
    model="haiku",
    instruction="""Extract name, email, and phone from the text.
Respond with JSON: {"name": "...", "email": "...", "phone": "..."}""",
)
@fast.maker(
    name="reliable_extractor",
    worker="data_extractor",
    k=4,
    max_samples=20,
    match_strategy="structured",  # Compare parsed JSON
    red_flag_max_length=200,
)

import json

async def extract_contact_info(documents: list[str]) -> list[dict]:
    results = []
    async with fast.run() as agent:
        for doc in documents:
            result = await agent.reliable_extractor.send(doc)
            results.append(json.loads(result))
    return results


Cost vs. Reliability Tradeoff

Higher k

More Reliable
  • Higher confidence in consensus
  • Better error bounds
  • More samples needed
  • Higher cost

Lower k

Faster/Cheaper
  • Quicker convergence
  • Fewer samples on average
  • Lower cost
  • Less strict consensus
Recommended k values:
  • k=2: Low-stakes, cost-sensitive
  • k=3: Standard (good balance)
  • k=5: High-stakes, critical accuracy
  • k=7+: Mission-critical, zero-error tolerance

Performance Characteristics

Typical Convergence (k=3, 95% per-step accuracy):
- Average samples: 5-7
- Convergence rate: 90%+
- Effective accuracy: 99.9%+

Cost Example:
- Worker: Haiku @ $0.25/MTok
- Average 6 samples per task
- 1000 tasks = 6000 calls
- Still cheaper than 1000 calls to an expensive model
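The cost example above works out as follows. The expensive-model rate and per-call token count here are illustrative assumptions, not quoted prices; substitute your provider's real rates:

```python
# Back-of-envelope cost check (all prices per million tokens).
CHEAP_PER_MTOK = 0.25      # Haiku-class rate, from the example above
EXPENSIVE_PER_MTOK = 15.0  # assumed rate for a frontier-class model
TOKENS_PER_CALL = 500      # assumed prompt + completion size

tasks = 1000
samples_per_task = 6

cheap_cost = tasks * samples_per_task * TOKENS_PER_CALL / 1e6 * CHEAP_PER_MTOK
expensive_cost = tasks * TOKENS_PER_CALL / 1e6 * EXPENSIVE_PER_MTOK
print(f"MAKER (6x cheap model): ${cheap_cost:.2f}")   # $0.75
print(f"Single expensive call:  ${expensive_cost:.2f}")  # $7.50
```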

Best Practices

Simple Worker Tasks

MAKER works best with simple, deterministic tasks where there’s a “correct” answer

Red-Flag Aggressively

Discard obvious errors early to improve effective success rate

Appropriate k

Match k to your reliability needs and cost constraints

Monitor Convergence

Track convergence rates to tune k and max_samples

Debugging

Enable detailed logging to see voting progress:
import logging
logging.basicConfig(level=logging.DEBUG)

# You'll see:
# DEBUG: Sample 1: 1 votes for this response
# DEBUG: Sample 2: 2 votes for this response
# DEBUG: Sample 3: 1 votes for alternative response
# DEBUG: Sample 4: 3 votes for this response
# DEBUG: MAKER converged: 3 votes, margin 2, 4 samples

Use Cases by Industry

  • Finance: Transaction classification, fraud detection flags
  • Healthcare: Medical coding, diagnosis categorization
  • Legal: Document classification, clause identification
  • Manufacturing: Quality control checks, defect classification
  • E-commerce: Product categorization, review sentiment
  • DevOps: Log analysis, error classification

Comparison with Other Patterns

Feature               | MAKER                   | Evaluator-Optimizer | Chain       | Router
Error Reduction       | ✅ Statistical           | ✅ Feedback-driven   | ❌ None      | ❌ None
Reliability Guarantee | ✅ Mathematical          | ❌ Heuristic         | ❌ None      | ❌ None
Task Type             | Simple, deterministic   | Complex, creative   | Any         | Any
Cost Model            | Multiple samples        | Multiple iterations | Single pass | Single pass
Best For              | High-volume, zero-error | Quality content     | Pipelines   | Routing
  • Evaluator-Optimizer - Quality through feedback (different approach)
  • Parallel - Multiple agents without voting
  • Chain - Sequential processing where MAKER can be a step