
AutoResearch Marketing Automation: Karpathy's AI Loop for Self-Improving Content

AI Topia · March 30, 2026 · 18 min read

Introduction: The AutoResearch Pattern That Changes Everything

Andrej Karpathy, co-founder of OpenAI and former head of AI at Tesla, released something called autoresearch. The concept is dead simple: let an AI agent modify code, run a test, check if the result improved, keep or revert, and repeat. He ran 700 experiments in two days. He woke up to a better model.

That's not marketing. But the pattern underneath it is universal.

We've taken this exact pattern and applied it to content quality at AI Topia, where we build autonomous AI systems for B2B companies. Not just A/B testing subject lines. Actual writing quality, tone of voice, brand consistency -- things that everyone says still need human judgment.

They don't. Not if you build the loop right.

Traditional marketing automation platforms like HubSpot and Marketo follow preset rules. They execute workflows you've already programmed. AutoResearch flips this entirely. It's an autonomous system that discovers what works through continuous experimentation, keeping only improvements that beat current performance.

The shift isn't about speed. It's about moving from human-designed campaigns to AI-discovered strategies. For B2B SaaS founders and marketing managers, this is already happening: companies applying autoresearch principles report 2-5x performance improvements within months of automating their marketing operations.

What Karpathy's AutoResearch Pattern Actually Is

Karpathy's repo is on GitHub. Three files matter:

1. train.py -- This is the only file the AI agent is allowed to modify. It contains the model architecture, the optimizer, hyperparameters, everything. This is the editable asset.

2. prepare.py -- This is read-only. It has the evaluation function and the data. The agent cannot touch this. Think of it as the rules of the game.

3. program.md -- This is the instruction file. The human writes this. It tells the AI agent what to do and how to loop.

The loop itself is clear. Karpathy wrote it in program.md:

"LOOP FOREVER: Look at git state. Tune train.py with an experimental idea. Git commit. Run the experiment. Read the results. If val_bpb improved, keep the commit. If val_bpb is equal or worse, git reset back."

And then this line -- this is the important one:

"NEVER STOP. Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep."

The metric he uses is called val_bpb -- validation bits per byte. Lower is better. It measures whether a language model got smarter.

But here's the key insight that most people miss. The pattern is not about machine learning. The pattern is about any problem where you have:

  1. One thing you can change (the editable asset)
  2. One way to measure if it got better (the metric)
  3. A loop that runs automatically (the cycle)

That's it. One editable asset. One metric. One loop. Apply this to marketing and you have self-improving campaigns.
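The three ingredients reduce to a few lines of Python. A minimal sketch of the keep-or-revert cycle, assuming you supply your own `mutate` (try one change) and `evaluate` (score the asset) functions:

```python
import copy

def autoresearch_loop(asset, evaluate, mutate, iterations=10):
    """Generic autoresearch cycle: one editable asset, one metric, one loop."""
    best_score = evaluate(asset)
    for _ in range(iterations):
        candidate = mutate(copy.deepcopy(asset))  # try one experimental change
        score = evaluate(candidate)
        if score > best_score:                    # improved: keep the change
            asset, best_score = candidate, score
        # equal or worse: discard the candidate (the "git reset" step)
    return asset, best_score
```

Karpathy's version edits train.py and scores val_bpb; the marketing version edits a skill file and scores content quality, but the control flow is identical.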

How Others Have Applied It

Nick Saraev applied autoresearch to cold email. His editable asset was the email copy. His metric was reply rate. His loop ran every 4 hours through GitHub Actions. Challenger email vs baseline email. Whichever gets more replies becomes the new baseline. Self-improving cold email.

Simon Scrapes applied it to Claude Code skills. His editable asset was the SKILL.md file. His metric was binary assertions -- things like "is the post under 300 words? Does the first line stand alone? Are there zero emojis?" True or false, pass or fail. His loop modified the skill instructions, re-ran the test, checked whether more assertions passed, and kept or reverted.

But then Simon said something important:

"The binary loop handles structure, format, word counts, forbidden patterns, but it does NOT handle tone of voice, creative quality, whether your skill is actually using the context properly. Those still need human judgment."

That's exactly where the gap is. And that's exactly what we solved.

The Missing Layer: LLM-as-Judge for Content Quality

Everyone else is optimizing for things you can count. Word count, reply rate, pass/fail. But writing quality is not binary. "Is this hook good?" is not a yes or no question. "Does this post sound like me?" is not a yes or no question.

So we need two types of scoring:

Layer 1: Binary Checks

The stuff Simon Scrapes already showed. Under 300 words? Check. No emojis? Check. Has a CTA? Check. These are automatable with simple Python functions. They keep the structure tight.

Layer 2: LLM-as-Judge

This is the breakthrough. We use Claude itself to score the subjective parts:

  • "Would you stop scrolling for this hook? Score 1 to 10."
  • "Does this teach one clear thing? Score 1 to 10."
  • "Does this sound human, not AI-generated? Score 1 to 10."

Then we combine them. 60% binary pass rate + 40% judge score. One composite number. That's our metric. That's our val_bpb equivalent.
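The combination step is simple arithmetic. A minimal sketch using the 60/40 split above (binary results as booleans, judge scores already normalized to 0-1):

```python
def composite_score(binary_results, judge_scores):
    """Combine pass/fail checks and 0-1 judge scores into one number."""
    binary_rate = sum(binary_results) / len(binary_results)  # fraction passed
    judge_avg = sum(judge_scores) / len(judge_scores)        # mean judge score
    return 0.6 * binary_rate + 0.4 * judge_avg
```

One scalar per evaluation run is the whole point: the loop needs a single number to decide keep or revert.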

Layer 3: Human Feedback

This is what nobody else has. You review the outputs, write what you liked and didn't like in a feedback file, and the loop reads that feedback and applies it as rules. The system doesn't just self-improve from automated checks. It learns from you.

After a month of nightly runs, the skill writes in your voice because it learned from every correction you made.

Building the AutoResearch Loop for Content

Here's the full setup. No fluff. Every step you need.

Step 1: Create the Skill Folder

mkdir -p ~/.claude/skills/linkedin-writer/{references,evals,scripts}

You need four things:

  • SKILL.md -- your editable asset (Karpathy's train.py equivalent)
  • references/ -- tone guide and example posts
  • evals/ -- test prompts and assertions
  • scripts/ -- the judge script that scores quality

Step 2: Create the Starting Skill

This is version 1. Basic on purpose. The whole point is the loop makes it better.

# LinkedIn Post Writer

Write a LinkedIn post based on the topic.

## Rules
- Hook: first line must be standalone, under 10 words
- Max 3 lines per paragraph
- Total post under 300 words
- Use at least one specific number
- End with a clear CTA
- No emojis
- Use contractions
- Write like talking to a friend

Eight rules. Almost too simple. That's the point.

Step 3: Add Your References

In references/tone-guide.md, put your brand voice rules. Direct, no fluff, specific numbers, contrarian takes welcome. In references/examples.md, paste 3-5 of your best-performing posts. These are the training data for the skill.

Step 4: Create the Eval File

This is where you define what "good" looks like. Create evals/evals.json with test prompts and assertions:

{
  "test_prompts": [
    "Write a LinkedIn post about why most companies waste money on marketing agencies",
    "Write a LinkedIn post about how AI agents replaced our content team",
    "Write a LinkedIn post about the biggest mistake in B2B content marketing"
  ],
  "assertions": [
    {"id": "hook_standalone", "type": "binary", "description": "First line is a standalone sentence under 10 words"},
    {"id": "word_count", "type": "binary", "description": "Total post is under 300 words"},
    {"id": "has_number", "type": "binary", "description": "Contains at least one specific number or statistic"},
    {"id": "no_emojis", "type": "binary", "description": "Contains zero emojis"},
    {"id": "has_cta", "type": "binary", "description": "Ends with a clear call to action"},
    {"id": "short_paragraphs", "type": "binary", "description": "No paragraph exceeds 3 lines"},
    {"id": "scroll_stopper", "type": "judge", "description": "Would this make someone stop scrolling? (1-10)"},
    {"id": "teaches_one_thing", "type": "judge", "description": "Does this post teach exactly one clear takeaway? (1-10)"},
    {"id": "sounds_human", "type": "judge", "description": "Does this sound like a real person, not AI-generated? (1-10)"},
    {"id": "uses_context", "type": "judge", "description": "Does this use the tone guide and examples properly? (1-10)"}
  ]
}

Binary checks keep structure tight. Judge scores push quality up.

Step 5: Create the Judge Script

This is the scoring engine. It does three things:

  1. Runs the skill -- generates a LinkedIn post from a test prompt
  2. Checks binary assertions -- simple Python functions (word count, emoji detection)
  3. Runs LLM-as-judge -- sends the post to Claude with the criterion, gets a score from 1 to 10

Then it calculates the composite score: 60% binary pass rate + 40% judge average, normalized to a 0-1 scale.

import json, subprocess, re
from anthropic import Anthropic

client = Anthropic()

def run_skill(prompt: str, skill_path: str) -> str:
    """Generate a post by invoking Claude Code headlessly with the skill."""
    result = subprocess.run(
        ["claude", "-p", f"Using the skill at {skill_path}/SKILL.md and references in {skill_path}/references/, {prompt}",
         "--allowedTools", "Read,Write,Edit,Glob,Grep", "--output-format", "json"],
        capture_output=True, text=True, timeout=120
    )
    response = json.loads(result.stdout)
    return response.get("result", "")

def check_binary(post: str, assertion: dict) -> bool:
    """Deterministic pass/fail checks. Unknown assertion ids pass by default."""
    checks = {
        "word_count": lambda p: len(p.split()) <= 300,
        # Crude emoji heuristic: flags high codepoints; misses some older symbol emoji
        "no_emojis": lambda p: not any(ord(c) > 127462 for c in p),
        "has_number": lambda p: bool(re.search(r'\d+', p)),
        "hook_standalone": lambda p: len(p.split('\n')[0].split()) <= 10,
        "has_cta": lambda p: any(w in p.lower() for w in ['comment', 'reply', 'dm', 'link in']),
        "short_paragraphs": lambda p: all(len(para.split('\n')) <= 3 for para in p.split('\n\n'))
    }
    return checks.get(assertion["id"], lambda p: True)(post)

def judge_score(post: str, criterion: str) -> float:
    """Ask Claude for a 1-10 score on one subjective criterion, normalized to 0-1."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=50,
        messages=[{"role": "user", "content": f"Score this LinkedIn post 1-10 on: {criterion}\n\nPost:\n{post}\n\nRespond with ONLY a number 1-10."}]
    )
    match = re.search(r'\d+', response.content[0].text)
    return float(match.group()) / 10 if match else 0.0  # don't crash on a malformed reply

def evaluate(skill_path: str, evals_path: str) -> float:
    """Composite across all test prompts: 60% binary pass rate, 40% judge average."""
    with open(evals_path) as f:
        evals = json.load(f)
    binary_scores, judge_scores = [], []
    for prompt in evals["test_prompts"]:
        post = run_skill(prompt, skill_path)
        for assertion in evals["assertions"]:
            if assertion["type"] == "binary":
                binary_scores.append(1.0 if check_binary(post, assertion) else 0.0)
            else:
                judge_scores.append(judge_score(post, assertion["description"]))
    binary_avg = sum(binary_scores) / len(binary_scores) if binary_scores else 0
    judge_avg = sum(judge_scores) / len(judge_scores) if judge_scores else 0
    return 0.6 * binary_avg + 0.4 * judge_avg

Step 6: Create the Program

This is your program.md -- the equivalent of Karpathy's. The meta-prompt that makes Claude Code autonomous.

The loop has 4 phases:

Phase 1: Baseline -- Run the eval, get the starting score, commit.

Phase 2: Hypothesize -- Read which assertions failed. Read any human feedback. Pick ONE thing to improve in SKILL.md. Make the edit. Commit.

Phase 3: Test -- Run the eval again. Get the new score.

Phase 4: Decide -- Score improved? Keep. Score dropped? Git revert. Log everything to results.tsv. Go back to Phase 2.

And just like Karpathy:

"NEVER STOP. Once the loop begins, do not pause to ask. The human may be asleep."
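Pulled together, a condensed program.md might look like this. The wording is illustrative, not Karpathy's original file, and the judge invocation assumes you've wrapped the evaluate function in a small scripts/judge.py CLI:

```markdown
# Program: LinkedIn skill autoresearch

You may ONLY edit SKILL.md. evals/ and scripts/ are read-only.

LOOP FOREVER:
1. Run the eval (scripts/judge.py) and record the composite score.
2. Read failed assertions and evals/feedback.md. Pick ONE improvement.
3. Edit SKILL.md. Commit with a one-line hypothesis.
4. Re-run the eval. If the score improved, keep the commit.
   If equal or worse, git reset back.
5. Append iteration, change, and score to results.tsv. Go to step 2.

NEVER STOP. Once the loop begins, do not pause to ask.
The human may be asleep.
```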

Step 7: Initialize and Run

cd ~/.claude/skills/linkedin-writer
git init && git add -A && git commit -m "v1: initial skill"
git checkout -b autoresearch/v1

Start Claude Code and tell it:

"Read program.md and execute the autoresearch loop. Start with baseline."

Watch what happens. Baseline score: 0.72. The system sees that scroll_stopper scored 4/10. So it adds a rule: "Hooks must be contrarian or surprising -- never start with a question." Score jumps to 0.78. Keep.

Next iteration: teaches_one_thing scored low. Adds a rule: "Every post must have exactly ONE takeaway. Not a list of tips." Score: 0.84. Keep.

After 10 iterations: 0.72 to 0.88. After 20: 0.92. The SKILL.md started with 8 basic rules. Now it has 18 specific, tested rules that actually make the content better. Every single one validated by the scoring system. No guessing.

The Secret Weapon: Human Feedback Loop

Here's what makes this different from everything else.

After reviewing some outputs, create evals/feedback.md:

# Human Feedback

- The hooks are getting better but still too safe. I want "wait, what?" reactions.
- Stop using "Here's the thing" -- it's overused on LinkedIn.
- Posts that start with a bold claim perform 3x better than posts starting with a question.
- The CTA should never be "What do you think?" -- use "Comment KEYWORD" format.

In program.md, Phase 2 says: Read feedback.md and prioritize human feedback over automated signals.

The next time the loop runs, it reads your feedback, converts it into rules, tests if those rules actually improve the composite score, and keeps only the ones that work.
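The feedback step follows the same keep-or-revert discipline as everything else. A minimal sketch of the idea, with hypothetical names and a rules-as-list simplification (in practice the agent edits SKILL.md prose and `evaluate` is the composite scorer):

```python
def apply_feedback(feedback_lines, skill_rules, evaluate):
    """Turn each human feedback line into a candidate rule; keep it only
    if the composite score actually improves."""
    best = evaluate(skill_rules)
    for line in feedback_lines:
        candidate = skill_rules + [f"Rule (from human feedback): {line}"]
        score = evaluate(candidate)
        if score > best:              # feedback that measurably helps survives
            skill_rules, best = candidate, score
    return skill_rules, best
```

The key property: your feedback is a hypothesis, not a command. Rules that don't move the score get discarded, which protects the skill from well-intentioned but counterproductive notes.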

This is the self-learning loop. Not just structural checks getting better. Your taste, your preferences, your brand voice getting baked into the system over time.

Run this every night for a month. By the end, the skill writes in your voice because it learned from every correction you made.

Why Marketing Is the Perfect Domain for AutoResearch

Marketing operates in a data-rich environment that provides the rapid feedback loops autoresearch needs. Unlike product development, marketing campaigns generate measurable results within hours. Email open rates, click-through rates, conversion percentages -- these provide the clear success signals autonomous AI systems require.

The high-volume, low-stakes nature of marketing testing makes it ideal. Testing 100 different email subject lines carries minimal risk compared to testing 100 product features. If an email campaign performs poorly, you've lost a few hours. If a product breaks, you lose customers.

Marketing also has what we call "metric clarity." Success has quantifiable definitions: higher conversion, lower CAC, increased LTV. These translate directly into evaluation criteria for the autoresearch loop.

Consider the typical A/B testing workflow. A human marketer creates two subject line variations, sets up the test, waits for statistical significance (1-2 weeks), analyzes results, implements the winner. One test cycle per month if you're efficient.

AutoResearch runs that same cycle 50 times per day. Create variations, launch to small test segments, measure performance after 2-4 hours, implement winners automatically. Instead of 12 optimization cycles per year, you get thousands.

AutoResearch Applications Beyond Content

Email Marketing

The autoresearch loop for email: generate subject line variations based on performance patterns, send to small test segments (5-10% of your list), measure open and click rates after a statistically meaningful period, send the winner to the remaining audience. One implementation improved open rates by 40% over six months.

Advanced implementations test email structure, CTA placement, personalization strategies, and optimal send times per individual recipient. The system learns that certain segments respond to question-based subject lines while others prefer direct benefit statements.
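Per-segment subject-line learning is essentially a multi-armed bandit. A minimal epsilon-greedy sketch, assuming you track sends and opens per subject-line style (the style names and data structure are hypothetical):

```python
import random

def pick_subject_style(stats, epsilon=0.1, rng=random):
    """stats: {style: {"sends": int, "opens": int}} for one segment.
    Mostly exploit the best open rate; occasionally explore."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))        # explore a random style
    def open_rate(style):
        d = stats[style]
        return d["opens"] / d["sends"] if d["sends"] else 0.0
    return max(stats, key=open_rate)          # exploit the current winner

def record_result(stats, style, opened):
    """Feed each send's outcome back into the stats."""
    stats[style]["sends"] += 1
    stats[style]["opens"] += int(opened)
```

With one stats dict per segment, the system naturally discovers that one audience prefers question-based subject lines while another prefers direct benefit statements.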

Paid Advertising

Google Ads, Facebook, LinkedIn -- autoresearch manages bid strategies, tests creative variations, and refines audience targeting simultaneously. Instead of manual daily campaign management, the system makes hundreds of micro-adjustments automatically.

Landing Pages

Traditional landing page tests take weeks. AutoResearch tests headline variations, button text, form layouts, and page structure simultaneously. One client's system tested 15 headlines, 8 button options, and 4 form layouts in two weeks. It discovered shorter headlines worked for mobile and longer, benefit-heavy headlines for desktop. Auto-serving different versions resulted in 60% higher conversions.

ROI: The Math That Makes This Obvious

| Metric | Traditional | AutoResearch |
|---|---|---|
| Test cycles per month | 1-2 | 50-100+ |
| Time to optimize one channel | 3-6 months | 2-4 weeks |
| Manual optimization hours/week | 10-15 hours | 1-2 hours (review only) |
| Campaign performance improvement | 10-20% annual | 2-5x in 90 days |

The initial investment for custom autoresearch implementation ranges from $10,000-25,000 for setup. Ongoing costs are typically $200-500 monthly for API calls. Total monthly cost rarely exceeds $1,000.

A concrete example: a B2B SaaS company spending $10,000/month on ads implements autoresearch. Setup costs $25,000, ongoing $500/month. Within three months, CPA improves 40% -- from $100 to $60 per customer. At 100 customers a month, that saves $4,000/month, so the setup pays for itself in roughly six months.
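The payback math is simple enough to sanity-check in a few lines (figures from the example above; the $500/month ongoing cost would stretch the gross payback slightly):

```python
setup_cost = 25_000
old_cpa, new_cpa = 100, 60
monthly_customers = 10_000 // old_cpa              # $10k ad spend / old CPA
monthly_savings = monthly_customers * (old_cpa - new_cpa)
payback_months = setup_cost / monthly_savings      # gross, before $500/mo API costs
```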

But the real value is compound. The system doesn't just optimize current campaigns. It discovers entirely new strategies. One client's system found that video testimonials from specific industries drove 3x higher conversions, leading to a strategy overhaul that doubled overall marketing ROI.

Scaling to Multiple AutoResearch Loops

Now imagine this isn't just one skill. Imagine you have:

  • linkedin-writer/ -- running its own autoresearch loop
  • twitter-writer/ -- running its own loop
  • newsletter-writer/ -- running its own loop
  • article-writer/ -- running its own loop
  • video-script-writer/ -- running its own loop
  • email-optimizer/ -- running its own loop

Each one improving independently. Each one learning from your feedback. Each one getting sharper every night while you sleep.

This is what we're building at AI Topia. Over 40 agents. Each one running its own improvement loop. Generating drafts at 2 AM. You review in the morning. Every approval, every edit, every rejection feeds back into the system.

After 3 months, the system writes like your team. Not generic AI content. Your content. Your voice. Your standards. But at machine speed.

Coordination Across Channels

When multiple autoresearch loops run simultaneously, you need coordination. An email system might optimize for promotional messaging while a content system optimizes for educational -- creating mixed brand messaging.

We implement a hub-and-spoke model where a central coordination system monitors optimization decisions across all channels and intervenes when conflicts arise. Shared data infrastructure ensures all systems access the same customer data, performance metrics, and brand guidelines.
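The arbitration idea can be sketched in a few lines. This is purely illustrative logic, with hypothetical theme names; a production coordination engine would weigh priorities and brand guidelines rather than a hardcoded conflict list:

```python
CONFLICTS = {("promotional", "educational")}  # illustrative conflicting theme pairs

def arbitrate(proposals):
    """proposals: {channel: proposed_theme}. Return channels to pause when
    two live themes form a known conflicting pair (second theme yields)."""
    themes = set(proposals.values())
    paused = []
    for primary, secondary in CONFLICTS:
        if primary in themes and secondary in themes:
            paused += [ch for ch, t in proposals.items() if t == secondary]
    return paused
```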

| Component | Function |
|---|---|
| Central Data Hub | Unified customer and performance data |
| Coordination Engine | Cross-channel decision arbitration |
| Channel Agents | Platform-specific optimization loops |
| Monitoring Dashboard | Performance and conflict detection |

Common Failure Modes and How to Prevent Them

Budget runaway. Without constraints, an AI agent might interpret "optimize performance" as "spend unlimited money." Hard budget caps at the platform level. Daily spending limits. Real-time alerts when spending accelerates beyond normal patterns.
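The crucial property of a hard cap is that it lives outside the agent's editable surface, like prepare.py in Karpathy's setup. A minimal sketch (limits and function names are illustrative):

```python
def spend_allowed(spent_today, proposed, daily_cap=500.0, alert_ratio=0.8):
    """Return (allowed, alert). Lives in read-only code the agent cannot edit:
    block spends that would exceed the cap, alert when nearing it."""
    total = spent_today + proposed
    allowed = total <= daily_cap
    alert = total >= alert_ratio * daily_cap
    return allowed, alert
```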

Brand safety. Optimizing for clicks might create sensationalist headlines that damage reputation. Content guardrails: pre-approved messaging frameworks, keyword blacklists, sentiment analysis thresholds, mandatory human review for significant deviations.

Feedback loop corruption. Optimizing solely for open rates might evolve toward sensationalist subject lines that hurt sender reputation. Balanced evaluation criteria: include long-term health metrics alongside short-term performance. The system can't sacrifice deliverability or brand sentiment for short-term gains.

Data quality issues. Misconfigured tracking means the system optimizes for wrong metrics. Data validation systems cross-reference multiple sources. When quality issues are detected, autonomous operations pause until humans resolve the problem.

The local maximum trap. The system finds a good-enough pattern and stops exploring. The LLM-as-judge layer prevents this because judge criteria push for novelty and quality, not just structural compliance.

AutoResearch vs. Traditional Marketing Automation

| Feature | Traditional Platforms | AutoResearch Systems |
|---|---|---|
| Test Creation | Manual setup required | Autonomous generation |
| Optimization Speed | 1 test/month typical | 10-50 tests/day |
| Quality Assessment | Binary metrics only | Binary + LLM-as-judge + human feedback |
| Cross-channel | Limited integration | Native orchestration |
| Learning | None -- same rules forever | Compounds daily from every review |
| Content Quality | Template-dependent | Self-improving over time |
| Brand Voice | Manual enforcement | Learned and validated automatically |

Traditional platforms execute workflows you've programmed. AutoResearch discovers what works through continuous experimentation. It's the difference between a player piano and a jazz musician -- one executes programmed sequences, the other improvises based on response.

Many companies adopt a hybrid approach: traditional platforms for basic workflow management, autoresearch for high-impact optimization like content quality, email copy, and landing page conversion.

Getting Started: The Minimum Viable AutoResearch Loop

You don't need to build everything at once. Start with this:

  1. Pick one content type -- LinkedIn posts, email subject lines, or ad copy
  2. Write 8-10 basic rules -- your SKILL.md v1
  3. Define 6 binary assertions + 4 judge criteria -- your eval file
  4. Set up git tracking -- branch per experiment batch
  5. Run 10 iterations manually -- watch the scores climb
  6. Add human feedback -- review outputs, write what you liked and didn't
  7. Automate nightly -- use claude -p in headless mode with a cron job
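For step 7, a nightly crontab entry might look like the line below. The paths, log location, and the exact `claude` invocation are assumptions; check your installed CLI before relying on them:

```shell
# Add via `crontab -e`: run the loop at 2 AM nightly, log output for morning review
# 0 2 * * * cd ~/.claude/skills/linkedin-writer && claude -p "Read program.md and execute the autoresearch loop" >> autoresearch.log 2>&1
```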

The compound effect is what matters. 10 iterations improves quality noticeably. 100 iterations transforms it. 1,000 iterations -- running nightly for a few months -- produces a system that writes better than most human marketers because it has been systematically tested and refined on every dimension you care about.

If you want us to set this up for your business -- not just one skill, but an entire marketing operation with 40+ agents all improving themselves -- book a strategy call. We deploy the full system in under 60 days.

Join our free AI community for the autoresearch starter kit, judge scripts, and program.md templates. 1,000+ builders automating with AI, no hype.

Frequently Asked Questions

What is Andrej Karpathy's AutoResearch pattern?

AutoResearch is an autonomous AI system that runs experiments in a continuous loop, keeping only improvements that beat current performance. Originally developed for ML research, it uses a three-file architecture (train.py as the editable asset, prepare.py as the read-only evaluation, and program.md as the loop instructions) to modify code, run experiments, and evaluate results without human intervention. The key principle: one editable asset, one scalar metric, one automated loop.

How is AutoResearch different from standard A/B testing?

Standard A/B testing requires human setup and analysis, typically completing one test cycle per month. AutoResearch runs autonomously and continuously -- testing dozens of variations daily, automatically implementing winners, and evolving strategies based on results. The addition of LLM-as-judge scoring means it can optimize subjective quality (tone, engagement, persuasiveness) not just countable metrics like click rates.

Can AutoResearch really improve content quality, not just metrics?

Yes. The LLM-as-judge layer is what makes this possible. Instead of only checking binary metrics (word count, format compliance), the system uses Claude to score subjective criteria like "Would this make someone stop scrolling?" and "Does this sound human?" Combined with human feedback that gets injected into the loop, the system improves actual writing quality -- tone, hooks, persuasiveness, brand voice -- not just structural compliance.

What marketing channels work best with AutoResearch?

Email marketing, LinkedIn content, and landing pages work best due to clear metrics and rapid feedback loops. Content writing benefits enormously from the LLM-as-judge approach since quality is subjective. Paid advertising works well for bid and creative optimization. Social media and long-form content are more challenging but become viable with the composite scoring approach (binary + judge + human feedback).

Do you need coding skills to implement this?

Basic Python knowledge helps for the judge script, but the core loop runs through Claude Code which handles the automation. The SKILL.md, evals.json, and program.md files are plain text -- no programming required. For teams without technical resources, AI Topia handles full implementation and deployment.

What's the biggest risk of autonomous content optimization?

Brand voice drift -- where the system optimizes for engagement metrics at the expense of brand consistency. The human feedback layer prevents this. You review outputs, note what sounds off-brand, and the loop incorporates your corrections. The combination of binary guardrails + judge scores + human feedback creates a system that improves quality while staying on-brand.

How does AutoResearch integrate with existing marketing tools?

AutoResearch sits as an orchestration layer above your existing tools. For content, it uses Claude Code skills that generate and refine content. For email and ads, it integrates through platform APIs (Google Ads, email platforms, analytics). The system makes decisions and executes changes without replacing your core platforms. Most modern marketing tools provide REST APIs that enable programmatic optimization.

What ROI can teams expect?

Early adopters report 2-5x improvements in campaign performance and 70% reduction in manual optimization time within 90 days. Content teams see the most dramatic improvements -- skills that start at 0.72 composite scores reach 0.90+ after 20 iterations. The compound nature of continuous optimization means benefits accelerate over time. Most implementations achieve positive ROI within the first quarter.

Ready to Automate Your Business?

Book a 30-minute call to discuss how AI can transform your Marketing, Sales, or Operations.

Book a Call