Introducing Agent Windtunnel

The deploy gate
for AI agents.

Catch prompt regressions before they reach users. Record production traffic, replay it against your new prompt, and block bad deploys automatically.

windtunnel check
$ windtunnel check --fail-on-regression
Fetching 20 production interactions...
Testing challenger prompt v2...
LLM judge scoring responses...
🚫 DEPLOY BLOCKED: 80% regression
  16/20 interactions regressed · Exit code: 1
Trusted by teams building AI agents
Acme · Delphi · Cortex · Meridian · Helix

PRODUCT PREVIEW

See every regression before it ships

app.windtunnel-ai.vercel.app/dashboard

[Dashboard: Windtunnel Runs]
Total Runs: 12 · Blocked: 3 · Approved: 9 · Avg Regression: 18%
Status     Run Name                    Regression   Date
BLOCKED    Support Bot v2 vs v3        47%          Mar 17, 2026
APPROVED   Onboarding Agent v1 vs v2   8%           Mar 16, 2026
APPROVED   FAQ Bot v3 vs v4            12%          Mar 15, 2026
app.windtunnel-ai.vercel.app/runs/r_9xk2p
Runs / Support Bot v2 vs v3

🚫 DEPLOY BLOCKED: 47% regression exceeds 30% threshold
47% regression rate · 20 interactions · 9 regressed · 11 improved
app.windtunnel-ai.vercel.app/runs/r_7mn3q
Runs / Onboarding Agent v1 vs v2

✅ APPROVED TO DEPLOY: 8% regression is within the 30% threshold
8% regression rate · 20 interactions · 2 regressed · 18 improved

Before & After

Stop flying blind.

The Old Way
  • Ship prompt change to production
  • Watch user satisfaction drop
  • Get flooded with support tickets
  • Roll back manually 3 hours later
With Windtunnel
  • Run windtunnel check in CI
  • Replay 20 real user conversations
  • Deploy blocked automatically
  • Ship with confidence

Integration

Dead simple integration.

Three steps from zero to protected deploys.

01

Record

Two lines of code capture every real user interaction your agent handles in production.

wt.record(
  user_input=q,
  agent_output=r
)
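For context, here is a minimal sketch of where that record call sits in an agent's request path. The `WindtunnelRecorder` stub and `run_agent` placeholder below are illustrative stand-ins, not the real SDK:

```python
# Sketch only: WindtunnelRecorder is a stand-in stub for the real
# windtunnel-ai client, and run_agent() is a placeholder for your
# existing agent call.
from dataclasses import dataclass, field

@dataclass
class WindtunnelRecorder:
    interactions: list = field(default_factory=list)

    def record(self, user_input: str, agent_output: str) -> None:
        # The real SDK ships this pair to Windtunnel; here we just buffer it.
        self.interactions.append(
            {"user_input": user_input, "agent_output": agent_output}
        )

wt = WindtunnelRecorder()

def run_agent(q: str) -> str:
    return f"echo: {q}"  # placeholder for your agent

def handle_request(q: str) -> str:
    r = run_agent(q)
    wt.record(user_input=q, agent_output=r)  # the two lines from the snippet
    return r

handle_request("How do I reset my password?")
```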
02

Test

Replay production interactions through both old and new prompts simultaneously.

result = wt.run_windtunnel(
  baseline_prompt=v1,
  challenger_prompt=v2
)
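A rough sketch of what a run computes under the hood, with a toy scoring function standing in for the LLM judge (the real `run_windtunnel` replays recorded traffic through your model; the function names here are illustrative, not the SDK's):

```python
# Sketch: replay each recorded interaction under both prompts and
# count regressions. score() is a stand-in for the LLM judge.
def replay_and_compare(interactions, baseline_prompt, challenger_prompt, score):
    regressed = improved = 0
    for user_input in interactions:
        baseline_score = score(baseline_prompt, user_input)
        challenger_score = score(challenger_prompt, user_input)
        if challenger_score < baseline_score:
            regressed += 1
        elif challenger_score > baseline_score:
            improved += 1
    return {
        "interactions": len(interactions),
        "regressed": regressed,
        "improved": improved,
        "regression_rate": regressed / len(interactions),
    }

# Toy judge for the sketch: longer prompt + question scores higher.
def toy_score(prompt: str, user_input: str) -> int:
    return len(prompt) + len(user_input)

result = replay_and_compare(
    ["q1", "q2", "longer question 3"],
    baseline_prompt="You are a helpful support agent.",
    challenger_prompt="Be brief.",
    score=toy_score,
)
```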
03

Block

LLM-as-judge compares responses. Fails CI if regression exceeds your threshold.

- run: windtunnel check --fail-on-regression
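The gate itself amounts to comparing the measured regression rate against the threshold and failing CI with a nonzero exit code. A hypothetical sketch of that decision (the CLI's actual internals aren't shown here):

```python
# Sketch of the decision behind --fail-on-regression. A nonzero
# return becomes the process exit status, which fails the CI step.
def gate(regression_rate: float, threshold: float = 0.30) -> int:
    """Return the exit code: 1 blocks the deploy, 0 approves it."""
    if regression_rate > threshold:
        print(f"DEPLOY BLOCKED: {regression_rate:.0%} regression")
        return 1
    print(f"APPROVED: {regression_rate:.0%} regression within {threshold:.0%}")
    return 0

exit_code = gate(regression_rate=16 / 20)  # the run from the terminal demo
```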
< 2 min from commit to verdict · 30% regression threshold · 0 lines of config needed

CI/CD

Plug into GitHub Actions.

.github/workflows/windtunnel.yml
name: Windtunnel Check

on:
  pull_request:
    paths: ['prompts/**']

jobs:
  windtunnel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install windtunnel-ai
      - run: |
          windtunnel check \
            --baseline @prompts/baseline.txt \
            --challenger @prompts/challenger.txt \
            --fail-on-regression
        env:
          WINDTUNNEL_API_KEY: ${{ secrets.WINDTUNNEL_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Pricing

Free and open source.

Self-host the dashboard, use the SDK, run as many tests as you need. No credit card, no limits.

Get started

Start catching regressions today.

Free to start. No credit card. Works with any LLM framework.