Introducing Agent Windtunnel

The deploy gate
for AI agents.

Catch prompt regressions before they reach users. Record production traffic, replay it against your new prompt, and block bad deploys automatically.

windtunnel check
$ windtunnel check --fail-on-regression
Fetching 20 production interactions...
Testing challenger prompt v2...
LLM judge scoring responses...
🚫 DEPLOY BLOCKED: 80% regression
  16/20 interactions regressed · Exit code: 1
Trusted by teams building AI agents
Acme · Delphi · Cortex · Meridian · Helix

PRODUCT PREVIEW

See every regression before it ships

app.windtunnel-ai.vercel.app/dashboard

[Dashboard: Windtunnel Runs]
Total Runs: 12 · Blocked: 3 · Approved: 9 · Avg Regression: 18%
Status     Run Name                    Regression   Date
BLOCKED    Support Bot v2 vs v3        47%          Mar 17, 2026
APPROVED   Onboarding Agent v1 vs v2   8%           Mar 16, 2026
APPROVED   FAQ Bot v3 vs v4            12%          Mar 15, 2026
app.windtunnel-ai.vercel.app/runs/r_9xk2p
Runs / Support Bot v2 vs v3

🚫 DEPLOY BLOCKED: 47% regression exceeds 30% threshold
47% regression rate · 20 interactions · 9 regressed · 11 improved
app.windtunnel-ai.vercel.app/runs/r_7mn3q
Runs / Onboarding Agent v1 vs v2

✅ APPROVED TO DEPLOY: 8% regression is within the 30% threshold
8% regression rate · 20 interactions · 2 regressed · 18 improved

Before & After

Stop flying blind.

The Old Way
  • Ship prompt change to production
  • Watch user satisfaction drop
  • Get flooded with support tickets
  • Roll back manually 3 hours later
With Windtunnel
  • Run windtunnel check in CI
  • Replay 20 real user conversations
  • Deploy blocked automatically
  • Ship with confidence

Integration

Dead simple integration.

Three steps from zero to protected deploys.

01

Record

Two lines of code capture every real user interaction your agent handles in production.

wt.record(
  user_input=q,
  agent_output=r
)
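For context, here is a minimal sketch of where that record call sits in an agent's request path. The `WindtunnelRecorder` stub and `run_agent` placeholder below are illustrative stand-ins, not the real SDK:

```python
# Sketch only: WindtunnelRecorder is a stand-in stub for the real
# windtunnel-ai client, and run_agent() is a placeholder for your
# existing agent call.
from dataclasses import dataclass, field

@dataclass
class WindtunnelRecorder:
    interactions: list = field(default_factory=list)

    def record(self, user_input: str, agent_output: str) -> None:
        # The real SDK ships this pair to Windtunnel; here we just buffer it.
        self.interactions.append(
            {"user_input": user_input, "agent_output": agent_output}
        )

wt = WindtunnelRecorder()

def run_agent(q: str) -> str:
    return f"echo: {q}"  # placeholder for your agent

def handle_request(q: str) -> str:
    r = run_agent(q)
    wt.record(user_input=q, agent_output=r)  # the two lines from the snippet
    return r

handle_request("How do I reset my password?")
```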
02

Test

Replay production interactions through both old and new prompts simultaneously.

result = wt.run_windtunnel(
  baseline_prompt=v1,
  challenger_prompt=v2
)
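A rough sketch of what a run computes under the hood, with a toy scoring function standing in for the LLM judge (the real `run_windtunnel` replays recorded traffic through your model; the function names here are illustrative, not the SDK's):

```python
# Sketch: replay each recorded interaction under both prompts and
# count regressions. score() is a stand-in for the LLM judge.
def replay_and_compare(interactions, baseline_prompt, challenger_prompt, score):
    regressed = improved = 0
    for user_input in interactions:
        baseline_score = score(baseline_prompt, user_input)
        challenger_score = score(challenger_prompt, user_input)
        if challenger_score < baseline_score:
            regressed += 1
        elif challenger_score > baseline_score:
            improved += 1
    return {
        "interactions": len(interactions),
        "regressed": regressed,
        "improved": improved,
        "regression_rate": regressed / len(interactions),
    }

# Toy judge for the sketch: longer prompt + question scores higher.
def toy_score(prompt: str, user_input: str) -> int:
    return len(prompt) + len(user_input)

result = replay_and_compare(
    ["q1", "q2", "longer question 3"],
    baseline_prompt="You are a helpful support agent.",
    challenger_prompt="Be brief.",
    score=toy_score,
)
```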
03

Block

LLM-as-judge compares responses. Fails CI if regression exceeds your threshold.

- run: windtunnel check --fail-on-regression
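The gate itself amounts to comparing the measured regression rate against the threshold and failing CI with a nonzero exit code. A hypothetical sketch of that decision (the CLI's actual internals aren't shown here):

```python
# Sketch of the decision behind --fail-on-regression. A nonzero
# return becomes the process exit status, which fails the CI step.
def gate(regression_rate: float, threshold: float = 0.30) -> int:
    """Return the exit code: 1 blocks the deploy, 0 approves it."""
    if regression_rate > threshold:
        print(f"DEPLOY BLOCKED: {regression_rate:.0%} regression")
        return 1
    print(f"APPROVED: {regression_rate:.0%} regression within {threshold:.0%}")
    return 0

exit_code = gate(regression_rate=16 / 20)  # the run from the terminal demo
```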
< 2 min from commit to verdict · 30% regression threshold · 0 lines of config needed

CI/CD

Plug into GitHub Actions.

.github/workflows/windtunnel.yml
name: Windtunnel Check

on:
  pull_request:
    paths: ['prompts/**']

jobs:
  windtunnel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install windtunnel-ai
      - run: |
          windtunnel check \
            --baseline @prompts/baseline.txt \
            --challenger @prompts/challenger.txt \
            --fail-on-regression
        env:
          WINDTUNNEL_API_KEY: ${{ secrets.WINDTUNNEL_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Pricing

Free and open source.

Self-host the dashboard, use the SDK, run as many tests as you need. No credit card, no limits.

Get started

Start catching regressions today.

Free to start. No credit card. Works with any LLM framework.