AI is an Unstable API. Treat it like one.

Aditya Pandey
Jan 16, 2026 · 8 min read

AI Behavior Drift

Here is a scenario every AI engineer knows, though few admit to it in public:

On Monday, you write a perfect prompt for your refund bot. It is concise, polite, and handles edge cases beautifully. You ship it. On Tuesday, OpenAI updates gpt-4-turbo with a minor "steerability" patch. Or maybe you tweak the system prompt to handle a new edge case. On Wednesday, a customer support ticket comes in. Your bot has started replying to refund requests with 500-word philosophical essays about the nature of commerce.

Your code didn't change. Your logic didn't break. But your product is broken.

This is the fundamental crisis of modern AI Engineering: We are building probabilistic software (AI) using deterministic tools (Unit Tests).

If a function add(a, b) returns 5 today, it will return 5 forever. If prompt("Refund me") returns "Sure" today, it might return "As an AI language model..." tomorrow.

We are trying to manage this chaos with "Vibe Checks": manually running the prompt a few times and saying, "Yeah, looks good." This is not engineering; it is gambling.

I realized we needed a new primitive. We don't need another complex evaluation dashboard. We need Git for Behavior.

That is why I built SafeStar.

The Protocol: Snapshot Testing for Intelligence

In traditional software, we have Snapshot Testing (popularized by Jest). You render a UI component, save the HTML to a file, and commit it. If a future commit changes that HTML, the test fails. You have to explicitly approve the change.

Why aren't we doing this for AI?

The industry is obsessed with "Evaluation": giving the AI a score out of 100 for correctness. But in 90% of business cases, you don't care whether the answer is "95% correct." You care whether it changed.

Drift is the enemy, not inaccuracy.

If your bot was "Good Enough" on Monday, your only job is to ensure it is still "Good Enough" on Tuesday.

This requires a new mental model, which I call the Behavior Lock Protocol (sketched in code right after this list):

  1. Define a reproducible scenario (Prompt + Inputs).
  2. Snapshot the outputs when they are "Good."
  3. Diff every future run against that snapshot.
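
The whole protocol fits in a few lines of Python. This is a minimal sketch, not SafeStar's implementation, and it assumes your agent reads the prompt from stdin and prints its reply to stdout:

# behavior_lock.py (a minimal sketch of the protocol, not SafeStar's code)
import json
import subprocess

def run_scenario(cmd: str, prompt: str, runs: int = 5) -> list[str]:
    # Step 1: a reproducible scenario (fixed prompt, fixed command, N runs).
    return [
        subprocess.run(cmd, input=prompt, shell=True,
                       capture_output=True, text=True).stdout.strip()
        for _ in range(runs)
    ]

def snapshot(outputs: list[str], path: str = "baseline.json") -> None:
    # Step 2: freeze the outputs while they are "Good" and commit the file.
    with open(path, "w") as f:
        json.dump(outputs, f, indent=2)

def load_baseline(path: str = "baseline.json") -> list[str]:
    # Step 3: every future run gets diffed against this frozen snapshot.
    with open(path) as f:
        return json.load(f)

The only hard part is the diff itself: exact string equality is useless for probabilistic output, which is what the Diff Engine described below is for.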

Introducing SafeStar

SafeStar is not a SaaS. It is not a startup trying to sell you credits. It is an open-source, local-first CLI tool that implements this protocol.

It answers one question: "Did my AI behave differently than before?"

How it Works

SafeStar treats your AI simply as an executable command. It doesn't care if you use Python, Node, LangChain, or a bash script.

You define a scenario in YAML:

# scenarios/refund_bot.yaml
name: refund_flow
prompt: "I want a refund. Your product broke."
exec: "python3 agent.py"  # Your actual code
runs: 5                   # Run it 5 times to catch randomness
checks:
  max_length: 200         # Guardrails
  must_not_contain:
    - "I am just an AI"

You run it once to establish a baseline:

npx safestar run scenarios/refund_bot.yaml
npx safestar baseline refund_flow

Now, you have a frozen snapshot of "Good Behavior" committed to your repo.

The "Diff" Engine

This is where the magic happens. When you change your code or the model updates, you run:

npx safestar diff scenarios/refund_bot.yaml

SafeStar doesn't check exact text equality (which would fail on nearly every run, since model outputs vary). Instead, it calculates Statistical Drift.

--- SAFESTAR REPORT ---
Status: FAIL

Metrics:
  Avg Length: 45 chars -> 120 chars
  Drift:      +166% vs baseline (WARNING)
  Variance:   0.2 -> 9.8 (High instability)

Violations:
  - must_not_contain: "I am just an AI": failed in 3 runs

It tells you three things (each sketched in code after this list):

  1. Drift: "Your output is 166% longer than usual."
  2. Variance: "Your model has become unstable/random."
  3. Regressions: "You triggered a negative constraint."
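
None of this math is exotic. Here is roughly what those three signals look like in Python; this is my illustration, not SafeStar's internals, and the units are invented for the example:

# metrics.py (illustrative; not SafeStar's actual implementation)
import statistics

def drift_report(baseline: list[str], current: list[str],
                 banned: list[str]) -> dict:
    old_len = [len(o) for o in baseline]
    new_len = [len(o) for o in current]
    return {
        # 1. Drift: relative change in average output length.
        "drift_pct": (statistics.mean(new_len) - statistics.mean(old_len))
                     / statistics.mean(old_len) * 100,
        # 2. Variance: run-to-run instability of the current outputs.
        "variance": statistics.pvariance(new_len),
        # 3. Regressions: how many runs tripped a negative constraint.
        "violations": sum(any(b in o for b in banned) for o in current),
    }

Length is a crude proxy, of course; the point is that even crude statistics catch regressions that a vibe check misses.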

Why This Matters (The Infrastructure View)

We are moving past the "Demo Phase" of Generative AI. We are entering the "Infrastructure Phase."

In the Demo Phase, a cool answer is a win. In the Infrastructure Phase, a surprising answer is a bug.

Tools like SafeStar are the guardrails that allow you to deploy with confidence. By adding safestar diff to your CI/CD pipeline (GitHub Actions), you effectively block any pull request that causes your AI to drift too far from the baseline.
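
A sketch of what that CI step might look like, assuming safestar diff exits non-zero on a FAIL status (the workflow and file name here are mine, not from SafeStar's docs):

# .github/workflows/drift.yml (illustrative)
name: behavior-lock
on: [pull_request]
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npx safestar diff scenarios/refund_bot.yaml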

It turns "I think it works" into "The variance score is 0.0, and it passed all heuristics."

Get Started

SafeStar is open source and available on NPM today.

You don't need to sign up. You don't need an API key. You just need to care about the quality of your software.

Stop trusting the vibes. Start trusting the diff.
