
The 'Works on My NPU' Problem: Why Local AI Needs Cloud CI

June 15, 2026 • PrevHQ Team

You just shipped your new “AI-Powered” note-taking app. It uses Ollama and a quantized Llama-3-8B model running locally on the user’s machine. No API keys. No cloud bills. Complete privacy.

It runs beautifully on your M3 Max MacBook Pro. Token generation is instant. The summarization feature feels like magic.

You deploy. You go to sleep.

You wake up to a GitHub Issues page that looks like a crime scene.

  • Issue #402: “App crashes on launch. Windows 11, RTX 3060.”
  • Issue #403: “Summarization takes 4 minutes and freezes my laptop. Intel MacBook Air 2020.”
  • Issue #404: “Model hallucinates garbage JSON. Linux, CPU only.”

You have just discovered the “Works on My NPU” problem.

The Matrix of Hell

In traditional web development, we solved “Works on My Machine” with Docker. Containers gave us a predictable, reproducible environment from dev to prod.

In Local AI development, Docker is not enough. You are at the mercy of the Hardware Matrix.

  • Silicon Diversity: Apple Silicon (Metal), NVIDIA (CUDA), AMD (ROCm), Intel (OpenVINO), CPU-only fallback.
  • VRAM Constraints: Does the user have 8GB, 16GB, or 24GB of unified memory?
  • Quantization Sensitivity: A q4_0 model might work on your machine, but a q8_0 crashes theirs.

You cannot buy every laptop on the market to test your app. And you cannot replicate these constraints on your localhost.

The CI Blind Spot

So, you turn to your CI pipeline. You try to run integration tests in GitHub Actions.

“Error: No GPU found.”

Standard CI runners are designed for CPU tasks. They cannot run 8B-parameter models efficiently. Even if they could, they cannot simulate the specific VRAM constraints of your users.

You are flying blind. You are shipping probabilistic software on deterministic infrastructure.

Enter the Cloud Device Lab

To ship Local AI with confidence, you have to embrace a paradox: you need the Cloud to test Local.

You need an ephemeral infrastructure that can spin up, configure a specific hardware profile, run your app’s inference logic, and tear itself down.

This is where PrevHQ changes the game for the Local-First AI Architect.

1. The Ephemeral Ollama Instance

Instead of mocking your AI calls (which defeats the purpose of testing the model), you spin up a real PrevHQ instance with Ollama pre-installed.

You configure the environment to match your target user profiles:

  • Profile A: High-End (NVIDIA GPU, 24GB VRAM).
  • Profile B: Mid-Range (Apple M1 Simulation, 16GB RAM).
  • Profile C: Low-End (CPU Only, 8GB RAM).
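A profile matrix like the one above can be expressed in code and used to drive your test runs. The sketch below is illustrative, not a real PrevHQ schema; the field names and the memory-headroom heuristic are assumptions, and the 4.7 GB figure is the rough on-disk size of a q4_0 Llama-3-8B.

```python
# Hypothetical hardware-profile matrix for behavioral tests.
# Field names and values are illustrative, not a real PrevHQ schema.
PROFILES = {
    "high_end":  {"accelerator": "cuda",  "vram_gb": 24, "ram_gb": 32, "quant": "q8_0"},
    "mid_range": {"accelerator": "metal", "vram_gb": 16, "ram_gb": 16, "quant": "q4_K_M"},
    "low_end":   {"accelerator": "cpu",   "vram_gb": 0,  "ram_gb": 8,  "quant": "q4_0"},
}

def model_fits(profile: dict, model_size_gb: float) -> bool:
    """Rough pre-flight check: does the quantized model fit the profile's
    memory budget? Uses VRAM when an accelerator has it, system RAM otherwise."""
    budget = profile["vram_gb"] or profile["ram_gb"]
    return model_size_gb <= budget * 0.8  # leave headroom for KV cache, OS, app

# A q4_0 Llama-3-8B weighs roughly 4.7 GB on disk.
for name, profile in PROFILES.items():
    print(f"{name}: fits={model_fits(profile, 4.7)}")
```

Running your golden-prompt suite once per profile turns "works on my NPU" into an explicit test matrix instead of a hope.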

2. The Golden Prompt Suite

You don’t just run unit tests. You run Behavioral Regression Tests.

You feed the instance a set of “Golden Prompts”—inputs with known, expected outputs.

  • Input: “Summarize this meeting-notes JSON.”
  • Expected Output: A valid JSON object with summary and action_items keys.

If the model on Profile C (Low-End) returns broken JSON because it ran out of context window or VRAM, the test fails.

You catch the bug before the user on the 2020 MacBook Air does.
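A golden-prompt check like the one above can be sketched in a few lines against Ollama's REST API (the default endpoint is `http://localhost:11434/api/generate`). The model tag and prompt text here are placeholders; the validator is the part your CI actually asserts on.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default REST endpoint

def validate_summary(raw: str) -> bool:
    """Golden-prompt check: the reply must be valid JSON with the expected keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"summary", "action_items"} <= obj.keys()

def run_golden_prompt(prompt: str, model: str = "llama3:8b") -> str:
    """Send one golden prompt to a live Ollama instance and return the raw reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Against a live instance you would run, per hardware profile:
#   reply = run_golden_prompt("Summarize this meeting-notes JSON: ...")
#   assert validate_summary(reply), f"Behavioral regression: {reply!r}"
print(validate_summary('{"summary": "ok", "action_items": []}'))
```

The same `validate_summary` runs identically against every profile; only the instance underneath changes.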

3. Visual Debugging

When a test fails in CI, logs aren’t enough. You need to see why the model is hallucinating. With PrevHQ, you get a live preview URL for the failing instance.

You can exec into the container, watch the ollama run output stream, and see exactly where the inference broke down.
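Because Ollama streams its generation as newline-delimited JSON, you can also watch the token stream programmatically from your test harness. A minimal sketch (requires a running `ollama serve`; the streaming function is defined but not invoked here):

```python
import json
import urllib.request

def parse_chunk(line: bytes):
    """Ollama streams one JSON object per line; extract the token text
    and the 'done' flag that marks the final chunk."""
    chunk = json.loads(line)
    return chunk.get("response", ""), chunk.get("done", False)

def watch_inference(prompt: str, model: str = "llama3:8b",
                    url: str = "http://localhost:11434/api/generate"):
    """Stream tokens from a live Ollama instance so you can see exactly
    where a slow or failing inference goes off the rails."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            token, done = parse_chunk(line)
            print(token, end="", flush=True)
            if done:
                print()  # the final chunk also carries Ollama's eval timing stats
```

Watching tokens arrive one by one on a low-end profile makes a 4-minute summarization visible long before a user files the issue.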

Stop Shipping Blind

The future of AI is local. Privacy, latency, and cost demand it. But “Local” does not mean “Untested.”

If you are building the next great Local AI app, don’t let hardware fragmentation kill your adoption. Treat your model like code. Test it in the cloud.

Ship with confidence, no matter whose machine it runs on.


FAQ: Testing Local LLM Apps in CI/CD

Q: Why can’t I just use GitHub Actions for testing Ollama apps?

A: Standard GitHub Actions runners are CPU-only and have limited RAM. Even a quantized 8B model will run excruciatingly slowly or crash against their memory limits. They also cannot simulate different hardware accelerators (CUDA/Metal), making it impossible to catch hardware-specific bugs.

Q: How do I test local LLM apps in CI/CD?

A: You need an ephemeral environment provider that supports GPU acceleration or high-memory instances. Tools like PrevHQ allow you to spin up on-demand containers with Ollama pre-installed, run your test suite against them, and destroy them, giving you a “clean slate” for every test run.

Q: What is the “Works on My NPU” problem?

A: It is the AI version of “Works on My Machine.” A model that performs well on a developer’s high-end machine (e.g., M3 Max with 128GB RAM) may fail, hallucinate, or crash on a user’s constrained device (e.g., Intel laptop with 8GB RAM) due to differences in quantization, VRAM availability, and compute precision.

Q: Should I mock LLM calls in my tests?

A: For unit tests of UI/logic, yes. But for Integration and System tests, absolutely not. You need to verify that the actual model, running in the actual runtime (Ollama), produces valid outputs given the constraints. Mocking hides the most common failure modes of Local AI apps (timeouts, hallucinations, broken JSON).
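The split the answer describes looks like this in practice. The function names (`ask_model`, `summarize`) are hypothetical stand-ins for your app's code, not a real API:

```python
import json
from unittest.mock import patch

def ask_model(prompt: str) -> str:
    """Stand-in for the real Ollama call (hypothetical helper name)."""
    raise NotImplementedError("the real implementation talks to Ollama")

def summarize(notes: str) -> dict:
    """App logic under test: prompt the model, parse its JSON reply."""
    return json.loads(ask_model(f"Summarize: {notes}"))

def test_summarize_parses_reply():
    """Unit test: mock the LLM so only the parsing/UI logic is exercised."""
    canned = '{"summary": "short", "action_items": []}'
    with patch(f"{__name__}.ask_model", return_value=canned):
        assert summarize("meeting notes ...")["summary"] == "short"

# Integration test: same summarize(), but with ask_model wired to a real
# Ollama instance on real hardware profiles -- no mocks, so timeouts,
# truncation, and broken JSON surface here instead of in production.
```

The mock proves your plumbing; only the unmocked run proves the model.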
