
Why Frontend Testing Is Broken and What In-Vivo Testing Fixes

The entire frontend testing ecosystem falls into two buckets that have never been unified. State-aware tools can't run tests. Testing tools can't see state. The bugs that matter most live in the gap between them.


The test that lies to you

A developer writes a Playwright E2E test for a chat application. The test sends a message, waits for a DOM update, and asserts that the response text appeared on screen. The test passes. CI is green.

In production, a user switches conversations while a response is still streaming. The stream continues writing to the old session's message array, but the UI is now rendering the new session. When the user switches back, they find the AI response attached to the wrong conversation. Data has been silently corrupted.

Playwright never saw it. It verified that text appeared in a DOM element. It has no concept of "session IDs," "store state," or "which conversation this message belongs to." The test asked "did text appear?" and the answer was yes. The test passed. The bug shipped.

This is not a failure of Playwright. It's a failure of the testing architecture. Playwright is doing exactly what it was designed to do — drive a browser from outside and verify DOM output. The problem is that an entire class of bugs exists below the DOM surface, in the state management layer, and no testing tool can see it.

Two camps, one gap

Every frontend testing tool falls into one of two architectural camps. Understanding this split is the key to understanding why your tests pass but your app still breaks.

Camp 1: Inside the process, fake environment

Vitest, Jest, and Bun test run in isolated Node.js processes, with jsdom simulating the browser environment; React Testing Library renders components into that simulated DOM. jsdom is not a real browser. It doesn't have a real rendering engine, real layout, or real async scheduling. State stores are mocked or re-created from scratch for each test.

This architecture is excellent for testing component rendering logic and pure functions. It's fast, deterministic, and isolated. But the mocked stores eliminate the exact timing that causes real bugs. When you mock a Zustand store, you're telling the test "assume the store behaves perfectly." The bugs we care about are the ones where the store does not behave perfectly — where concurrent writes race, where subscriptions fire in unexpected order, where Immer patches conflict.

Camp 2: Real environment, outside the process

Playwright communicates with the browser over the Chrome DevTools Protocol from a completely separate Node.js process. Cypress runs in an iframe alongside the application. Selenium uses WebDriver over HTTP.

These tools drive a real browser with real rendering. They can click buttons, fill forms, navigate pages, and read DOM text. But they're blind to internal state. Playwright cannot call useChatStore.getState(). It cannot subscribe to store mutations. It sees the output of state changes (DOM updates) but not the state changes themselves.

Cypress is a partial exception: because it runs in an iframe in the same browser, it can access cy.window().its('store') for one-time reads. But there's a fundamental difference between a one-time read and a real-time subscription:

// Cypress — one-time snapshot read via window
cy.window().its('__store__').then((store) => {
  const state = store.getState();
  expect(state.messages).to.have.length(2);
});
// Problem: This reads state at ONE point in time.
// It cannot observe the sequence of mutations that led here.
// It cannot detect if a race condition occurred and self-corrected.
// It cannot catch a mutation that was applied and then overwritten.

// Mosaic — real-time subscription in the same JS context
const recorder = createStateRecorder(useChatStore);
recorder.start();

await sendMessageAndWait('Hello');

const mutations = recorder.stop();
// We see EVERY mutation: message:add, stream:start, stream:chunk (x47),
// stream:complete, message:finalize.
// If a race condition caused a mutation to be applied and overwritten,
// it's in the log. If session IDs were remapped mid-stream, we see it.

A one-time read tells you where state ended up. A real-time subscription tells you how it got there. For timing bugs, the journey is the entire point.

The Bertolino taxonomy

Bertolino et al.'s 2021 ACM Computing Surveys paper established a taxonomy for testing approaches based on where and how tests execute:

  • In-vitro — Tests run in an isolated lab environment with mocked dependencies (Vitest, Jest).
  • Ex-vivo — Tests drive a real environment from outside the process boundary (Playwright, Cypress, Selenium).
  • Offline field testing — Analysis of production logs and traces after the fact (LogRocket, Sentry).
  • Online field testing / in-vivo — Tests execute inside the live application during operation.

Every major frontend testing tool falls into the first two categories. The online/in-vivo category — the one that catches timing bugs, state corruption, and race conditions — has been empty for JavaScript. Until Mosaic.

The bugs that live in the gap

The bugs that no existing tool catches fall into four categories. Each one requires both internal state visibility and real async execution to detect — exactly the combination that the two-camp split makes impossible.

State mutation race conditions

Two or more async operations write to the same store concurrently. A streaming AI response writes chunks to messages[sessionId] while a user action triggers a store update to the same path. In Zustand with Immer, this means two produce() calls racing — one may overwrite the other's patches.

Why mocks can't reproduce it: Mocked stores execute synchronously. The race condition only occurs when real async operations (API calls, WebSocket messages, IPC responses) interleave on the real event loop.

Why external drivers can't observe it: Playwright sees the DOM after all mutations have settled. If two racing writes produce the same final DOM output (which they often do — until they don't), the bug is invisible from outside.
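The hazard can be reproduced with nothing but the event loop. A minimal sketch in plain TypeScript (hypothetical store shape, no Zustand or Immer): two async writers each snapshot the same array, await a real async gap, then write back a copy built from their stale snapshot, which is classic last-write-wins.

```typescript
// Two writers race on store.messages. Each reads, awaits (like a network
// or IPC round trip), then writes back from its stale snapshot.
type State = { messages: string[] };
const store: State = { messages: [] };

const tick = () => new Promise<void>((r) => setTimeout(r, 0));

async function appendAfterDelay(msg: string, delayTicks: number) {
  const snapshot = [...store.messages];              // read
  for (let i = 0; i < delayTicks; i++) await tick(); // real async gap
  store.messages = [...snapshot, msg];               // stale write-back
}

async function raceDemo(): Promise<string[]> {
  await Promise.all([
    appendAfterDelay('stream chunk', 1), // lands first...
    appendAfterDelay('user edit', 2),    // ...then overwrites it
  ]);
  return store.messages; // only 'user edit' survives; the chunk is lost
}
```

Both writers "succeed" and neither throws, which is exactly why the final DOM can look plausible while a write has silently vanished.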

Session lifecycle bugs

Session ID remapping during active operations. Creating a new conversation while a previous one is still streaming. Fork points where a conversation splits into branches. Ghost sessions that persist in the store after deletion.

Why mocks can't reproduce it: Session lifecycle bugs emerge from the interaction between multiple async operations across multiple store slices. Mocking one store slice in isolation removes the cross-slice interaction that causes the bug.

Why external drivers can't observe it: Session IDs are internal state. The DOM renders conversation content, not session metadata. A user sees "messages in a conversation" — they don't see which internal session ID those messages are attached to.

IPC ordering issues

In Electron or Tauri applications, commands sent to the backend may return responses out of order. If the frontend sends createSession followed by sendMessage, the sendMessage response may arrive before createSession completes, causing the message to be written to a non-existent session.

Why mocks can't reproduce it: Mocked IPC channels return responses synchronously or with deterministic delays. Real IPC is non-deterministic.

Why external drivers can't observe it: IPC is internal to the application process. Playwright cannot intercept or observe Tauri command invocations.
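The ordering hazard can be sketched with a mock backend (the invoke name and delays here are illustrative, not Tauri's API): createSession is slow, sendMessage is fast, and issuing both without awaiting the first lets the message response land before the session exists.

```typescript
// Mock backend: `sessions` stands in for server-side state; `invoke`
// simulates an IPC round trip with realistic, uneven latency.
const sessions = new Map<string, string[]>();
const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function invoke(cmd: 'createSession' | 'sendMessage', arg?: string) {
  if (cmd === 'createSession') {
    await delay(20);               // slow backend work
    sessions.set('s1', []);
    return 's1';
  }
  await delay(1);                  // fast path, resolves first
  const msgs = sessions.get('s1');
  if (!msgs) throw new Error('write to non-existent session');
  msgs.push(arg as string);
  return 'ok';
}

// The bug: sendMessage is issued without awaiting createSession.
async function orderingDemo(): Promise<string> {
  const results = await Promise.allSettled([
    invoke('createSession'),
    invoke('sendMessage', 'hi'),   // resolves before the session exists
  ]);
  return results[1].status;        // 'rejected'
}
```

With a mocked IPC channel that resolves in call order, the same test would pass every time.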

Checkpoint corruption

Applications that support undo/redo, conversation rewind, or branching need to merge checkpoint data with live state. If a rewind is triggered while a stream is active, the checkpoint restore may conflict with in-flight mutations. Stale data gets merged with live data, producing a state that is neither the checkpoint nor the current state.

Why mocks can't reproduce it: The conflict requires real concurrent writes — real stream data arriving while a real checkpoint restore is in progress.

Why external drivers can't observe it: The corruption is in the state shape, not the DOM. The UI may render correctly from the corrupted state for most interactions, only failing on specific operations later.
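A sketch of that conflict in plain TypeScript (no real checkpoint API; the shapes are illustrative): a stream chunk already scheduled on the event loop lands after the checkpoint restore, so the final state is neither the checkpoint nor the pre-rewind state.

```typescript
type Chat = { messages: string[] };

let live: Chat = { messages: ['user: hi', 'ai: partial'] };
const checkpoint: Chat = { messages: ['user: hi'] };

const tick = () => new Promise<void>((r) => setTimeout(r, 0));

// A chunk that was already in flight when the rewind was triggered.
async function inFlightChunk() {
  await tick();
  live.messages.push('ai: more'); // writes into whatever `live` is now
}

async function rewindDemo(): Promise<string[]> {
  const pending = inFlightChunk();               // stream still active
  live = { messages: [...checkpoint.messages] }; // checkpoint restore
  await pending;                                 // stale write lands
  return live.messages; // ['user: hi', 'ai: more'], a corrupted hybrid
}
```

The restored state renders fine until something depends on the message list matching the checkpoint, which is why this class of bug surfaces long after the operation that caused it.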

What the research says

The idea of running tests inside a live application is not new. What's new is doing it for frontend JavaScript. The academic lineage is clear — and so is the gap.

Murphy and Kaiser: The Invite framework (2008–2010)

Christian Murphy and Gail Kaiser at Columbia University coined "in-vivo testing" and built Invite, a framework for executing JUnit tests inside running Java applications. Their key insight was that some bugs only manifest under real workload conditions — real data, real concurrency, real resource contention. Lab testing (in-vitro) cannot reproduce these conditions.

Invite proved the architecture was viable: test code injected into the production runtime, executing alongside real operations, with structured pass/fail reporting. But it was Java-only, focused on server-side applications, and never extended to frontend or JavaScript contexts.

Bertolino et al.: The Groucho framework (2023)

Antonia Bertolino's team at CNR-ISTI in Pisa advanced in-vivo testing with Groucho, introducing selective object isolation for Java applications. Groucho can run tests on specific objects within the live application without affecting overall application behavior. Their 2021 ACM Computing Surveys paper surveyed 80 field-testing papers across the major databases and established the in-vitro/ex-vivo/in-vivo taxonomy we reference throughout this post.

The survey's finding: zero approaches targeting frontend JavaScript. Every in-vivo testing framework in the literature targets Java, .NET, or C/C++. The web frontend — arguably the most widely deployed software category on the planet — has no in-vivo testing capability.

fast-check: Race condition detection, isolated

fast-check by Nicolas Dubien is a property-based testing library for JavaScript that includes a scheduler for detecting race conditions. The scheduler can model concurrent async operations and explore different interleavings. It's the closest existing JavaScript tool to race condition detection.

But fast-check runs in an isolated Node.js process, not inside the application. It models concurrency abstractly rather than observing real concurrency. You define the async operations to schedule; fast-check explores orderings. This catches bugs in isolated async logic, but not bugs caused by the interaction between your store, your UI, your IPC layer, and your backend — all operating simultaneously in the real application.

Dear ImGui Test Engine: Proof of architecture

Dear ImGui Test Engine is the most architecturally similar tool to Mosaic. It runs test code inside the Dear ImGui C++ application, with direct access to the UI state tree, real-time input simulation, and structured test execution. It scores 5/7 on our criteria.

Dear ImGui Test Engine proves the architecture works. Tests that run inside the application, with direct state access, catch bugs that external drivers miss. The limitation is purely language and platform: C++ only, game engine UIs only. Nobody built the equivalent for JavaScript web applications.

How Mosaic closes the gap

Mosaic is the first in-app testing framework for frontend JavaScript applications. The architecture is straightforward: test code runs in the same JavaScript context as the application. No separate process, no iframe, no Chrome DevTools Protocol, no jsdom.

Architecture

Mosaic's test runtime is loaded into the application's module graph. It shares the same window, the same event loop, the same module scope. When a test imports useChatStore, it gets the exact same Zustand store instance that the UI components render from. When a test calls sendMessage(), it triggers the same IPC call that the user's click handler triggers.

Store integration

Mosaic provides direct subscribe() access to any state management library that supports subscriptions. For Zustand:

// Direct subscription — every mutation, in real time
const unsubscribe = useChatStore.subscribe((state, prevState) => {
  // Called on EVERY state change
  // state is the new state, prevState is the previous state
  // Diff them to see exactly what changed
});

For Redux, it's store.subscribe(). For MobX, it's autorun() or reaction(). The pattern is the same: the test hooks into the store's native change notification mechanism and observes every mutation as it happens.
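The pattern can be shown without any library. Here is a minimal hand-rolled store with the same subscribe() contract (hypothetical, for illustration only), plus a listener that diffs top-level keys on each change:

```typescript
type Listener<S> = (state: S, prevState: S) => void;

// Minimal store exposing the contract described above.
function createStore<S extends Record<string, unknown>>(initial: S) {
  let state = initial;
  const listeners = new Set<Listener<S>>();
  return {
    getState: () => state,
    setState(partial: Partial<S>) {
      const prev = state;
      state = { ...state, ...partial };
      listeners.forEach((l) => l(state, prev));
    },
    subscribe(l: Listener<S>) {
      listeners.add(l);
      return () => listeners.delete(l); // unsubscribe
    },
  };
}

// Observe every mutation and record which keys changed.
const demoStore = createStore({ count: 0, label: 'idle' });
const changedKeys: string[][] = [];
demoStore.subscribe((state, prev) => {
  changedKeys.push(Object.keys(state).filter((k) => state[k] !== prev[k]));
});
demoStore.setState({ count: 1 });
demoStore.setState({ label: 'busy' });
// changedKeys is now [['count'], ['label']]
```

Whatever the library, the test is just another listener: it sees the same notifications, in the same order, as the UI does.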

Mutation tracing

The createStateRecorder() utility wraps store subscriptions into a structured log:

const recorder = createStateRecorder(useChatStore);
recorder.start();

// ... run test operations ...

const mutations = recorder.stop();
// mutations: Array<{
//   timestamp: number,
//   type: string,        // inferred action type
//   path: string[],      // state path that changed
//   prev: unknown,       // previous value
//   next: unknown,       // new value
//   stackTrace: string,  // call stack at mutation time
// }>

Every mutation is logged with a timestamp, the state path that changed, the previous and new values, and the call stack at mutation time. This is how Mosaic detects race conditions: if two mutations write to the same state path within a narrow time window, and the second write overwrites the first without incorporating its changes, that's a data race.
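That check can be sketched as a pass over the recorded log. The Mutation shape mirrors the recorder output above; detectRaces itself is an illustrative sketch, not Mosaic's API:

```typescript
interface Mutation {
  timestamp: number;
  path: string[];
  prev: unknown;
  next: unknown;
}

// A write pair is suspicious when both touch the same path within the
// window and the second write's `prev` is not the first write's `next`:
// the second writer read stale state and clobbered the first's change.
function detectRaces(
  mutations: Mutation[],
  windowMs = 50,
): Array<[Mutation, Mutation]> {
  const races: Array<[Mutation, Mutation]> = [];
  for (let i = 0; i < mutations.length; i++) {
    for (let j = i + 1; j < mutations.length; j++) {
      const a = mutations[i];
      const b = mutations[j];
      if (b.timestamp - a.timestamp > windowMs) break;
      const samePath = a.path.join('.') === b.path.join('.');
      const staleRead = JSON.stringify(b.prev) !== JSON.stringify(a.next);
      if (samePath && staleRead) races.push([a, b]);
    }
  }
  return races;
}
```

Note that a well-behaved second write, one whose prev equals the first write's next, is not flagged: sequential updates to the same path are normal.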

Zero-mock execution

Because Mosaic runs inside the application, it uses the application's real code paths for everything. API calls go to real endpoints. IPC commands invoke real backend handlers. File system operations use real files. WebSocket connections are real WebSocket connections.

This is not a philosophical stance against mocking. It's a practical observation: the bugs we care about only manifest when real async operations interleave on the real event loop. Mocks eliminate that interleaving by definition.

Multi-phase test sequences

Mosaic's step runner supports multi-phase test sequences with bail-on-failure, per-step timing, and structured results:

const test = mosaic.define('full-lifecycle', async (t) => {
  const recorder = createStateRecorder(useChatStore);

  // Phase 1: Setup and send
  await t.step('create session', async () => {
    const { createSession } = useChatStore.getState();
    const sessionId = createSession();
    t.assert(typeof sessionId === 'string', 'session ID is a string');
    t.context.sessionId = sessionId;
  });

  await t.step('send message', async () => {
    recorder.start();
    const { sendMessage } = useChatStore.getState();
    sendMessage('Explain closures in JavaScript');
  });

  // Phase 2: Observe streaming
  await t.step('wait for response', async () => {
    await waitForAgentComplete(useChatStore, { timeout: 30_000 });
    const { messages } = useChatStore.getState();
    t.assert(messages.length >= 2, 'user + agent messages exist');

    // Verify session consistency
    const allSameSession = messages.every(
      m => m.sessionId === t.context.sessionId
    );
    t.assert(allSameSession, 'all messages belong to same session');
  });

  // Phase 3: Rewind and verify integrity
  await t.step('rewind', async () => {
    const { rewindToCheckpoint } = useChatStore.getState();
    rewindToCheckpoint(0);
  });

  await t.step('verify rewind state', async () => {
    const { messages } = useChatStore.getState();
    t.assert(messages.length === 1, 'only user message remains');

    const mutations = recorder.stop();
    const rewindMutation = mutations.find(m => m.type === 'REWIND');
    t.assert(rewindMutation != null, 'rewind mutation was recorded');

    // Verify no orphaned data
    const postRewindMutations = mutations.filter(
      m => m.timestamp > (rewindMutation?.timestamp ?? 0)
    );
    const hasStaleWrite = postRewindMutations.some(
      m => m.path.includes('messages') && m.type !== 'REWIND'
    );
    t.assert(!hasStaleWrite, 'no stale writes after rewind');
  });
});

Each step runs sequentially. If any step fails, the test bails immediately with a structured report showing which step failed, what the assertion was, and the full mutation log up to that point.

Capability              Mosaic   Playwright   Cypress    Vitest   DevTools
In-app execution        Yes      No           Iframe     No       Yes
Direct store subscribe  Yes      No           Snapshot   Mocked   Yes
Mutation tracing        Yes      No           No         No       Yes
Structured runner       Yes      Yes          Yes        Yes      No
Zero mocks              Yes      Yes          Yes        No       N/A
Race detection          Yes      No           No         No       No
Multi-phase sequences   Yes      Yes          Yes        Yes      No

Where Mosaic is today

Mosaic was born inside Recursive Labs while building Orbit, our AI code editor. We kept hitting bugs that no existing tool could catch: session ID remapping mid-stream, sidebar entries vanishing during streaming, checkpoint mutations being silently dropped. We built Mosaic because we had no alternative.

After months of internal use across our product suite (Orbit, Phractal, and Mosaic itself), we're packaging it into a standalone framework. The core runtime — in-app execution, store subscription, mutation tracing, step runner — is stable and battle-tested. We're currently building the public API surface, documentation at mosaic.sh/docs, and framework adapters.

For the product introduction and the story of how we built it, read Introducing Mosaic.

Beta access is coming soon.

Join the waitlist to be the first to use the in-app testing framework that sees what Playwright, Cypress, and Vitest can't.

Request early access

FAQ

Why do my frontend tests pass while bugs still ship?

Your tests verify DOM output or mocked state — not the internal state transitions that cause real bugs. Race conditions, session lifecycle errors, and checkpoint corruption happen below the DOM surface, in the state management layer. Tools that operate from outside the process (Playwright) or with mocked state (Vitest) cannot observe these bugs.

Can Playwright access my application's internal state?

No. Playwright communicates with the browser over the Chrome DevTools Protocol from a separate Node.js process. It can execute JavaScript in the page context via page.evaluate(), but it cannot import React components, subscribe to store mutations, or maintain a persistent connection to the application's module graph.

How does Mosaic access my Zustand store?

Mosaic runs inside your application and imports your Zustand store directly — the same instance your UI renders from. Call getState() to read current state, call subscribe() to observe every mutation in real time. No mock setup, no state re-creation.

What is in-vivo testing?

In-vivo testing means executing tests inside the live application, sharing the same process, memory, and runtime context. The term was coined by Murphy and Kaiser at Columbia University in 2008 for Java applications. Mosaic brings in-vivo testing to frontend JavaScript for the first time.

How is in-app testing different from E2E testing?

E2E testing drives the application from outside — a separate process clicks buttons and reads DOM text. In-app testing runs inside the application — test code shares the same JavaScript context and can directly access stores, subscribe to mutations, and observe async interleaving. E2E tests verify user-facing behavior. In-app tests verify internal state integrity.

Why can't traditional tools catch race conditions in React applications?

Race conditions in React applications typically involve concurrent async operations writing to the same state. Traditional tools can't catch them because mocks eliminate real timing and external drivers can't observe internal state. Mosaic's mutation tracing records every state change with timestamps, detecting when concurrent writes produce conflicting mutations.

Is Cypress better than Playwright for state testing?

Cypress has a marginal advantage: because it runs in an iframe, it can access window.store for one-time state reads. But one-time reads cannot detect race conditions or observe mutation sequences. Both Cypress and Playwright are fundamentally limited by operating outside the application's module graph.

Should I use Mosaic instead of Vitest or Jest?

Vitest and Jest test Zustand stores in isolation with mocked environments. Mosaic tests Zustand stores inside the live application with real-time subscription to every mutation. For unit testing store logic, use Vitest. For testing store behavior under real application conditions, use Mosaic.

Does Mosaic test against real backends?

Yes. Mosaic runs inside your application and uses your application's real code paths. API calls hit real endpoints, IPC commands invoke real backend handlers, WebSocket connections are real connections. This is how Mosaic catches timing bugs, serialization errors, and response ordering issues that only appear with real backends.

What is mutation tracing?

Mutation tracing is the practice of recording every state change in your application's stores with timestamps, action types, before/after values, and call stacks. Mosaic's createStateRecorder() provides this capability, enabling tests to assert not just on final state but on the complete sequence of mutations that produced it.


References: Murphy and Kaiser (2008), Bertolino et al. (2021), and the fast-check property-based testing library by Nicolas Dubien.