
desktop: trim old screenshot images from pi-mono floating-bar session history #6967

@beastoin


Problem

The floating-bar chat attaches a fresh ~500 KB WebP screenshot to every user message. Pi-mono stores the whole conversation as a message list and re-POSTs every prior message (including images) to /v2/chat/completions on every turn, so the request body grows by roughly one base64-encoded screenshot per turn until it exceeds the server's body limit.

Symptom (before raising the axum limit in #6965):

[app] ScreenCaptureManager: Screenshot captured 5120x2880, WebP 502 KB
[agent] Reusing pi-mono session: pi-session-1 (key=floating)
[agent] Pi-mono: including screenshot image in prompt (image/webp)
[pi-mono] turn_end ERROR: 413 Failed to buffer the request body: length limit exceeded

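Math per turn (each request carries the current image plus every prior one):

  • per-image: 502 KB × 4/3 ≈ 669 KB base64
  • body at turn 3: 3 × 669 KB ≈ 2.06 MB of image data, plus message text and JSON framing → exceeds axum's 2 MB default → 413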
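
The arithmetic above can be reproduced with a small sketch. It assumes "KB" in the log means 1024 bytes and that axum's default body limit is 2 MiB (2,097,152 bytes); `base64Length` and `imagePayloadAtTurn` are illustrative names, not part of pi-mono.

```typescript
// Rough request-size model for an untrimmed floating-bar session (illustrative only).
const KIB = 1024;
const WEBP_BYTES = 502 * KIB;               // screenshot size taken from the log above
const AXUM_DEFAULT_LIMIT = 2 * 1024 * KIB;  // assumed 2 MiB default request body limit

// Base64 emits 4 output characters for every 3 input bytes, rounded up for padding.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Image bytes carried by the request at a given 1-based turn when no history
// is trimmed: the current screenshot plus every earlier one.
function imagePayloadAtTurn(turn: number): number {
  return turn * base64Length(WEBP_BYTES);
}

for (let turn = 1; turn <= 4; turn++) {
  const bytes = imagePayloadAtTurn(turn);
  const headroom = AXUM_DEFAULT_LIMIT - bytes;
  console.log(`turn ${turn}: images=${bytes} B, headroom=${headroom} B`);
}
```

Under these assumptions the three images at turn 3 come to 2,056,200 bytes, which leaves only about 40 KB for message text and JSON framing before the limit is hit, and turn 4 exceeds the limit on image data alone.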

Interim mitigation (shipped)

PR #6965 raises the backend request body limit on /v2/chat/completions from 2 MB to 16 MB. That buys roughly 20 accumulated screenshots of headroom — enough for any realistic floating-bar session — but doesn't fix unbounded growth.

Proposed fix

In desktop/agent/src/adapters/pi-mono.ts, before serializing messages to the provider, walk the conversation and replace older image content blocks with a short text placeholder. Keep only the most recent image (that's the current screen state — the only one the model actually needs).

Pseudocode:

function stripOldImages(messages) {
  const hasImage = msg =>
    Array.isArray(msg.content) && msg.content.some(b => b.type === "image");
  // Find the index of the LAST message that contains an image block
  // (-1 if the conversation has no images, in which case nothing changes)
  const lastImageIdx = messages.findLastIndex(hasImage);
  if (lastImageIdx === -1) return messages;
  return messages.map((msg, i) => {
    // Leave the most recent image message and any image-free message untouched
    if (i === lastImageIdx || !hasImage(msg)) return msg;
    return {
      ...msg,
      // Swap each older image block for a short text placeholder
      content: msg.content.map(b =>
        b.type === "image"
          ? { type: "text", text: "[earlier screenshot omitted]" }
          : b
      ),
    };
  });
}

This:

  • Drops redundant visual data that the model doesn't need (the model is looking at the current screen)
  • Preserves full conversational text context
  • Keeps steady-state body size roughly constant regardless of turn count
  • Doesn't touch Swift — image quality stays full

Acceptance

  • After N ≥ 10 floating-bar turns in a single session, body size remains well under 2 MB
  • Model still responds coherently to follow-ups ("you said earlier that...") — text history intact
  • No regression on main-chat or ACP flows (only applies to pi-mono)
  • Unit test covering the replacement logic
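
A possible starting point for the unit test and the body-size check, written as a self-contained sketch. The `Message` and `Block` shapes, the `fakeSession` helper, and the `{ messages }` request envelope are assumptions made for illustration; the real pi-mono types and request format may differ. The function is a typed restatement of the pseudocode above, using a plain loop instead of `findLastIndex` so it does not depend on an ES2023 lib target.

```typescript
// Hypothetical message shapes; the real pi-mono types may differ.
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image"; mimeType: string; data: string };
type Block = TextBlock | ImageBlock;
type Message = { role: "user" | "assistant"; content: string | Block[] };

const PLACEHOLDER = "[earlier screenshot omitted]";

const hasImage = (m: Message): boolean =>
  Array.isArray(m.content) && m.content.some(b => b.type === "image");

function stripOldImages(messages: Message[]): Message[] {
  // Scan backwards for the last message that carries an image block.
  let lastImageIdx = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (hasImage(messages[i])) { lastImageIdx = i; break; }
  }
  return messages.map((msg, i): Message => {
    if (i === lastImageIdx || !Array.isArray(msg.content)) return msg;
    return {
      ...msg,
      content: msg.content.map((b): Block =>
        b.type === "image" ? { type: "text", text: PLACEHOLDER } : b),
    };
  });
}

// Build a fake session: each turn is a user message (text + image) followed
// by a plain-text assistant reply. The image is a stand-in string of ASCII.
function fakeSession(turns: number, imageChars: number): Message[] {
  const fakeImage = "A".repeat(imageChars);
  const out: Message[] = [];
  for (let t = 1; t <= turns; t++) {
    out.push({
      role: "user",
      content: [
        { type: "text", text: `question ${t}` },
        { type: "image", mimeType: "image/webp", data: fakeImage },
      ],
    });
    out.push({ role: "assistant", content: `answer ${t}` });
  }
  return out;
}

const countImages = (ms: Message[]): number => ms.filter(hasImage).length;
// String length equals byte length here because the payload is pure ASCII.
const bodyBytes = (ms: Message[]): number => JSON.stringify({ messages: ms }).length;

const session = fakeSession(10, 685_400); // ~669 KB of base64 per screenshot
const trimmed = stripOldImages(session);
console.log(`images: ${countImages(session)} -> ${countImages(trimmed)}`);
console.log(`body:   ${bodyBytes(session)} B -> ${bodyBytes(trimmed)} B`);
```

With ten simulated turns, the trimmed list keeps a single image and the serialized body stays close to the size of one screenshot, while the untrimmed list carries all ten. A real test would additionally assert on the adapter's actual serialized request rather than this stand-in envelope.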
