Agent-Native Perception

Treating perception as a controllable API, not a passive sensor. Agents decide what to look at and when they know enough.

Overview

Traditional vision systems process full frames regardless of task demands. We explore an alternative: closed-loop perception where an agent controls its own sensing—requesting crops, adjusting resolution, and terminating early when confidence is sufficient.

The camera becomes an API with calls like request_crop(region, resolution) and stop(). Each request carries a computational cost. The agent learns to balance information gain against resource expenditure.
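A minimal sketch of what such an interface could look like, assuming a static image as the underlying scene. The CameraAPI and Observation names, the per-pixel cost model, and the nearest-neighbor resampling are illustrative assumptions, not a published implementation:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    """A cropped, possibly downsampled view returned by the camera."""
    pixels: np.ndarray  # crop resized to the requested resolution
    cost: int           # pixels processed for this request


class CameraAPI:
    """Treats a static image as a queryable sensor: the agent pays
    per pixel requested and never sees the full frame for free.
    (Hypothetical interface, sketched from the prose above.)"""

    def __init__(self, frame: np.ndarray):
        self._frame = frame
        self._done = False

    def request_crop(self, region: tuple[int, int, int, int],
                     resolution: tuple[int, int]) -> Observation:
        """Return `region` (x0, y0, x1, y1) resampled to `resolution` (h, w)."""
        x0, y0, x1, y1 = region
        crop = self._frame[y0:y1, x0:x1]
        h, w = resolution
        # Nearest-neighbor resampling keeps the sketch dependency-free.
        rows = np.linspace(0, crop.shape[0] - 1, h).astype(int)
        cols = np.linspace(0, crop.shape[1] - 1, w).astype(int)
        return Observation(pixels=crop[rows][:, cols], cost=h * w)

    def stop(self) -> None:
        """Terminate the episode; the agent has decided it knows enough."""
        self._done = True
```

Requesting the whole scene remains possible, but it is charged at the full pixel count, which is exactly the behavior the cost model discourages.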

Approach

We simulate a controllable camera over static images and video. The agent never sees the full scene unless it explicitly requests it. A budget tracks pixels processed, latency incurred, and queries made. Perception is no longer free.
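One way to make that cost explicit is a small budget object that every request is charged against. The fields and the pixel-based exhaustion rule below are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class PerceptionBudget:
    """Cumulative cost of an agent's sensing requests
    (hypothetical accounting, mirroring the prose above)."""
    max_pixels: int
    pixels_processed: int = 0
    queries_made: int = 0
    latency_s: float = 0.0

    def charge(self, pixels: int, latency_s: float = 0.0) -> None:
        """Record one request against the budget."""
        self.pixels_processed += pixels
        self.queries_made += 1
        self.latency_s += latency_s

    @property
    def exhausted(self) -> bool:
        """True once the pixel allowance is spent."""
        return self.pixels_processed >= self.max_pixels
```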

Early experiments suggest adaptive sensing strategies can match full-frame accuracy while often processing 2–3× fewer pixels.

Demo

[Figure: agent attention over time]

Key Insight

Perception should be an action, not an input. When agents control their own sensing, they naturally develop efficient strategies: focusing on task-relevant regions, stopping when confident, and avoiding wasteful full-scene processing.
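A minimal closed loop under the assumptions above might look like the following sketch; policy and classifier are hypothetical stand-ins for whatever models select the next region and produce task predictions:

```python
def perceive(camera, budget, policy, classifier, threshold=0.9):
    """Closed-loop sensing: act, observe, stop when confident.

    `policy` maps the observation history to the next
    (region, resolution) request; `classifier` returns a
    (label, confidence) pair. Both are hypothetical stand-ins.
    """
    history = []
    while not budget.exhausted:
        region, resolution = policy(history)
        obs = camera.request_crop(region, resolution)
        budget.charge(pixels=obs.cost)
        history.append(obs)
        label, confidence = classifier(history)
        if confidence >= threshold:
            camera.stop()  # perception as an action: terminate early
            return label
    return classifier(history)[0]  # budget exhausted: best guess so far
```

Stopping is itself an action here: the loop ends when the classifier is confident or the budget runs out, never because a fixed number of frames was processed.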

[Figure: baseline vs agent-native comparison]