Overview
Traditional vision systems process full frames regardless of task demands. We explore an alternative: closed-loop perception where an agent controls its own sensing—requesting crops, adjusting resolution, and terminating early when confidence is sufficient.
The camera becomes an API with calls like request_crop(region, resolution) and stop(). Each request carries computational cost. The agent learns to balance information gain against resource expenditure.
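A minimal sketch of what such a camera API might look like. The call names request_crop and stop come from the description above; the Region type, the pixel-proportional cost model, and the exact signatures are assumptions made for illustration, not the project's actual interface.

```python
# Sketch of a camera-as-API wrapper: the agent only ever sees what it asks for,
# and every request is metered. Cost model and signatures are assumptions.
from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    x: int  # top-left column in the full-resolution frame
    y: int  # top-left row
    w: int  # crop width in pixels
    h: int  # crop height in pixels


class CameraAPI:
    def __init__(self, frame: np.ndarray):
        self._frame = frame        # full scene; never exposed directly
        self.pixels_processed = 0  # running resource expenditure
        self.queries = 0
        self.done = False

    def request_crop(self, region: Region, resolution: float = 1.0) -> np.ndarray:
        """Return a (possibly downsampled) crop and charge its pixel cost."""
        crop = self._frame[region.y:region.y + region.h,
                           region.x:region.x + region.w]
        if resolution < 1.0:
            step = max(1, int(round(1.0 / resolution)))
            crop = crop[::step, ::step]  # cheap stand-in for proper resampling
        self.pixels_processed += crop.shape[0] * crop.shape[1]
        self.queries += 1
        return crop

    def stop(self) -> None:
        """Terminate the episode once the agent is confident enough."""
        self.done = True
```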
Approach
We simulate a controllable camera over static images and video. The agent never sees the full scene unless it explicitly requests it. A budget tracks pixels processed, latency incurred, and queries made. Perception is no longer free.
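A sketch of how that budget bookkeeping could work, assuming a simple per-request debit. The tracked quantities (pixels processed, latency, queries) follow the text; the Budget class, its field names, and the limits shown are hypothetical.

```python
# Hypothetical budget tracker: each sensing request debits pixels, latency,
# and a query count; the agent should stop once any limit is exceeded.
from dataclasses import dataclass


@dataclass
class Budget:
    max_pixels: int
    max_latency_s: float
    max_queries: int
    pixels: int = 0          # pixels processed so far
    latency_s: float = 0.0   # latency incurred so far
    queries: int = 0         # sensing requests issued so far

    def charge(self, pixels: int, latency_s: float) -> None:
        """Debit one sensing request against the budget."""
        self.pixels += pixels
        self.latency_s += latency_s
        self.queries += 1

    def exhausted(self) -> bool:
        """True once any resource limit has been exceeded."""
        return (self.pixels > self.max_pixels
                or self.latency_s > self.max_latency_s
                or self.queries > self.max_queries)


budget = Budget(max_pixels=200_000, max_latency_s=0.5, max_queries=10)
budget.charge(pixels=128 * 128, latency_s=0.004)  # e.g. one 128x128 crop
if budget.exhausted():
    pass  # the agent should call stop() rather than issue further requests
```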
Early experiments suggest adaptive sensing strategies can match full-frame accuracy while often processing 2–3× fewer pixels.
Demo
Key Insight
Perception should be an action, not an input. When agents control their own sensing, they naturally develop efficient strategies: focusing on task-relevant regions, stopping when confident, and avoiding wasteful full-scene processing.