LatentPilot learns to dream ahead before acting: it leverages future observations during training to internalize action-conditioned visual dynamics, while requiring no future frames at inference time.
A future-aware navigation paradigm that learns latent visual reasoning from action-induced scene dynamics.
Existing vision-and-language navigation models mainly reason over past and current observations, while largely overlooking how actions reshape future views. LatentPilot addresses this limitation by learning action-conditioned visual dynamics from future observations during training. Its latent tokens evolve across steps, serve as both output and next-step input, and enable the agent to reason about what the scene will look like after acting.
LatentPilot learns to anticipate near-future visual changes, helping the agent understand not only what it sees now, but also how the world is likely to evolve under candidate actions.
The model is iteratively retrained on on-policy trajectories to better match the agent's behavior distribution, with expert takeover used as a safeguard when the policy deviates too much.
Visual latent tokens are carried across steps in a continuous latent space, enabling compact memory, global attention, and future-aware decision making without explicit latent supervision.
Experiments on R2R-CE, RxR-CE, and R2R-PE demonstrate new state-of-the-art results, while real-robot tests highlight improved understanding of environment-action dynamics in diverse scenes.
Multi-view demonstrations of LatentPilot in navigation scenarios.
Future-aware VLN through latent visual reasoning.
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and changes in the visual world, which limits robust decision-making. Humans, in contrast, can “imagine” the near future by exploiting the causal link between actions and scene dynamics, which improves both environmental understanding and navigation choices.
Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we introduce a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent’s behavior distribution, with an expert takeover triggered when the agent deviates too far from the expert.
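The flywheel loop resembles DAgger-style iterative imitation: roll out the current policy, label visited states with the expert, hand control back to the expert when the rollout drifts too far, and retrain on the aggregated data. A minimal sketch in a 1-D toy environment, where all names (`expert_action`, `run_flywheel`, `max_dev`) are illustrative assumptions rather than the paper's actual interfaces, and "retraining" is mocked as a lookup over the collected expert labels:

```python
import random

def expert_action(state, goal):
    """Hypothetical expert: move one unit toward the goal."""
    return 1 if state < goal else -1

def run_flywheel(policy, goal=5, rounds=3, horizon=12, max_dev=3):
    """Toy flywheel: iterative on-policy collection with expert takeover.

    `policy` maps state -> action. Each round, the expert labels every
    state the agent visits; the aggregated (state, label) pairs stand in
    for the retraining data. If the agent's state deviates from the
    expert's by more than `max_dev`, the expert takes over (state reset).
    """
    dataset = {}
    for _ in range(rounds):
        state, expert_state = 0, 0
        for _ in range(horizon):
            # Expert labels every visited on-policy state.
            dataset[state] = expert_action(state, goal)
            state += policy(state)
            expert_state += expert_action(expert_state, goal)
            # Expert takeover safeguard on excessive deviation.
            if abs(state - expert_state) > max_dev:
                state = expert_state
        # Mocked "retrain": behavior cloning via lookup on aggregated data,
        # falling back to the expert on unseen states.
        policy = lambda s, d=dict(dataset): d.get(s, expert_action(s, goal))
    return policy, dataset

random.seed(0)
trained, data = run_flywheel(lambda s: random.choice([-1, 1]))
```

After a few rounds the retrained policy agrees with the expert on every state the agent actually visits, which is the point of matching the behavior distribution rather than only the expert's own trajectories.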
LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to “dream ahead” and reason about how actions will affect subsequent observations. Experiments on the R2R-CE, RxR-CE, and R2R-PE benchmarks set new state-of-the-art results, and real-robot tests across diverse environments demonstrate LatentPilot’s superior understanding of environment-action dynamics in diverse scenes.
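The carry-over of latent tokens can be pictured as a recurrence in continuous token space: at each step the latents attend over the joint sequence of current observation tokens and themselves, and the updated latents are both part of the step's output and the input for the next step. A minimal single-head attention sketch in NumPy, where the class name, token counts, and weight shapes are all illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LatentTokenAgent:
    """Toy recurrent latent-token module (illustrative sketch)."""

    def __init__(self, n_latent=4, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.latent = rng.normal(size=(n_latent, dim)) * 0.1  # learned init
        self.Wq = rng.normal(size=(dim, dim)) * 0.1
        self.Wk = rng.normal(size=(dim, dim)) * 0.1
        self.Wv = rng.normal(size=(dim, dim)) * 0.1

    def step(self, obs_tokens):
        # Joint sequence: current observation tokens + carried latents,
        # so the latents attend globally over everything available now.
        seq = np.concatenate([obs_tokens, self.latent], axis=0)
        q = self.latent @ self.Wq            # latents are the queries
        k, v = seq @ self.Wk, seq @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        # Updated latents: this step's output AND next step's input.
        self.latent = attn @ v
        return self.latent

agent = LatentTokenAgent()
for _ in range(3):                            # latents persist across steps
    out = agent.step(np.random.default_rng(1).normal(size=(6, 8)))
```

Because the latents live in a continuous space rather than being decoded to text or pixels at each step, they act as a compact memory that is updated, not recomputed, as the agent moves.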
Citation information will be updated after release.
@article{hao2026latentpilot,
title = {LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning},
author = {Hao, Haihong and Chen, Lei and Han, Mingfei and Li, Changlin and An, Dong and Yang, Yuqiang and Li, Zhihui and Chang, Xiaojun},
year = {2026}
}
Thank you (.❛ ᴗ ❛.)