Each scene is reconstructed online from the input stream; the rendered flythrough and point cloud correspond to the same selected sequence.
Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models typically assume a global coordinate frame. This assumption becomes a significant bottleneck for long-context and streaming reconstruction, because it forces the network to maintain an arbitrary temporal origin and to handle translation magnitudes that grow without bound over time. Our solution, R3, uses relative regression instead: a lightweight MLP predicts confidence-weighted relative constraints. These confidences serve as a unified anchor, weighting losses during training and guiding pose aggregation during inference. R3 supports both full-context offline reconstruction and causal, bounded-memory streaming. Evaluations in both offline and streaming settings validate the effectiveness of this relative mechanism.
* Measured on a single NVIDIA RTX PRO 6000.
R3 regresses pairwise camera motion instead of absolute poses, keeping the target stable as videos grow longer.
Each pair receives rotation and translation confidences, so reliable matches contribute more when assembling the trajectory.
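To make the idea of confidence-weighted pose assembly concrete, here is a minimal sketch of how per-pair predictions could be fused into one pose. It is an illustration under stated assumptions, not R3's actual aggregation rule: poses are assumed to be 4x4 camera-to-world matrices, each relative pose is assumed to express the new frame in its reference frame, rotations are fused with a weighted chordal mean projected back onto SO(3), and the names `project_to_so3` and `aggregate_pose` are hypothetical.

```python
import numpy as np


def project_to_so3(m):
    """Return the rotation matrix closest to m in the Frobenius sense."""
    u, _, vt = np.linalg.svd(m)
    # Flip the last axis if needed so the result has determinant +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt


def aggregate_pose(ref_poses, rel_poses, rot_conf, trans_conf):
    """Fuse per-pair pose candidates into a single 4x4 pose.

    ref_poses:  (N, 4, 4) assumed camera-to-world poses of reference frames.
    rel_poses:  (N, 4, 4) assumed pose of the new frame in each reference frame.
    rot_conf:   (N,) positive rotation confidences.
    trans_conf: (N,) positive translation confidences.
    """
    ref_poses = np.asarray(ref_poses, dtype=float)
    rel_poses = np.asarray(rel_poses, dtype=float)
    # Each pair proposes one absolute pose for the new frame.
    candidates = np.einsum("nij,njk->nik", ref_poses, rel_poses)

    w_r = np.asarray(rot_conf, dtype=float)
    w_t = np.asarray(trans_conf, dtype=float)
    w_r = w_r / w_r.sum()
    w_t = w_t / w_t.sum()

    # Weighted chordal mean of rotations, then projection back onto SO(3).
    rot = project_to_so3(np.einsum("n,nij->ij", w_r, candidates[:, :3, :3]))
    # Weighted arithmetic mean of the candidate translations.
    trans = w_t @ candidates[:, :3, 3]

    pose = np.eye(4)
    pose[:3, :3] = rot
    pose[:3, 3] = trans
    return pose
```

Because the rotation and translation weights are separate, a pair with a trustworthy rotation but an unreliable translation can still contribute to the part it estimates well.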
A bounded keyframe bank keeps useful past views, allowing new frames to reconnect to earlier observations without full-history processing.
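A bounded bank of this kind can be sketched as a fixed-capacity store with an eviction rule and a retrieval query. The sketch below is an assumption-laden illustration, not R3's implementation: the text above does not say how usefulness is scored or how past views are retrieved, so the `score` argument, the per-frame `descriptor` vector, the cosine-similarity lookup, and the `KeyframeBank` class itself are all hypothetical.

```python
import math


def _unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("descriptor must be non-zero")
    return [x / norm for x in vec]


class KeyframeBank:
    """Fixed-capacity keyframe store (hypothetical design)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = {}  # frame_id -> (unit descriptor, usefulness score)

    def add(self, frame_id, descriptor, score):
        """Insert a keyframe, then evict the lowest-scoring entry if over capacity.

        The new keyframe itself is evicted if it has the lowest score.
        """
        self.entries[frame_id] = (_unit(descriptor), float(score))
        if len(self.entries) > self.capacity:
            worst = min(self.entries, key=lambda fid: self.entries[fid][1])
            del self.entries[worst]

    def retrieve(self, query, k=2):
        """Return up to k frame ids ranked by cosine similarity to query."""
        q = _unit(query)

        def similarity(fid):
            desc = self.entries[fid][0]
            return sum(a * b for a, b in zip(desc, q))

        return sorted(self.entries, key=similarity, reverse=True)[:k]
```

Memory stays bounded by `capacity` regardless of sequence length, and a new frame only needs to be compared against the retained entries rather than the full history.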
The same learned confidences that drive pose assembly double as an effective outlier gate: when a new frame's mean confidence against the active context falls below a calibrated baseline, R3 suppresses its pose estimate, invalidates its KV-cache entries, and skips keyframe-bank admission. This prevents motion blur, occlusions, transient objects, and sudden scene cuts from polluting the map.
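The gating rule described above can be sketched as a single threshold test whose result drives all three downstream actions. This is a minimal illustration under assumptions: the `baseline` is taken as a given number (how it is calibrated is not specified here), and the names `GateDecision` and `gate_frame` are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class GateDecision:
    mean_confidence: float
    emit_pose: bool                # False: suppress the pose estimate
    keep_kv_cache: bool            # False: invalidate the frame's KV-cache entries
    admit_to_keyframe_bank: bool   # False: skip keyframe-bank admission


def gate_frame(pair_confidences, baseline):
    """Decide whether a new frame is an outlier.

    pair_confidences: confidences of the new frame against each frame
        in the active context.
    baseline: calibrated threshold (assumed to be supplied by the caller).
    """
    if not pair_confidences:
        raise ValueError("need at least one pairwise confidence")
    mean_conf = fmean(pair_confidences)
    # The frame is rejected only when its mean confidence falls below the baseline.
    accepted = mean_conf >= baseline
    return GateDecision(
        mean_confidence=mean_conf,
        emit_pose=accepted,
        keep_kv_cache=accepted,
        admit_to_keyframe_bank=accepted,
    )
```

Tying all three actions to one decision means a rejected frame leaves no trace in the trajectory, the cache, or the keyframe bank.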
When the camera revisits a region, R3 can place the new frame relative to retrieved keyframes from the earlier visit. This lets the trajectory re-register against existing geometry, so loops stay consistent instead of accumulating duplicated or misaligned structure.
Top row: rendered views of Ours (R3) vs. a baseline (InfiniteVGGT or TTT3R). Bottom row: synchronized point clouds.
@misc{xu2026r3,
title={$R^3$: 3D Reconstruction via Relative Regression},
author={Congrong Xu and Huachen Gao and Xingyu Chen and Yuliang Xiu and Jun Gao and Anpei Chen},
year={2026},
}