R3: 3D Reconstruction via Relative Regression

Congrong Xu¹,²     Huachen Gao²     Xingyu Chen²     Yuliang Xiu²     Jun Gao¹     Anpei Chen²
¹University of Michigan      ²Westlake University
[Interactive teaser panels: input stream, rendered flythrough, point cloud]

Each scene is reconstructed online from the input stream; the rendered flythrough and point cloud correspond to the same selected sequence.

TL;DR: R3 introduces a confidence-weighted relative pose representation that enables efficient, robust 3D reconstruction with low memory overhead in both streaming and offline settings.

Abstract

Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, because it forces the network to maintain an arbitrary temporal origin and to handle translation magnitudes that grow without bound over time. Our solution, R3, uses relative regression instead: a lightweight MLP predicts confidence-weighted relative constraints. These confidences serve as a unified anchor, weighting losses during training and guiding pose aggregation during inference. R3 supports both full-context offline reconstruction and causal, bounded-memory streaming. Evaluations in both offline and streaming settings validate the effectiveness of this relative mechanism.

372M
Parameters
~⅓ the size of recent 1B-class feed-forward baselines
40 FPS*
Streaming throughput
Bounded-memory, causal inference on long video streams
1 checkpoint
Two inference modes
The same weights support causal streaming and full-context offline reconstruction

* Measured on a single NVIDIA RTX PRO 6000.

How R3 Works

01
Predict relative poses

R3 regresses pairwise camera motion instead of absolute poses, keeping the target stable as videos grow longer.

02
Fuse with confidence

Each pair receives rotation and translation confidences, so reliable matches contribute more when assembling the trajectory.

03
Stream with memory

A bounded keyframe bank keeps useful past views, allowing new frames to reconnect to earlier observations without full-history processing.
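
To make steps 01 and 02 concrete, here is a minimal NumPy sketch of confidence-weighted pose assembly. It is an illustration under stated assumptions, not the R3 implementation: the page does not specify how candidates are aggregated, so this sketch assumes 4x4 camera-to-world matrices, a confidence-weighted mean for translation, and a weighted chordal mean (SVD projection onto SO(3)) for rotation. The names `fuse_pose` and `project_to_so3` are hypothetical.

```python
import numpy as np

def project_to_so3(m):
    """Return the rotation matrix closest (in Frobenius norm) to a 3x3 matrix."""
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))  # flip one axis if we landed on a reflection
    return u @ np.diag([1.0, 1.0, d]) @ vt

def fuse_pose(key_poses, rel_poses, rot_conf, trans_conf):
    """Fuse per-keyframe candidates into one pose for a new frame.

    key_poses:  (K, 4, 4) camera-to-world poses of the context keyframes.
    rel_poses:  (K, 4, 4) predicted pose of the new frame expressed in each
                keyframe's coordinate frame (the relative regression target).
    rot_conf, trans_conf: (K,) non-negative confidences, not all zero.
    Assumption: the aggregation rule here is illustrative only.
    """
    rot_conf = np.asarray(rot_conf, dtype=float)
    trans_conf = np.asarray(trans_conf, dtype=float)
    # Chain each relative prediction onto its keyframe to get K world-frame candidates.
    candidates = key_poses @ rel_poses
    w_r = rot_conf / rot_conf.sum()
    w_t = trans_conf / trans_conf.sum()
    # Weighted chordal mean of rotations, then weighted mean of translations.
    rotation = project_to_so3(np.einsum("k,kij->ij", w_r, candidates[:, :3, :3]))
    translation = np.einsum("k,ki->i", w_t, candidates[:, :3, 3])
    fused = np.eye(4)
    fused[:3, :3] = rotation
    fused[:3, 3] = translation
    return fused
```

Because each prediction is expressed relative to a nearby keyframe, the regression target stays small even when the keyframe's own world translation is large. A pair with near-zero confidence contributes almost nothing to the fused pose, which is the behavior step 02 describes.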

Robustness via Confidence Gating

The same learned confidences that drive pose assembly double as an effective outlier gate: when a new frame's mean confidence against the active context falls below a calibrated baseline, R3 suppresses its pose estimate, invalidates its KV-cache entries, and skips keyframe-bank admission. This prevents motion blur, occlusions, transient objects, and sudden scene cuts from polluting the map.
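
The gating rule above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: `StreamState`, `gate_frame`, and the oldest-first eviction of the bank are assumptions made for the example, and the KV cache is reduced to a per-frame validity flag. The page does not say how the baseline is calibrated or how the bank chooses which views to keep.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class StreamState:
    # Bounded keyframe bank; deque(maxlen=N) drops the oldest entry when full
    # (assumed eviction policy, the real bank may rank views differently).
    bank: deque
    poses: dict = field(default_factory=dict)     # frame_id -> pose or None
    kv_valid: dict = field(default_factory=dict)  # frame_id -> KV entries usable?

def gate_frame(state, frame_id, pose, pair_conf, baseline):
    """Accept or reject a new frame from its confidences against the active context.

    pair_conf: non-empty list of confidences, one per context pair.
    baseline:  calibrated threshold (how it is calibrated is not covered here).
    """
    mean_conf = sum(pair_conf) / len(pair_conf)
    accepted = mean_conf >= baseline
    if accepted:
        state.poses[frame_id] = pose
        state.kv_valid[frame_id] = True
        state.bank.append(frame_id)       # admit to the keyframe bank
    else:
        state.poses[frame_id] = None      # suppress the pose estimate
        state.kv_valid[frame_id] = False  # invalidate its KV-cache entries
        # no bank admission for gated frames
    return accepted
```

For example, with `StreamState(bank=deque(maxlen=2))` and a baseline of 0.5, a blurred frame whose pair confidences average 0.25 is rejected and leaves the bank untouched, while later accepted frames are admitted and push out the oldest keyframe once the bank is full.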

Interactive Dynamic Examples

Pick a scene below to explore in 3D. Press Space to play or pause; click and drag to change the viewpoint.

[Demo requires browser with WebGL2 support.]
Scenes: snowboard, drift-straight, motocross-bumps, drift-turn, longboard
ⓘ  All examples are streaming reconstructions. Scene geometry is downsampled for faster loading. Firefox may not properly render point clouds.

Qualitative Comparison

When the camera revisits a region, R3 can place the new frame relative to retrieved keyframes from the earlier visit. This lets the trajectory re-register against existing geometry, so loops stay consistent instead of accumulating duplicated or misaligned structure.

Top row: rendered views of Ours (R3) vs. a baseline (toggle between InfiniteVGGT and TTT3R). Bottom row: synchronized point clouds. Pick a scene from the strip below.


BibTeX

@misc{xu2026r3,
      title={$R^3$: 3D Reconstruction via Relative Regression},
      author={Congrong Xu and Huachen Gao and Xingyu Chen and Yuliang Xiu and Jun Gao and Anpei Chen},
      year={2026},
}