R³: 3D Reconstruction via Relative Regression

Congrong Xu^1,2 Huachen Gao² Xingyu Chen² Yuliang Xiu² Jun Gao^1,3 Anpei Chen²

¹University of Michigan ²Westlake University ³NVIDIA

[Interactive viewer panels: Input stream, Rendered flythrough, Point cloud]

Each scene is reconstructed online from the input stream; the rendered flythrough and point cloud correspond to the same selected sequence.

TL;DR: R³ introduces a confidence-weighted relative pose representation that enables efficient, robust 3D reconstruction with low memory overhead in both streaming and offline settings.

Abstract

Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow without bound over time. Our solution, R³, regresses relative poses instead: a lightweight MLP predicts confidence-weighted relative constraints. These confidences serve as a unified anchor, weighting losses during training and guiding pose aggregation during inference. R³ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism.

372M

Parameters

~⅓ the size of recent 1B-class feed-forward baselines

40 FPS*

Streaming throughput

Bounded-memory, causal inference on long video streams

1 checkpoint

Two inference modes

The same weights support causal streaming and full-context offline reconstruction

* Measured on a single NVIDIA RTX PRO 6000.

How R³ Works

Predict relative poses

R³ regresses pairwise camera motion instead of absolute poses, keeping the target stable as videos grow longer.
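To illustrate why a relative target stays stable, the sketch below compares absolute and pairwise translation magnitudes on a synthetic trajectory. It is a toy example, not the R³ network: the camera-to-world convention, the helper names `make_pose` and `relative_pose`, and the straight-line motion are all assumptions made for illustration.

```python
import numpy as np

def make_pose(yaw, t):
    """Build a 4x4 camera-to-world pose from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """Pose of camera j expressed in the frame of camera i: inv(T_i) @ T_j."""
    return np.linalg.inv(T_i) @ T_j

# Hypothetical trajectory: the camera moves 0.5 units per frame along x for 1000 frames.
poses = [make_pose(0.0, [0.5 * k, 0.0, 0.0]) for k in range(1000)]

# Absolute translations are measured from the first frame (the arbitrary origin).
abs_norms = [np.linalg.norm(T[:3, 3]) for T in poses]

# Relative translations are measured between consecutive frames.
rel_norms = [
    np.linalg.norm(relative_pose(poses[k], poses[k + 1])[:3, 3])
    for k in range(len(poses) - 1)
]

max_abs = max(abs_norms)  # grows with the length of the sequence
max_rel = max(rel_norms)  # stays at the per-frame step size
```

In this toy setup the largest absolute translation scales with the number of frames, while the largest pairwise translation never exceeds the per-frame step, which is the property the relative formulation relies on.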

Fuse with confidence

Each pair receives rotation and translation confidences, so reliable matches contribute more when assembling the trajectory.
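One plausible way to assemble a pose from confidence-weighted pairs is sketched below. Each anchor frame with a known pose proposes a candidate pose for the new frame; translations are fused by a weighted mean and rotations by a weighted chordal mean projected back onto SO(3). The exact aggregation used by R³ is not described on this page, so the function names (`se3`, `project_to_so3`, `fuse_pose`), the averaging scheme, and the sample values are assumptions.

```python
import numpy as np

def se3(R, t):
    """Pack a 3x3 rotation and a translation into a 4x4 pose."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def project_to_so3(M):
    """Nearest rotation matrix to M in the Frobenius sense, via SVD."""
    U, _, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    return U @ np.diag([1.0, 1.0, d]) @ Vt

def fuse_pose(anchor_poses, relative_poses, rot_conf, trans_conf):
    """Fuse per-anchor candidates for the new frame, weighted by confidence."""
    # Each anchor proposes: (anchor pose) composed with (predicted relative pose).
    candidates = np.stack([A @ Rel for A, Rel in zip(anchor_poses, relative_poses)])
    w_r = np.asarray(rot_conf, dtype=float)
    w_t = np.asarray(trans_conf, dtype=float)
    w_r = w_r / w_r.sum()
    w_t = w_t / w_t.sum()
    # Weighted chordal mean of rotations, then projection back to a valid rotation.
    M = np.einsum("i,ijk->jk", w_r, candidates[:, :3, :3])
    # Weighted arithmetic mean of translations.
    t = np.einsum("i,ij->j", w_t, candidates[:, :3, 3])
    return se3(project_to_so3(M), t)

# Two consistent anchors and one low-confidence outlier (hypothetical values).
I3 = np.eye(3)
yaw90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
anchors = [se3(I3, [0.0, 0.0, 0.0]), se3(I3, [0.5, 0.0, 0.0]), se3(I3, [0.0, 0.0, 0.0])]
relatives = [se3(I3, [1.0, 0.0, 0.0]), se3(I3, [0.5, 0.0, 0.0]), se3(yaw90, [5.0, 0.0, 0.0])]

fused = fuse_pose(anchors, relatives, rot_conf=[1.0, 1.0, 0.01], trans_conf=[1.0, 1.0, 0.01])

# Rotation angle of the fused pose, in radians.
angle = np.arccos(np.clip((np.trace(fused[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
```

Keeping separate rotation and translation weights mirrors the two confidences described above: a pair can be trusted for orientation while contributing little to position, or the reverse.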

Stream with memory

A bounded keyframe bank keeps useful past views, allowing new frames to reconnect to earlier observations without full-history processing.
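A bounded keyframe bank can be sketched as a fixed-capacity store with similarity-based retrieval. The admission and eviction rules here (skip near-duplicates, drop the oldest entry when full) and the use of cosine similarity over per-frame descriptors are placeholder assumptions; the page does not specify how R³ selects or retires keyframes, and `KeyframeBank` is a hypothetical name.

```python
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

class KeyframeBank:
    """Fixed-capacity store of (frame_id, descriptor) pairs."""

    def __init__(self, capacity, novelty_threshold=0.9):
        self.novelty_threshold = novelty_threshold
        # deque(maxlen=...) drops the oldest entry once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def retrieve(self, descriptor, k=2):
        """Return the ids of the k stored keyframes most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(descriptor, e[1]), reverse=True)
        return [frame_id for frame_id, _ in ranked[:k]]

    def admit(self, frame_id, descriptor):
        """Store the frame unless it is a near-duplicate of a stored keyframe."""
        if any(cosine(descriptor, d) > self.novelty_threshold for _, d in self.entries):
            return False
        self.entries.append((frame_id, descriptor))
        return True

bank = KeyframeBank(capacity=3)
bank.admit(0, [1.0, 0.0, 0.0])
skipped = not bank.admit(1, [0.99, 0.1, 0.0])  # too similar to frame 0
bank.admit(2, [0.0, 1.0, 0.0])
bank.admit(3, [0.0, 0.0, 1.0])

# A later frame revisits the region seen in frame 0 and retrieves it.
revisit_matches = bank.retrieve([0.95, 0.05, 0.0], k=1)

# Admitting one more novel frame keeps the bank at its capacity.
bank.admit(5, [1.0, 1.0, 1.0])
bank_size = len(bank.entries)
```

Memory stays bounded by the capacity no matter how long the stream runs, and retrieval only scans the bank rather than the full frame history.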

Robustness via Confidence Gating

The same learned confidences that drive pose assembly double as an effective outlier gate: when a new frame's mean confidence against the active context falls below a calibrated baseline, R³ suppresses its pose estimate, invalidates its KV-cache entries, and skips keyframe-bank admission. This prevents motion blur, occlusions, transient objects, and sudden scene cuts from polluting the map.
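The gating rule can be sketched as a single check per incoming frame. In this minimal example the baseline is passed in as a fixed number, standing in for whatever calibration R³ actually uses; `StreamState`, `process_frame`, and the placeholder pose and cache values are hypothetical names chosen for illustration.

```python
from statistics import mean

class StreamState:
    """Minimal stand-in for the streaming state touched by the gate."""

    def __init__(self, baseline):
        self.baseline = baseline  # assumed to be calibrated ahead of time
        self.poses = {}           # frame_id -> pose, or None if suppressed
        self.kv_cache = {}        # frame_id -> cached entry (placeholder)
        self.keyframes = []       # ids admitted to the keyframe bank

def process_frame(state, frame_id, pose, pair_confidences, kv_entry):
    """Apply the confidence gate to one frame; return True if it is accepted."""
    # The forward pass is assumed to have written this frame's cache entry.
    state.kv_cache[frame_id] = kv_entry
    if not pair_confidences or mean(pair_confidences) < state.baseline:
        state.poses[frame_id] = None        # suppress the pose estimate
        state.kv_cache.pop(frame_id, None)  # invalidate its KV-cache entry
        return False                        # skip keyframe-bank admission
    state.poses[frame_id] = pose
    state.keyframes.append(frame_id)
    return True

state = StreamState(baseline=0.5)
accepted = [
    process_frame(state, 0, "pose_0", [0.9, 0.8], "kv_0"),
    process_frame(state, 1, "pose_1", [0.2, 0.3, 0.1], "kv_1"),  # e.g. a blurred frame
    process_frame(state, 2, "pose_2", [0.7, 0.6], "kv_2"),
]
```

Because a rejected frame leaves no pose, cache entry, or keyframe behind, later frames are matched only against context that passed the gate.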

Interactive Dynamic Examples

Pick a scene below to explore in 3D. Press Space to play or pause; click and drag to change the viewpoint.


ⓘ All examples are streaming reconstructions. Scene geometry is downsampled for faster loading. Firefox may not properly render point clouds.

Qualitative Comparison

When the camera revisits a region, R³ can place the new frame relative to retrieved keyframes from the earlier visit. This lets the trajectory re-register against existing geometry, so loops stay consistent instead of accumulating duplicated or misaligned structure.

Top row: rendered views of Ours (R³) vs. a baseline (toggle between InfiniteVGGT and TTT3R). Bottom row: synchronized point clouds. Pick a scene from the strip below.

[Comparison viewer panels: Ours (R³) and InfiniteVGGT, each with a rendered view and a point cloud]

BibTeX

@misc{xu2026r33dreconstructionrelative,
      title={$R^3$: 3D Reconstruction via Relative Regression}, 
      author={Congrong Xu and Huachen Gao and Xingyu Chen and Yuliang Xiu and Jun Gao and Anpei Chen},
      year={2026},
      eprint={2605.26519},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.26519}, 
}
