Accelerating Transformer-Based
Monocular SLAM via Geometric Utility Scoring

Xinmiao Xiong1, Bangya Liu1, Hao Wang2, Dayou Li2, Nuo Chen2, Andrew Feng3, Mingyu Ding4, Suman Banerjee1, Yang Zhou2, Zhiwen Fan2

1Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA

2Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA

3Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

4Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Abstract

Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post-hoc keyframe selection: they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, which results in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value before the heavy GFM feature-extraction and matching stages. Serving as a predictive, plug-and-play module, it bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5× end-to-end throughput speedup, while maintaining the tracking and mapping accuracy of dense baselines.

LeanGate teaser figure.

Background

3D Geometric Foundation Models (GFMs) have recently emerged as robust, data-driven alternatives for visual perception, offering highly stable front-ends for tracking and mapping in systems such as MASt3R-SfM and MASt3R-SLAM. Yet current GFM-based SLAM systems process dense temporal streams, incurring heavy encoding and decoding costs on nearly every frame; in MASt3R-SLAM, for instance, dense feature extraction accounts for over 50% of the runtime on a 15 FPS stream. This exposes a critical system bottleneck: keyframe selection relies on post-hoc evaluation, so the system must execute the computationally expensive dense geometric decoding process simply to determine whether a frame actually contains novel geometry. This architectural flow leads to late rejection and wasted compute on highly redundant frames.

Time breakdown across the SLAM pipeline.
Trajectory comparison results.

Approach

Our approach moves keyframe selection from an expensive post-hoc decision to a fast feed-forward prediction. Instead of sending every incoming frame through the full dense geometry pipeline, we learn to estimate its geometric utility in advance using a lightweight student model distilled from MASt3R-SLAM. The score reflects how much useful geometric information a frame contributes relative to the latest keyframe, capturing both reliable matching and scene coverage. By predicting this utility before reconstruction, our system can filter out low-value frames early and reserve heavy computation for frames that matter most.
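The predictive gating described above can be sketched as a simple stream filter. This is a minimal illustration with hypothetical function names (`utility_fn`, `backend_fn`), not the paper's actual API: a cheap utility score is computed against the latest keyframe, and only frames above a threshold reach the expensive dense-geometry backend.

```python
# Minimal sketch of predictive frame gating (illustrative names, not the
# paper's actual interface). Low-utility frames are dropped before any
# dense geometric decoding happens.

def gate_stream(frames, utility_fn, backend_fn, threshold=0.5):
    """Yield (frame_idx, backend_result) for frames that pass the gate."""
    keyframe = None
    for idx, frame in enumerate(frames):
        if keyframe is None:
            keyframe = frame                 # first frame is always a keyframe
            yield idx, backend_fn(frame)
            continue
        score = utility_fn(frame, keyframe)  # lightweight forward pass
        if score >= threshold:
            keyframe = frame                 # promote to new keyframe
            yield idx, backend_fn(frame)     # heavy GFM decoding only here
        # Frames below the threshold are discarded early, before decoding.
```

With a toy utility such as `lambda f, k: abs(f - k)` on a scalar stream, a mostly-static sequence passes only a handful of frames, mirroring how the gate reserves heavy computation for frames that add new geometry.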

Built on top of FLARE's camera-aware decoder tokens, our model leverages an internal geometric representation of image pairs and refines the predicted utility through an iterative overlap head. Training is driven by high-quality pseudo-labels generated from ScanNet++ with the original MASt3R-SLAM scoring rule, allowing the student to inherit strong geometric judgment without reproducing the full teacher pipeline. At inference time, the model scores each frame in a single forward pass and decides whether it should enter the SLAM backend, enabling a slimmer and more efficient system that cuts unnecessary computation and energy overhead while maintaining strong tracking performance.
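To make the iterative-refinement idea concrete, the following sketch shows one common pattern for refining a scalar score over several steps. The shapes, weights, and update rule here are hypothetical; the paper's actual overlap head is a learned network operating on FLARE decoder tokens.

```python
import numpy as np

# Illustrative sketch of iterative score refinement (hypothetical shapes
# and update rule, not the paper's learned head). Starting from a coarse
# estimate, each step re-reads the pooled pair embedding and updates the
# score in log-odds space, keeping it in (0, 1) via a sigmoid.

def refine_score(pair_feat, w, b, init_score=0.5, n_iters=3):
    """pair_feat: (D,) pooled pair embedding; w: (D,) weights; b: bias."""
    score = init_score
    for _ in range(n_iters):
        score = min(max(score, 1e-6), 1.0 - 1e-6)   # keep log-odds finite
        logit = pair_feat @ w + b + np.log(score / (1.0 - score))
        score = 1.0 / (1.0 + np.exp(-logit))        # squash back to (0, 1)
    return score
```

Conditioning each update on the current estimate lets successive iterations push the score toward a confident value when the pair evidence is strong, while a neutral embedding leaves the initial estimate unchanged.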

LeanGate system overview.

SLAM Result

Core results across datasets comparing trajectory accuracy and computational efficiency. For DROID-SLAM, we report both single-GPU and parallel-mode profiling on the same device using official settings.

Dataset     Model        Downsample  ATE [cm] ↓  Time [s] ↓     TFLOPs ↓ (Frame Select / SLAM / Total)
TUM RGB-D   DPV-SLAM     ---         7.6         43.79          --- / --- / ---
            DROID-SLAM   ---         3.8         41.78/39.56    --- / --- / ---
            MASt3R-SLAM  ---         3.00        74.95          --- / 6698.55 / 6698.55
            LeanGate     15.58×      2.56        18.18          532.05 / 461.41 / 993.46
EuRoC MAV   DPV-SLAM     ---         2.4         122.15         --- / --- / ---
            DROID-SLAM   ---         2.2         127.45/103.68  --- / --- / ---
            MASt3R-SLAM  ---         4.09        189.50         --- / 20835.09 / 20835.09
            LeanGate     18.60×      4.90        44.63          1592.16 / 1206.14 / 2798.31
7-Scenes    DPV-SLAM     ---         5.4         38.44          --- / --- / ---
            DROID-SLAM   ---         4.9         41.29/40.75    --- / --- / ---
            MASt3R-SLAM  ---         4.71        66.46          --- / 4978.32 / 4978.32
            LeanGate     32.26×      4.61        12.64          318.75 / 174.41 / 493.16

Reconstruction Result

Reconstruction quality under different downsampling strategies on TUM RGB-D, EuRoC MAV, and 7-Scenes. We compare LeanGate against uniform striding; metrics include Completion (Comp), Chamfer distance, and F-score at 2 cm and 5 cm. Δ% denotes change relative to the 1× baseline; LeanGate preserves quality with 16×-32× fewer frames, approaching the 2× stride baseline.

Dataset     Method                   Downsample  Comp ↓ (Δ% ↑)   Chamfer ↓ (Δ% ↑)  F@2cm ↑ (Δ% ↑)  F@5cm ↑ (Δ% ↑)
TUM RGB-D   DROID-SLAM               ---         0.631           0.355             0.076           0.184
            MASt3R-SLAM (All)        1×          0.107           0.143             0.204           0.425
            MASt3R-SLAM (Stride 2)   2×          0.129 (-20.6)   0.145 (-1.4)      0.212 (+3.9)    0.433 (+1.9)
            MASt3R-SLAM (Stride 15)  15×         0.545 (-409.3)  0.393 (-174.8)    0.188 (-7.8)    0.368 (-13.4)
            LeanGate                 16×         0.160 (-49.5)   0.149 (-4.2)      0.202 (-1.0)    0.422 (-0.7)
EuRoC MAV   DROID-SLAM               ---         1.257           0.705             0.023           0.133
            MASt3R-SLAM (All)        1×          0.271           0.274             0.030           0.221
            MASt3R-SLAM (Stride 2)   2×          0.272 (-0.4)    0.272 (+0.7)      0.031 (+3.3)    0.224 (+1.4)
            MASt3R-SLAM (Stride 15)  15×         0.370 (-36.5)   0.316 (-15.3)     0.030 (+0.0)    0.206 (-6.8)
            LeanGate                 18×         0.348 (-28.4)   0.298 (-8.8)      0.031 (+3.3)    0.219 (-0.9)
7-Scenes    DROID-SLAM               ---         0.401           0.237             0.170           0.385
            MASt3R-SLAM (All)        1×          0.149           0.144             0.263           0.476
            MASt3R-SLAM (Stride 2)   2×          0.150 (-0.7)    0.143 (+0.7)      0.272 (+3.4)    0.483 (+1.5)
            MASt3R-SLAM (Stride 15)  15×         0.157 (-5.4)    0.143 (+0.7)      0.257 (-2.3)    0.470 (-1.3)
            LeanGate                 32×         0.140 (+6.0)    0.141 (+2.1)      0.264 (+0.4)    0.493 (+3.6)

Qualitative Results

Qualitative 3D trajectory comparisons on TUM RGB-D (fr1-teddy, fr1-room), 7-Scenes (heads-seq01), and EuRoC (MH01 easy, MH03 medium, MH04 difficult). Black denotes ground truth; orange shows Sim(3)-aligned estimates from LeanGate-filtered RGB streams; blue shows Sim(3)-aligned stride-15 results at an equivalent or smaller downsampling rate, illustrating our method's tracking consistency under aggressive frame pruning.

Qualitative 3D trajectory comparisons across TUM RGB-D, 7-Scenes, and EuRoC.

Acknowledgement. We adapt this webpage template from DreamBooth.