Accelerating Transformer-Based
Monocular SLAM via Geometric Utility Scoring

Xinmiao Xiong1, Bangya Liu1, Hao Wang2, Dayou Li2, Nuo Chen2, Andrew Feng3, Mingyu Ding4, Suman Banerjee1, Yang Zhou2, Zhiwen Fan2

1Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA

2Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA

3Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

4Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Abstract

Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post-hoc keyframe selection: they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, which results in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value before the heavy GFM feature-extraction and matching stages. Serving as a predictive, plug-and-play module, it bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5× end-to-end throughput speedup, while maintaining the tracking and mapping accuracy of dense baselines.

LeanGate teaser figure.

Background

3D Geometric Foundation Models (GFMs) have recently emerged as robust, data-driven alternatives for visual perception, offering highly stable front-ends for tracking and mapping in systems such as MASt3R-SfM and MASt3R-SLAM. Yet current GFM-based SLAM systems process dense temporal streams, incurring heavy encoding and decoding costs on nearly every frame; in MASt3R-SLAM, for instance, dense feature extraction accounts for over 50% of the runtime on a 15 FPS stream. This exposes a critical system bottleneck: keyframe selection relies on post-hoc evaluation, so the system must execute the computationally expensive dense geometric decoding process simply to determine whether a frame actually contains novel geometry. This architectural flow leads to late rejection and wasted compute on highly redundant frames.

Time breakdown across the SLAM pipeline.
Trajectory comparison results.

Approach

Our approach moves keyframe selection from an expensive post-hoc decision to a fast feed-forward prediction. Instead of sending every incoming frame through the full dense geometry pipeline, we learn to estimate its geometric utility in advance using a lightweight student model distilled from MASt3R-SLAM. The score reflects how much useful geometric information a frame contributes relative to the latest keyframe, capturing both reliable matching and scene coverage. By predicting this utility before reconstruction, our system can filter out low-value frames early and reserve heavy computation for frames that matter most.
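The predictive gating described above can be sketched as a simple stream filter. This is a minimal illustration with hypothetical function names (`utility_fn`, `backend_fn`), not the paper's actual API: a cheap utility score is computed against the latest keyframe, and only frames above a threshold reach the expensive dense-geometry backend.

```python
# Minimal sketch of predictive frame gating (illustrative names, not the
# paper's actual interface). Low-utility frames are dropped before any
# dense geometric decoding happens.

def gate_stream(frames, utility_fn, backend_fn, threshold=0.5):
    """Yield (frame_idx, backend_result) for frames that pass the gate."""
    keyframe = None
    for idx, frame in enumerate(frames):
        if keyframe is None:
            keyframe = frame                 # first frame is always a keyframe
            yield idx, backend_fn(frame)
            continue
        score = utility_fn(frame, keyframe)  # lightweight forward pass
        if score >= threshold:
            keyframe = frame                 # promote to new keyframe
            yield idx, backend_fn(frame)     # heavy GFM decoding only here
        # Frames below the threshold are discarded early, before decoding.
```

With a toy utility such as `lambda f, k: abs(f - k)` on a scalar stream, a mostly-static sequence passes only a handful of frames, mirroring how the gate reserves heavy computation for frames that add new geometry.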

Built on top of FLARE's camera-aware decoder tokens, our model leverages an internal geometric representation of image pairs and refines the predicted utility through an iterative overlap head. Training is driven by high-quality pseudo-labels generated from ScanNet++ with the original MASt3R-SLAM scoring rule, allowing the student to inherit strong geometric judgment without reproducing the full teacher pipeline. At inference time, the model scores each frame in a single forward pass and decides whether it should enter the SLAM backend, enabling a slimmer and more efficient system that cuts unnecessary computation and energy overhead while maintaining strong tracking performance.
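To make the iterative-refinement idea concrete, the following sketch shows one common pattern for refining a scalar score over several steps. The shapes, weights, and update rule here are hypothetical; the paper's actual overlap head is a learned network operating on FLARE decoder tokens.

```python
import numpy as np

# Illustrative sketch of iterative score refinement (hypothetical shapes
# and update rule, not the paper's learned head). Starting from a coarse
# estimate, each step re-reads the pooled pair embedding and updates the
# score in log-odds space, keeping it in (0, 1) via a sigmoid.

def refine_score(pair_feat, w, b, init_score=0.5, n_iters=3):
    """pair_feat: (D,) pooled pair embedding; w: (D,) weights; b: bias."""
    score = init_score
    for _ in range(n_iters):
        score = min(max(score, 1e-6), 1.0 - 1e-6)   # keep log-odds finite
        logit = pair_feat @ w + b + np.log(score / (1.0 - score))
        score = 1.0 / (1.0 + np.exp(-logit))        # squash back to (0, 1)
    return score
```

Conditioning each update on the current estimate lets successive iterations push the score toward a confident value when the pair evidence is strong, while a neutral embedding leaves the initial estimate unchanged.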

LeanGate system overview.

SLAM Result

Core results across datasets comparing trajectory accuracy and computational efficiency. For DROID-SLAM, we report both single-GPU and parallel-mode profiling on the same device using official settings.

Dataset     Model        Downsample  ATE [cm] ↓  Time [s] ↓     TFLOPs ↓ (Frame Select / SLAM / Total)
TUM RGB-D   DPV-SLAM     ---         7.6         43.79          --- / --- / ---
            DROID-SLAM   ---         3.8         41.78/39.56    --- / --- / ---
            MASt3R-SLAM  ---         3.00        74.95          --- / 6698.55 / 6698.55
            LeanGate     15.58×      2.56        18.18          532.05 / 461.41 / 993.46
EuRoC MAV   DPV-SLAM     ---         2.4         122.15         --- / --- / ---
            DROID-SLAM   ---         2.2         127.45/103.68  --- / --- / ---
            MASt3R-SLAM  ---         4.09        189.50         --- / 20835.09 / 20835.09
            LeanGate     18.60×      4.90        44.63          1592.16 / 1206.14 / 2798.31
7-Scenes    DPV-SLAM     ---         5.4         38.44          --- / --- / ---
            DROID-SLAM   ---         4.9         41.29/40.75    --- / --- / ---
            MASt3R-SLAM  ---         4.71        66.46          --- / 4978.32 / 4978.32
            LeanGate     32.26×      4.61        12.64          318.75 / 174.41 / 493.16

Reconstruction Result

Reconstruction quality under different downsampling strategies on TUM RGB-D, EuRoC MAV, and 7-Scenes. We compare LeanGate against uniform striding; metrics include Completion (Comp), Chamfer distance, and F-score at 2 cm and 5 cm. Δ% denotes change relative to the 1× baseline; LeanGate preserves quality with 16×-32× fewer frames, approaching the 2× stride baseline.

Dataset     Method                   Downsample  Comp ↓ (Δ% ↑)   Chamfer ↓ (Δ% ↑)  F@2cm ↑ (Δ% ↑)  F@5cm ↑ (Δ% ↑)
TUM RGB-D   DROID-SLAM               ---         0.631           0.355             0.076           0.184
            MASt3R-SLAM (All)        1×          0.107           0.143             0.204           0.425
            MASt3R-SLAM (Stride 2)   2×          0.129 (-20.6)   0.145 (-1.4)      0.212 (+3.9)    0.433 (+1.9)
            MASt3R-SLAM (Stride 15)  15×         0.545 (-409.3)  0.393 (-174.8)    0.188 (-7.8)    0.368 (-13.4)
            LeanGate                 16×         0.160 (-49.5)   0.149 (-4.2)      0.202 (-1.0)    0.422 (-0.7)
EuRoC MAV   DROID-SLAM               ---         1.257           0.705             0.023           0.133
            MASt3R-SLAM (All)        1×          0.271           0.274             0.030           0.221
            MASt3R-SLAM (Stride 2)   2×          0.272 (-0.4)    0.272 (+0.7)      0.031 (+3.3)    0.224 (+1.4)
            MASt3R-SLAM (Stride 15)  15×         0.370 (-36.5)   0.316 (-15.3)     0.030 (+0.0)    0.206 (-6.8)
            LeanGate                 18×         0.348 (-28.4)   0.298 (-8.8)      0.031 (+3.3)    0.219 (-0.9)
7-Scenes    DROID-SLAM               ---         0.401           0.237             0.170           0.385
            MASt3R-SLAM (All)        1×          0.149           0.144             0.263           0.476
            MASt3R-SLAM (Stride 2)   2×          0.150 (-0.7)    0.143 (+0.7)      0.272 (+3.4)    0.483 (+1.5)
            MASt3R-SLAM (Stride 15)  15×         0.157 (-5.4)    0.143 (+0.7)      0.257 (-2.3)    0.470 (-1.3)
            LeanGate                 32×         0.140 (+6.0)    0.141 (+2.1)      0.264 (+0.4)    0.493 (+3.6)

Qualitative Results

Qualitative 3D trajectory comparisons on TUM RGB-D (fr1-teddy, fr1-room), 7-Scenes (heads-seq01), and EuRoC (MH01 easy, MH03 medium, MH04 difficult). Black denotes ground truth; orange shows Sim(3)-aligned estimates from LeanGate-filtered RGB streams; blue shows Sim(3)-aligned stride-15 results at an equivalent or smaller downsampling rate, illustrating our method's tracking consistency under aggressive frame pruning.

Qualitative 3D trajectory comparisons across TUM RGB-D, 7-Scenes, and EuRoC.

Acknowledgement. We adapt this webpage template from DreamBooth.