Take a photo of a coffee mug on your desk. Now look at that photo. Can you tell whether the mug is 30 centimeters away or 3 meters away? You probably can, because you recognize what a mug looks like and roughly how big they are. But a computer just sees a grid of pixel values. It has no concept of physical distance. A toy car 1 meter away and a real car 100 meters away can produce nearly identical images.
This is the fundamental challenge for any robot that needs to navigate using a camera. It needs to know not just what things are, but how far away they are. The obvious solution is LiDAR, a sensor that fires laser beams and measures how long they take to bounce back. LiDAR gives you precise distances to everything in the scene. It also costs anywhere from $5,000 to $50,000, draws significant power, and adds bulk. A camera costs $30.
A recent open-source project demonstrated that a single camera can match LiDAR navigation performance in most off-road scenarios. The trick is combining two complementary systems: an AI model that estimates depth from images, and a tracking algorithm that provides real-world scale. Neither works alone. Together, they give you dense, metric depth from a single lens.
The Problem
A robot driving through a forest needs to answer three questions on every frame of video, roughly 10 to 30 times per second:
1. What is around me? Trees, rocks, paths, grass
2. How far away is each thing? That tree is 3 meters away, that rock is 7 meters away
3. Where can I drive? The ground ahead is flat and obstacle-free for 5 meters, then there is a rock to the left
Question 1 is a perception problem. Questions 2 and 3 are geometry problems. You need 3D information about the world, and a 2D image does not contain it directly.
LiDAR solves this by brute force. A LiDAR sensor spins a laser around, measures the return time for each beam, and produces a "point cloud": a set of 3D coordinates (x, y, z) for every surface the laser hits. A high-end unit produces hundreds of thousands of points per scan. The robot gets an accurate 3D model of its surroundings every few milliseconds.
A single camera gives you none of this. The image is a 2D projection of a 3D world. The projection destroys depth information. Every pixel tells you what color something is and what direction it is in, but not how far away it is.
This is called scale ambiguity. Think of it like a function that is not invertible:
project(3D world) → 2D image // this works
invert(2D image) → 3D world // this is impossible without extra information
Multiple 3D scenes can produce the exact same 2D image. A hallway that is 2 meters wide and 10 meters long looks identical to a hallway that is 4 meters wide and 20 meters long if you scale everything proportionally. The projection is the same.
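This scale invariance drops straight out of the pinhole camera model. A minimal sketch (the 500-pixel focal length is an arbitrary illustrative value):

```python
def project(x, y, z, focal_px=500.0):
    """Pinhole projection: a 3D point (meters) to image coordinates (pixels)."""
    return (focal_px * x / z, focal_px * y / z)

# A point on the hallway wall, and the same scene scaled 2x:
assert project(1.0, 0.5, 10.0) == project(2.0, 1.0, 20.0)  # identical pixels
```

Scale every coordinate by the same factor and the ratios x/z and y/z, and therefore the image, never change.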
So how do you recover depth from a single camera? You need two ingredients, and the key insight is that each one has exactly what the other is missing.
Prerequisites
- Basic programming familiarity (variables, functions, arrays). No computer vision experience needed.
- A rough mental model of what a 2D image is: a grid of pixels, each with color values
- Comfort with simple arithmetic (division, averages)
That is it. Everything else is explained from scratch.
Technical Decisions
Ingredient 1: AI Depth Estimation (Dense but Scaleless)
Close one eye and look around the room. You can still judge depth reasonably well, even without stereo vision. How? Your brain uses cues it has learned from a lifetime of visual experience:
- Relative size: A person who appears small in your field of view is probably far away
- Occlusion: If object A blocks part of object B, A is in front
- Texture gradient: The floor tiles get denser as they recede into the distance
- Atmospheric haze: Distant mountains look hazier and bluer than nearby trees
- Perspective lines: Parallel lines converge toward a vanishing point
An AI model called Depth Anything V2 has learned these same cues by training on millions of images paired with depth information. You feed it any image, and it outputs a depth map: a grayscale image where every pixel's brightness represents its estimated distance. Bright = far, dark = close (or vice versa depending on the convention).
Input: A photo of a forest path
Output: A map where every pixel has a depth VALUE
Nearby tree trunk → depth value: 0.15
Ground 5m ahead → depth value: 0.42
Distant treeline → depth value: 0.91
The catch: these values are relative, not metric. The model tells you "this pixel is about 3 times farther than that pixel," but it does not tell you "this pixel is 4.7 meters away." The values have no physical unit. They are like percentages without knowing what 100% represents.
This is useful but not sufficient. A robot needs to know "that rock is 3 meters ahead, I should turn." Not "that rock has depth value 0.35."
Think of it as a function that returns a sorted array of distances but without the actual numbers:
# What Depth Anything V2 gives you:
relative_depths = [0.12, 0.35, 0.42, 0.67, 0.91]
# You know the ORDER: item 0 is closest, item 4 is farthest
# You know the RATIOS: item 4 is ~7.5x farther than item 0
# You do NOT know: item 0 is 1.2 meters away
# What you need for navigation:
metric_depths = [1.2, 3.5, 4.2, 6.7, 9.1] # actual meters
The model gives you a dense depth map (every single pixel gets a value) but with an unknown scale factor. Now you need something that can provide that scale factor.
Ingredient 2: Visual SLAM (Sparse but Metric)
SLAM stands for Simultaneous Localization and Mapping. It solves a chicken-and-egg problem: to know where you are, you need a map. To build a map, you need to know where you are.
If you are a software developer, here is an analogy. Imagine you are building a version control system, but instead of tracking lines of code, you are tracking visual features across video frames.
A "visual feature" is a distinctive pattern in the image that is easy to find again in the next frame. Think: the corner of a window, the tip of a rock, a high-contrast spot on bark. These are like unique identifiers that the algorithm can grep for across frames.
Here is what SLAM does, frame by frame:
Frame 1: Detect 500 features (corners, edges, distinctive patterns)
Store their pixel positions: [(x1, y1), (x2, y2), ...]
Frame 2: Detect features again
MATCH them to Frame 1's features: "Feature A moved 12 pixels right"
From how features moved, INFER how the camera moved
From camera motion + feature motion, TRIANGULATE 3D positions
Frame 3: Repeat. Now you have camera poses for frames 1, 2, 3
and 3D positions for tracked features
This IS the map. The camera poses ARE the localization.
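To make the triangulation step concrete, here is a sketch under a deliberately simplified assumption: the camera translates sideways by a known distance between two frames, so a feature's depth follows from how far it shifts in the image (its disparity). Real SLAM triangulates under arbitrary motion, but the underlying geometry is this same similar-triangles relationship.

```python
def triangulate_depth(focal_px, baseline_m, disparity_px):
    """Depth of a tracked feature from two views, assuming pure sideways
    camera translation (a simplified stand-in for full triangulation)."""
    return focal_px * baseline_m / disparity_px

# "Feature A moved 12 pixels right" after a 0.1m sideways camera move,
# with a 500-pixel focal length:
depth = triangulate_depth(500.0, 0.1, 12.0)  # ~4.17 meters
```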
The specific SLAM system used here is called VINS-Mono. The "VI" stands for Visual-Inertial, meaning it also uses an IMU (Inertial Measurement Unit), which is a chip with an accelerometer and a gyroscope. Your phone has one. It measures acceleration in meters per second squared.
The IMU is critical because it gives SLAM something pure vision cannot: real-world scale. The accelerometer measures in physical units (m/s^2), so when SLAM integrates those measurements over time, the resulting positions are in real meters. Without the IMU, visual-only SLAM has the same scale ambiguity as the depth model. With it, you get metric positions.
The result: SLAM gives you a few hundred 3D points per frame, each with a real-world position in meters. But only a few hundred, not the millions you get from LiDAR, and only where distinctive visual features happen to exist. Smooth walls, flat ground, and uniform surfaces produce no features.
SLAM output per frame:
Camera position: (x=2.3m, y=0.1m, z=1.4m)
Camera orientation: (roll, pitch, yaw)
3D feature points: [(3.1m, 0.5m, -0.2m), (4.7m, 1.2m, 0.3m), ...]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Only ~100-500 points. Sparse. But in real meters.
Why Neither Works Alone
Now the situation is clear:
| System | Coverage | Scale |
|---|---|---|
| Depth Anything V2 | Every pixel (dense) | Unknown (relative only) |
| VINS-Mono SLAM | A few hundred points (sparse) | Real meters (metric) |
The depth model sees everything but does not know how far anything is in real units. SLAM knows exact distances but only for a handful of points. Each system has precisely what the other lacks.
Implementation
Phase 1: The Rescaling Trick
This is the core idea, and it is surprisingly simple.
For each SLAM point, you know three things:
- Its metric depth from SLAM (e.g., 4.2 meters)
- Its pixel position in the image (e.g., pixel 340, 220)
- Its relative depth from Depth Anything V2 at that same pixel (e.g., 0.38)
From these, you can compute a scale factor:
# For one SLAM point:
scale = metric_depth / relative_depth
scale = 4.2 / 0.38
scale = 11.05

# Apply to ALL pixels:
for every pixel in depth_map:
    metric_depth[pixel] = scale * relative_depth[pixel]
In practice, you have hundreds of SLAM points, and each gives a slightly different scale estimate due to noise. So you take a robust estimate across all of them, typically a median or a least-squares fit:
from statistics import median

def compute_scale(slam_points, depth_map):
    scales = []
    for point in slam_points:
        # project_to_image: camera-model helper that maps a 3D point
        # to its pixel coordinates (defined elsewhere in the pipeline)
        pixel_x, pixel_y = project_to_image(point.position_3d)
        relative_d = depth_map[pixel_y][pixel_x]
        metric_d = point.depth
        if relative_d > 0:  # avoid division by zero
            scales.append(metric_d / relative_d)
    # Median is more robust to outliers than mean
    return median(scales)
Now multiply the entire depth map by this single scale factor, and every pixel has a metric depth estimate. You just turned a relative depth map into a metric one.
Before rescaling: After rescaling:
Nearby tree → 0.15 Nearby tree → 1.7m
Ground ahead → 0.42 Ground ahead → 4.6m
Far treeline → 0.91 Far treeline → 10.1m
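In code, the rescale itself is a single multiply. A sketch with the example values above (NumPy assumed):

```python
import numpy as np

relative = np.array([0.15, 0.42, 0.91])  # tree, ground, treeline (from above)
scale = 11.05                            # the scale factor computed earlier
metric = scale * relative                # ~[1.7, 4.6, 10.1] meters
```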
The result: dense metric depth from a single camera. Every pixel now has a distance in meters, just like LiDAR would give you, except computed from a $30 camera and some clever math.
Phase 2: Fixing the Jitter (Temporal Smoothing)
The rescaling trick works in theory. In practice, the scale factor jumps around.
SLAM points come and go as the robot moves. In one frame, you might have 300 well-tracked features with consistent scale estimates. In the next frame, the robot turns sharply, SLAM loses half its features, and the remaining ones give a different scale. The computed scale might jump from 11.0 to 14.5 between two frames.
From the robot's perspective, a tree that was 4 meters away just teleported to 5.3 meters away for one frame, then snapped back. The path planner sees the obstacle map changing erratically and makes bad decisions.
The fix is an exponential moving average, which is the same technique used in financial charts to smooth stock price jitter:
# alpha controls how much smoothing:
# alpha = 1.0 → no smoothing, use raw value
# alpha = 0.0 → infinite smoothing, never update
# alpha = 0.2 → heavy smoothing (the paper uses this)
stable_scale = None

def smooth_scale(new_raw_scale, alpha=0.2):
    global stable_scale
    if stable_scale is None:
        stable_scale = new_raw_scale
    else:
        stable_scale = (1 - alpha) * stable_scale + alpha * new_raw_scale
    return stable_scale
Each frame, the stable scale moves 20% toward the new measurement and keeps 80% of the previous value. Quick noise gets dampened out. Real changes (the robot drives closer to something) still come through, just slightly delayed.
Frame 1: raw_scale = 11.0 → stable_scale = 11.0
Frame 2: raw_scale = 14.5 → stable_scale = 11.0 * 0.8 + 14.5 * 0.2 = 11.7
Frame 3: raw_scale = 10.8 → stable_scale = 11.7 * 0.8 + 10.8 * 0.2 = 11.5
Frame 4: raw_scale = 11.2 → stable_scale = 11.5 * 0.8 + 11.2 * 0.2 = 11.4
The spike at Frame 2 barely registers. The stable scale stays near 11, which is where it should be.
Phase 3: Removing Phantom Obstacles (Edge Masking)
This one is subtle and took real debugging to identify.
Depth Anything V2 produces excellent depth estimates in the middle of objects. But at the edges, where a tree trunk meets the sky, the depth map does not produce a sharp cutoff. It creates a smooth gradient from "3 meters" (the tree) to "50 meters" (the sky) across several pixels.
When you convert these depth values to 3D points, those gradient pixels become points floating in mid-air between the tree and the sky. To the robot's navigation system, those floating points look like invisible walls behind every obstacle.
What the camera sees:
| sky (far) | tree (near) | sky (far) |
| | | |
What the depth model produces at the tree edge:
| 50m | 50m | 35m | 12m | 3m | 3m | 3m | 15m | 40m | 50m |
^^^^^^^^^^^ ^^^^^^^^^^^
Blurry transition Blurry transition
These become phantom These too
obstacles in 3D
What the robot's planner sees:
[ clear ] [WALL??] [ tree ] [WALL??] [ clear ]
↑ ↑
"I can't go "I can't go
behind the behind the
tree!" tree!"
The robot refuses to navigate around obstacles because it thinks there are walls behind them.
The fix: detect these blurry edges and remove them. The paper uses a Sobel filter, which is a simple 3x3 kernel that computes how fast pixel values change in the x and y directions. High change rate = an edge.
import numpy as np
from scipy.ndimage import binary_dilation

def create_edge_mask(depth_map, threshold=0.1):
    # Compute gradients (how fast depth changes between neighboring pixels);
    # np.gradient is a simplified stand-in for the Sobel kernel
    gradient_y, gradient_x = np.gradient(depth_map)
    # Combine into gradient magnitude
    gradient = np.sqrt(gradient_x**2 + gradient_y**2)
    # Where the gradient is high, depth is changing fast = a blurry edge
    edge_pixels = gradient > threshold
    # Dilate the mask by 5 pixels in each direction
    # (the blur extends a few pixels around the actual edge)
    edge_mask = binary_dilation(edge_pixels, iterations=5)
    return edge_mask

# Usage: zero out depth values at edges before converting to 3D points
depth_map[create_edge_mask(depth_map)] = 0  # these pixels won't become 3D points
You lose a thin strip of depth information around every object boundary. But you eliminate the phantom walls entirely. The robot can now see that there is clear space behind a tree trunk, just like it would with LiDAR.
Phase 4: From Depth Map to Navigation
Once you have a clean metric depth map, the rest of the pipeline converts it into something a path planner can use:
1. Depth map to point cloud: Each pixel with a metric depth value becomes a 3D point. A 640x480 image with valid depth at 80% of pixels gives you ~245,000 3D points. That is comparable to a mid-range LiDAR.

2. Ground segmentation: Separate "this is the ground" from "this is an obstacle." The system uses a technique called Cloth Simulation Filter: imagine dropping a stiff cloth onto an upside-down version of your point cloud. The cloth settles on what would be the ground surface. Points below the cloth are ground, points above are obstacles.

3. Elevation map: Convert the obstacle points into a 2D grid where each cell stores the height of whatever is there. Flat ground cells have near-zero height. A rock might be 0.5 meters. A tree trunk might be 2 meters. The path planner treats any cell above a threshold as impassable.

4. Path planning: A standard A*-style planner finds a route through the elevation map from the robot's current position to the goal, avoiding cells marked as obstacles.
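Step 1 is just the pinhole projection run in reverse. A minimal sketch, assuming known camera intrinsics (focal lengths `fx`, `fy` and principal point `cx`, `cy`, all in pixels):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into camera-frame 3D points.
    Pixels with depth 0 (e.g., masked edges) are skipped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape: (num_points, 3)
```

A 640x480 depth map with valid depth at 80% of pixels yields the ~245,000 points mentioned above.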
Camera image
↓
Depth Anything V2 → relative depth map
↓
SLAM rescaling → metric depth map
↓
Edge masking → clean metric depth map
↓
Project to 3D → point cloud (~200k points)
↓
Ground segmentation → ground vs obstacle points
↓
Elevation map → 2D grid with heights
↓
Path planner → "go left, then straight for 5m"
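Most of these stages are heavier components, but the elevation map is simple enough to sketch: bin obstacle points into a grid and keep the maximum height per cell. The cell size, grid dimensions, and height threshold below are illustrative values, not from the paper:

```python
import numpy as np

def build_elevation_map(points, cell_m=0.25, grid_size=40):
    """2D grid of max heights. points: (N, 3) array of (x, y, height),
    with x, y in meters relative to the robot at the grid center."""
    grid = np.zeros((grid_size, grid_size))
    half = grid_size // 2
    for x, y, h in points:
        i = int(np.floor(x / cell_m)) + half
        j = int(np.floor(y / cell_m)) + half
        if 0 <= i < grid_size and 0 <= j < grid_size:
            grid[i, j] = max(grid[i, j], h)
    return grid

# A 0.5m-tall rock 2m ahead becomes one impassable cell
# (using an illustrative 0.3m height threshold):
grid = build_elevation_map(np.array([[2.0, 0.0, 0.5]]))
impassable = grid > 0.3
```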
How It All Fits Together
The full system runs as a loop, processing each camera frame through the pipeline:
┌──────────────────────────────────────────────────────┐
│ Camera Frame │
└──────────────────┬───────────────────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ Depth Anything│ │ VINS-Mono │
│ V2 │ │ (SLAM) │
│ │ │ │
│ Dense relative│ │ Camera pose + │
│ depth map │ │ sparse metric │
│ (every pixel) │ │ 3D points │
└───────┬───────┘ └────────┬─────────┘
│ │
└──────────┬──────────┘
▼
┌─────────────────────┐
│ Scale Rescaling │
│ + Temporal Smoothing│
│ + Edge Masking │
└──────────┬──────────┘
▼
┌─────────────────────┐
│ Dense metric depth │
│ (every pixel, in │
│ real meters) │
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 3D Point Cloud │
│ → Ground Segment. │
│ → Elevation Map │
└──────────┬──────────┘
▼
┌─────────────────────┐
│ Path Planner │
│ → Motor Commands │
└─────────────────────┘
The entire loop runs at about 10 frames per second on an NVIDIA Jetson Orin (a GPU board roughly the size of a credit card). For comparison, the same pipeline with a LiDAR sensor instead of the camera runs at 20 FPS, because you skip the depth estimation step entirely.
Lessons Learned
The results are surprisingly competitive. In simulation tests with photorealistic environments (trees, rocks, uneven terrain), the monocular camera achieved a 93% success rate at reaching navigation goals, while LiDAR achieved 63-67%. The camera setup actually outperformed LiDAR in most scenarios. In real-world tests, both hit 100% success rates, though the camera paths were about 22% longer due to slower obstacle detection.
| Configuration | Success Rate (sim, medium) | Path Efficiency (SPL) | Perception Speed |
|---|---|---|---|
| LiDAR (128-channel) | 63-67% | 0.49-0.60 | 20 Hz |
| Monocular camera | 93% | 0.83 | 10 Hz |

(SPL is Success weighted by Path Length: 1.0 means every successful run took the shortest possible path.)
Edge masking is the unsung hero. Without it, the robot gets stuck constantly because it sees phantom walls behind every obstacle. The ablation study showed that enabling both edge masking and temporal smoothing improved path efficiency from 0.39 to 0.58, a 49% improvement. Most of that gain comes from edge masking enabling the robot to navigate around obstacles rather than treating them as impassable walls.
It breaks on soft obstacles. High grass is the worst case. The depth model sees grass as a surface at a certain height and the system has no way to know that you can drive through it. LiDAR has the same problem to some degree, but its higher resolution and update rate help. This is a fundamental limitation: you need a separate system that classifies whether an obstacle is traversable, not just where it is.
It cannot see holes. A depression in the ground, a ditch, a hidden drop-off: these produce no signal in the depth map because they are below the ground plane. LiDAR can sometimes detect these from the right angle. A single forward-facing camera cannot. This is a real safety concern for off-road applications.
Speed matters more than accuracy. The monocular system is accurate enough for navigation, but it runs at half the frame rate of the LiDAR pipeline. In real-world tests, this showed up as the robot reacting more slowly to obstacles, producing longer, more hesitant paths. The depth estimation model (Depth Anything V2) is the bottleneck. A faster model with slightly worse accuracy would likely improve overall navigation performance.
The combination is more powerful than either component. This is the central lesson. Neither Depth Anything V2 nor VINS-Mono alone can navigate a robot through a forest. The depth model gives you dense coverage with no scale. SLAM gives you metric scale with sparse coverage. The rescaling trick, which is just a division, bridges the gap. The engineering insight is recognizing that these two systems complement each other and combining them with minimal machinery.
What's Next
The paper identifies several natural extensions:
- Traversability classification: Instead of treating everything above ground level as an obstacle, train a model to distinguish "hard obstacle (rock)" from "soft obstacle (tall grass you can drive through)"
- Negative obstacle detection: Detecting holes and drop-offs likely requires either a different sensor angle, a second camera, or temporal analysis of how the ground plane changes as the robot approaches
- Faster depth models: Depth Anything V2 is accurate but slow. Lighter models exist that trade some accuracy for 2-3x speed. At 20+ FPS, the monocular pipeline would match LiDAR reaction times
The broader takeaway is that foundation models (large AI models pretrained on massive datasets) are becoming practical building blocks for robotics. Depth Anything V2 was not trained for off-road robot navigation. It was trained on general images from the internet. Yet it works well enough, out of the box, with zero fine-tuning, to navigate a robot through a forest. That is a shift in how robotic systems get built.
References
- An Open-Source LiDAR and Monocular Off-Road Autonomous Navigation Stack (arXiv 2604.03096)
- Depth Anything V2: Fine-tuned Large-Scale Monocular Depth Estimation
- VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator
- LARIAD/Offroad-Nav: Open-source code from the paper
- Cloth Simulation Filter for Ground Segmentation (Zhang et al., 2016)