Depth estimation with DepthAnything and OpenVINO#

Depth Anything is a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, this project aims to build a simple yet powerful foundation model that handles any image under any circumstances. The framework of Depth Anything is shown below. It adopts a standard pipeline to unleash the power of large-scale unlabeled images.

[image.png: Depth Anything framework]

More details about the model can be found on the project web page, in the paper, and in the official repository.

In this tutorial we will explore how to convert and run DepthAnything using OpenVINO. An additional part demonstrates how to run quantization with NNCF to speed up the model.


from pathlib import Path
import platform

repo_dir = Path("Depth-Anything")

if not repo_dir.exists():
    !git clone
%cd Depth-Anything

%pip install -q "openvino>=2023.3.0" "datasets>=2.14.6" "nncf" "tqdm"
%pip install -q "typing-extensions>=4.9.0" eval-type-backport "gradio>=4.19"
%pip install -q -r requirements.txt --extra-index-url

if platform.python_version_tuple()[1] in ["8", "9"]:
    %pip install -q "gradio-imageslider<=0.0.17" "typing-extensions>=4.9.0"
Cloning into 'Depth-Anything'...
remote: Enumerating objects: 421, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 421 (delta 101), reused 43 (delta 39), pack-reused 277
Receiving objects: 100% (421/421), 237.89 MiB | 26.31 MiB/s, done.
Resolving deltas: 100% (144/144), done.
Load and run PyTorch model#

To be able to run the PyTorch model on CPU, we should first disable the xformers attention optimizations.

attention_file_path = Path("./torchhub/facebookresearch_dinov2_main/dinov2/layers/")
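orig_attention_path = attention_file_path.parent / ("orig_" + attention_file_path.name)

if not orig_attention_path.exists():
    # Keep an untouched copy of the original file, then patch from it.
    attention_file_path.rename(orig_attention_path)

    with orig_attention_path.open("r") as f:
        data = f.read()
        data = data.replace("XFORMERS_AVAILABLE = True", "XFORMERS_AVAILABLE = False")
        with attention_file_path.open("w") as out_f:
            out_f.write(data)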
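
The patching step above follows a common pattern: keep a backup of a source file, then write a modified copy in its place. The following is a minimal sketch of that pattern on a throwaway file. The helper name `patch_flag` and the file name `demo_layer.py` are hypothetical and are not part of the Depth Anything repository.

```python
import tempfile
from pathlib import Path


def patch_flag(source_path: Path, old: str, new: str) -> Path:
    """Replace `old` with `new` in a file, keeping an untouched `orig_` copy."""
    backup_path = source_path.parent / ("orig_" + source_path.name)
    if not backup_path.exists():
        # Move the original aside once; later runs always patch from this copy.
        source_path.rename(backup_path)
    text = backup_path.read_text()
    source_path.write_text(text.replace(old, new))
    return backup_path


# Demonstration on a temporary file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_layer.py"
    demo_file.write_text("XFORMERS_AVAILABLE = True\n")
    backup = patch_flag(demo_file, "XFORMERS_AVAILABLE = True", "XFORMERS_AVAILABLE = False")
    patched_text = demo_file.read_text()
    backup_text = backup.read_text()
```

Because the helper always reads from the backup, calling it twice gives the same result as calling it once.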

The DepthAnything.from_pretrained method creates a PyTorch model class instance and loads the model weights. There are 3 models available in the repository, depending on the ViT encoder size:

* Depth-Anything-ViT-Small (24.8M)
* Depth-Anything-ViT-Base (97.5M)
* Depth-Anything-ViT-Large (335.3M)

We will use Depth-Anything-ViT-Small, but the same steps for running the model and converting it to OpenVINO apply to the other models in the DepthAnything family.

from depth_anything.dpt import DepthAnything

encoder = "vits"  # can also be 'vitb' or 'vitl'
model_id = "depth_anything_{:}14".format(encoder)
depth_anything = DepthAnything.from_pretrained(f"LiheYoung/{model_id}")
Prepare input data#

from PIL import Image

import requests

r = requests.get(

open("", "w").write(r.text)
from notebook_utils import download_file

)

Image.open("furseal.png").resize((600, 400))
For ease of use, the model authors provide helper functions for preprocessing the input image. The main requirements are that the image size is divisible by 14 (the ViT patch size) and that pixel values are normalized to the [0, 1] range.

from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet
from torchvision.transforms import Compose

import cv2
import torch

transform = Compose(
        NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),

image = cv2.cvtColor(cv2.imread("furseal.png"), cv2.COLOR_BGR2RGB) / 255.0
h, w = image.shape[:-1]
image = transform({"image": image})["image"]
image = torch.from_numpy(image).unsqueeze(0)
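
To make the two preprocessing requirements concrete, here is a small pure-Python sketch. It is not the repository's `Resize` or `NormalizeImage` implementation. The helpers `round_to_multiple` and `normalize_pixel` are hypothetical names introduced only for illustration, and rounding to the nearest multiple is an assumption about how a size could be adjusted.

```python
PATCH_SIZE = 14  # ViT patch size mentioned in the text
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # same values as in the transform above
IMAGENET_STD = (0.229, 0.224, 0.225)


def round_to_multiple(value: float, multiple: int = PATCH_SIZE) -> int:
    """Round a side length to the nearest multiple, never below one patch."""
    return max(multiple, int(round(value / multiple)) * multiple)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 600x400 image like the one used in this tutorial.
target_h = round_to_multiple(400)
target_w = round_to_multiple(600)
pixel = normalize_pixel((255, 0, 128))
```

Both resulting side lengths are divisible by 14, so the image splits into a whole number of patches.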

Run model inference#

The preprocessed image is passed to the model's forward method, and the model returns a depth map in B x H x W format, where B is the input batch size, H is the preprocessed image height, and W is the preprocessed image width.

# depth shape: 1xHxW
depth = depth_anything(image)

Once the image has been processed, we can resize the depth map to the original image size and prepare it for visualization.

import torch.nn.functional as F
import numpy as np

depth = F.interpolate(depth[None], (h, w), mode="bilinear", align_corners=False)[0, 0]
depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0

depth = depth.cpu().detach().numpy().astype(np.uint8)
depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)
from matplotlib import pyplot as plt

plt.imshow(depth_color[:, :, ::-1]);
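
The min-max scaling used above divides by `depth.max() - depth.min()`, which is zero for a perfectly flat depth map. The sketch below shows one way to guard against that case with NumPy. The function name `depth_to_uint8` and the `eps` threshold are assumptions made for this example and are not part of the original notebook.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max scale a float depth map to the 0..255 range as uint8."""
    depth = depth.astype(np.float32)
    value_range = depth.max() - depth.min()
    if value_range < eps:
        # A constant map has no contrast; return zeros instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - depth.min()) / value_range * 255.0
    return scaled.astype(np.uint8)


demo_depth = np.array([[0.5, 1.0], [1.5, 2.5]], dtype=np.float32)
demo_gray = depth_to_uint8(demo_depth)
flat_gray = depth_to_uint8(np.full((2, 2), 3.0))
```

The result can be passed to `cv2.applyColorMap` in the same way as the `depth` array above.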

Convert Model to OpenVINO IR format#

OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation (IR). The OpenVINO model conversion API should be used for this purpose. The ov.convert_model function accepts the original PyTorch model instance and an example input for tracing, and returns an ov.Model representing this model in the OpenVINO framework. The converted model can be saved to disk using the ov.save_model function or loaded directly on a device using core.compile_model.

import openvino as ov

OV_DEPTH_ANYTHING_PATH = Path(f"{model_id}.xml")

if not OV_DEPTH_ANYTHING_PATH.exists():
    ov_model = ov.convert_model(depth_anything, example_input=image, input=[1, 3, 518, 518])
    ov.save_model(ov_model, OV_DEPTH_ANYTHING_PATH)
Run OpenVINO model inference#

Now we are ready to run the OpenVINO model.

Select inference device#

To get started, select an inference device from the dropdown list.

import ipywidgets as widgets

core = ov.Core()

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
)

compiled_model = core.compile_model(OV_DEPTH_ANYTHING_PATH, device.value)

Run inference on image#

res = compiled_model(image)[0]
def get_depth_map(model_output):
    depth = model_output[0]
    depth = cv2.resize(depth, (w, h))
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0

    depth = depth.astype(np.uint8)
    depth_color = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)
    return depth_color

depth_color = get_depth_map(res)
plt.imshow(depth_color[:, :, ::-1]);

Run inference on video#

    "./Coco Walking in Berkeley.mp4",

VIDEO_FILE = "./Coco Walking in Berkeley.mp4"
# Number of seconds of input video to process. Set `NUM_SECONDS` to 0 to process
# the full video.
# Set `ADVANCE_FRAMES` to 1 to process every frame from the input video
# Set `ADVANCE_FRAMES` to 2 to process every second frame. This reduces
# the time it takes to process the video.
# Set `SCALE_OUTPUT` to reduce the size of the result video
# If `SCALE_OUTPUT` is 0.5, the width and height of the result video
# will be half the width and height of the input video.
# The format to use for video encoding. The `vp09` is slow,
# but it works on most systems.
# Try the `THEO` encoding if you have FFMPEG installed.
# FOURCC = cv2.VideoWriter_fourcc(*"THEO")
FOURCC = cv2.VideoWriter_fourcc(*"vp09")

# Create Path objects for the input video and the result video.
output_directory = Path("output")
result_video_path = output_directory / f"{Path(VIDEO_FILE).stem}_depth_anything.mp4"
cap = cv2.VideoCapture(str(VIDEO_FILE))
ret, image = cap.read()
if not ret:
    raise ValueError(f"The video at {VIDEO_FILE} cannot be read.")
input_fps = cap.get(cv2.CAP_PROP_FPS)
input_video_frame_height, input_video_frame_width = image.shape[:2]

target_fps = input_fps / ADVANCE_FRAMES
target_frame_height = int(input_video_frame_height * SCALE_OUTPUT)
target_frame_width = int(input_video_frame_width * SCALE_OUTPUT)

print(f"The input video has a frame width of {input_video_frame_width}, " f"frame height of {input_video_frame_height} and runs at {input_fps:.2f} fps")
print(
    "The output video will be scaled with a factor "
    f"{SCALE_OUTPUT}, have width {target_frame_width}, "
    f"height {target_frame_height}, and run at {target_fps:.2f} fps"
)
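
The relationship between the three settings and the output video can be sketched without OpenCV. The function `plan_output` below is a hypothetical helper, and the numbers passed to it (30 fps, 900 frames, 1280x720) are made-up example values, not properties of the video used in this tutorial.

```python
def plan_output(input_fps, frame_count, width, height, num_seconds, advance_frames, scale_output):
    """Derive output video parameters from the input video and the settings."""
    # num_seconds == 0 means "process the full video".
    frames_to_read = int(num_seconds * input_fps) if num_seconds > 0 else int(frame_count)
    return {
        # Skipping frames lowers the output frame rate by the same factor.
        "target_fps": input_fps / advance_frames,
        "target_size": (int(width * scale_output), int(height * scale_output)),
        # Frames 0, advance_frames, 2 * advance_frames, ... are sent to the network.
        "processed_frames": len(range(0, frames_to_read, advance_frames)),
    }


plan = plan_output(
    input_fps=30.0, frame_count=900, width=1280, height=720,
    num_seconds=4, advance_frames=2, scale_output=0.5,
)
full_plan = plan_output(
    input_fps=30.0, frame_count=900, width=1280, height=720,
    num_seconds=0, advance_frames=1, scale_output=1.0,
)
```

With these example values, processing 4 seconds at every second frame means 60 inference calls instead of 120.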
def normalize_minmax(data):
    """Normalizes the values in `data` between 0 and 1"""
    return (data - data.min()) / (data.max() - data.min())

def convert_result_to_image(result, colormap="viridis"):
    """
    Convert network result of floating point numbers to an RGB image with
    integer values from 0-255 by applying a colormap.

    `result` is expected to be a single network result in 1,H,W shape
    `colormap` is a matplotlib colormap.
    """
    result = result.squeeze(0)
    result = normalize_minmax(result)
    result = result * 255
    result = result.astype(np.uint8)
    result = cv2.applyColorMap(result, cv2.COLORMAP_INFERNO)[:, :, ::-1]
    return result

def to_rgb(image_data) -> np.ndarray:
    """Convert image_data from BGR to RGB"""
    return cv2.cvtColor(image_data, cv2.COLOR_BGR2RGB)
import time
from IPython.display import (
    FileLink,
    Pretty,
    ProgressBar,
    Video,
    clear_output,
    display,
)

def process_video(compiled_model, video_file, result_video_path):
    # Initialize variables.
    input_video_frame_nr = 0
    start_time = time.perf_counter()
    total_inference_duration = 0

    # Open the input video
    cap = cv2.VideoCapture(str(video_file))

    # Create a result video.
    out_video = cv2.VideoWriter(
        str(result_video_path),
        FOURCC,
        target_fps,
        (target_frame_width * 2, target_frame_height),
    )

    num_frames = int(NUM_SECONDS * input_fps)
    total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT) if num_frames == 0 else num_frames
    progress_bar = ProgressBar(total=total_frames)
    progress_bar.display()

    try:
        while cap.isOpened():
            ret, image = cap.read()
            if not ret:
                cap.release()
                break

            if input_video_frame_nr >= total_frames:
                break

            h, w = image.shape[:-1]
            input_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255.0
            input_image = transform({"image": input_image})["image"]
            # Reshape the image to network input shape NCHW.
            input_image = np.expand_dims(input_image, 0)

            # Do inference.
            inference_start_time = time.perf_counter()
            result = compiled_model(input_image)[0]
            inference_stop_time = time.perf_counter()
            inference_duration = inference_stop_time - inference_start_time
            total_inference_duration += inference_duration

            if input_video_frame_nr % (10 * ADVANCE_FRAMES) == 0:
                clear_output(wait=True)
                progress_bar.display()
                # input_video_frame_nr // ADVANCE_FRAMES gives the number of
                # Frames that have been processed by the network.
                display(
                    Pretty(
                        f"Processed frame {input_video_frame_nr // ADVANCE_FRAMES}"
                        f"/{total_frames // ADVANCE_FRAMES}. "
                        f"Inference time per frame: {inference_duration:.2f} seconds "
                        f"({1/inference_duration:.2f} FPS)"
                    )
                )

            # Transform the network result to a RGB image.
            result_frame = to_rgb(convert_result_to_image(result))
            # Resize the image and the result to a target frame shape.
            result_frame = cv2.resize(result_frame, (target_frame_width, target_frame_height))
            image = cv2.resize(image, (target_frame_width, target_frame_height))
            # Put the image and the result side by side.
            stacked_frame = np.hstack((image, result_frame))
            # Save a frame to the video.
            out_video.write(stacked_frame)

            input_video_frame_nr = input_video_frame_nr + ADVANCE_FRAMES
            cap.set(1, input_video_frame_nr)

            progress_bar.progress = input_video_frame_nr

    except KeyboardInterrupt:
        print("Processing interrupted.")
    finally:
        # Release the video resources even if processing was interrupted.
        clear_output()
        processed_frames = num_frames // ADVANCE_FRAMES
        out_video.release()
        cap.release()
        end_time = time.perf_counter()
        duration = end_time - start_time

        print(
            f"Processed {processed_frames} frames in {duration:.2f} seconds. "
            f"Total FPS (including video processing): {processed_frames/duration:.2f}. "
            f"Inference FPS: {processed_frames/total_inference_duration:.2f} "
        )
        print(f"Video saved to '{str(result_video_path)}'.")
    return stacked_frame
stacked_frame = process_video(compiled_model, VIDEO_FILE, result_video_path)
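
The function above reports two different rates: an inference-only FPS and a total FPS that also includes reading, resizing, and writing frames. The sketch below isolates that bookkeeping. `timed_call` and `fake_inference` are hypothetical names, and `fake_inference` merely stands in for `compiled_model(input_image)`.

```python
import time


def timed_call(fn, *args):
    """Run `fn(*args)` and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_inference(frame):
    """Hypothetical stand-in for a compiled model call."""
    return [value * 2 for value in frame]


total_inference_duration = 0.0
processed_frames = 0
loop_start = time.perf_counter()
for frame in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    _, duration = timed_call(fake_inference, frame)
    total_inference_duration += duration
    processed_frames += 1
total_duration = time.perf_counter() - loop_start

# Guard against a zero duration on timers with coarse resolution.
inference_fps = processed_frames / total_inference_duration if total_inference_duration > 0 else float("inf")
total_fps = processed_frames / total_duration if total_duration > 0 else float("inf")
```

The inference-only rate is an upper bound on the total rate, because the total duration also covers everything that happens between model calls.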
def display_video(stacked_frame):
    video = Video(result_video_path, width=800, embed=True)
    if not result_video_path.exists():
        # Show the last processed frame before reporting the failure.
        plt.imshow(stacked_frame)
        raise ValueError("OpenCV was unable to write the video file. Showing one video frame.")
    else:
        print(f"Showing video saved at\n{result_video_path.resolve()}")
        print("If you cannot see the video in your browser, please click on the " "following link to download the video ")
        video_link = FileLink(result_video_path)
        video_link.html_link_str = "<a href='%s' download>%s</a>"
        display(video_link)
        display(video)
