Video Recognition using SlowFast and OpenVINO™

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS. To run without installing anything, click the “launch binder” button.

Binder Github

Teaching machines to detect, understand and analyze the contents of images has been one of the more well-known and well-studied problems in computer vision. However, analyzing videos to understand what is happening in them and detecting objects of interest are equally important and challenging tasks that have widespread applications in several areas including autonomous driving, healthcare, security, and many more.

The SlowFast model puts forth an interesting approach to analyzing videos based on the intuition that videos typically contain static as well as dynamic elements- use a slow pathway operating at a low frame rate to analyze the static content and a fast pathway operating at a high frame rate to capture dynamic content. Its strength lies in its ability to effectively capture both fast and slow-motion information in video sequences, making it particularly well-suited to tasks that require a temporal and spatial understanding of the data.

image0

More details about the network can be found in the original paper and repository.

In this notebook, we will see how to convert and run a ResNet-50 based SlowFast model using OpenVINO.

This tutorial consists of the following steps

  • Preparing the PyTorch model

  • Download and prepare data

  • Check inference with the PyTorch model

  • Convert to ONNX format

  • Convert ONNX Model to OpenVINO Intermediate Representation

  • Verify inference with the converted model

Table of contents:

Prepare PyTorch Model

Install necessary packages

!pip install -q "openvino>=2023.1.0"
!pip install -q fvcore

Imports and Settings

import json
import math
import sys
import cv2
import torch
import numpy as np
from pathlib import Path
from typing import Any, List, Dict
from IPython.display import Video
import openvino as ov

sys.path.append("../utils")
from notebook_utils import download_file
DATA_DIR = Path("data/")
MODEL_DIR = Path("model/")
MODEL_DIR.mkdir(exist_ok=True)
DATA_DIR.mkdir(exist_ok=True)

To begin, we download the PyTorch model from the PyTorchVideo repository. In this notebook, we will be using a SlowFast Network based on the ResNet-50 architecture trained on the Kinetics 400 dataset. Kinetics 400 is a large-scale dataset for action recognition in videos, containing 400 human action classes, with at least 400 video clips for each action. Read more about the dataset and the paper here.

MODEL_NAME = "slowfast_r50"
MODEL_REPOSITORY = "facebookresearch/pytorchvideo"
DEVICE = "cpu"

# load the pretrained model from the repository
model = torch.hub.load(
    repo_or_dir=MODEL_REPOSITORY, model=MODEL_NAME, pretrained=True, skip_validation=True
)

# set the device to allocate tensors to. for example, "cpu" or "cuda"
model.to(DEVICE)

# set the model to eval mode
model.eval()
Using cache found in /opt/home/k8sworker/.cache/torch/hub/facebookresearch_pytorchvideo_main
Net(
  (blocks): ModuleList(
    (0): MultiPathWayWithFuse(
      (multipathway_blocks): ModuleList(
        (0): ResNetBasicStem(
          (conv): Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
          (norm): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
          (pool): MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=[0, 1, 1], dilation=1, ceil_mode=False)
        )
        (1): ResNetBasicStem(
          (conv): Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False)
          (norm): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
          (pool): MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=[0, 1, 1], dilation=1, ceil_mode=False)
        )
      )
      (multipathway_fusion): FuseFastToSlow(
        (conv_fast_to_slow): Conv3d(8, 16, kernel_size=(7, 1, 1), stride=(4, 1, 1), padding=(3, 0, 0), bias=False)
        (norm): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): ReLU()
      )
    )
    (1): MultiPathWayWithFuse(
      (multipathway_blocks): ModuleList(
        (0): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(80, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
              (branch1_norm): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(80, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_a): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-2): 2 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(256, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_a): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
        (1): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(8, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
              (branch1_norm): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(8, 8, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(8, 8, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(8, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-2): 2 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(32, 8, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(8, 8, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(8, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
      )
      (multipathway_fusion): FuseFastToSlow(
        (conv_fast_to_slow): Conv3d(32, 64, kernel_size=(7, 1, 1), stride=(4, 1, 1), padding=(3, 0, 0), bias=False)
        (norm): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): ReLU()
      )
    )
    (2): MultiPathWayWithFuse(
      (multipathway_blocks): ModuleList(
        (0): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(320, 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(320, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_a): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-3): 3 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(512, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_a): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
        (1): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(32, 64, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(32, 16, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(16, 16, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(16, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-3): 3 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(64, 16, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(16, 16, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(16, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
      )
      (multipathway_fusion): FuseFastToSlow(
        (conv_fast_to_slow): Conv3d(64, 128, kernel_size=(7, 1, 1), stride=(4, 1, 1), padding=(3, 0, 0), bias=False)
        (norm): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): ReLU()
      )
    )
    (3): MultiPathWayWithFuse(
      (multipathway_blocks): ModuleList(
        (0): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(640, 1024, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(640, 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(256, 256, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(256, 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-5): 5 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(1024, 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(256, 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(256, 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
        (1): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(64, 128, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(64, 32, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-5): 5 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(128, 32, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
      )
      (multipathway_fusion): FuseFastToSlow(
        (conv_fast_to_slow): Conv3d(128, 256, kernel_size=(7, 1, 1), stride=(4, 1, 1), padding=(3, 0, 0), bias=False)
        (norm): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (activation): ReLU()
      )
    )
    (4): MultiPathWayWithFuse(
      (multipathway_blocks): ModuleList(
        (0): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(1280, 2048, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(1280, 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(512, 512, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(512, 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-2): 2 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(2048, 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(512, 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(512, 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
        (1): ResStage(
          (res_blocks): ModuleList(
            (0): ResBlock(
              (branch1_conv): Conv3d(128, 256, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
              (branch1_norm): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(128, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
            (1-2): 2 x ResBlock(
              (branch2): BottleneckBlock(
                (conv_a): Conv3d(256, 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
                (norm_a): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_a): ReLU()
                (conv_b): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
                (norm_b): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                (act_b): ReLU()
                (conv_c): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
                (norm_c): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
              )
              (activation): ReLU()
            )
          )
        )
      )
      (multipathway_fusion): Identity()
    )
    (5): PoolConcatPathway(
      (pool): ModuleList(
        (0): AvgPool3d(kernel_size=(8, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
        (1): AvgPool3d(kernel_size=(32, 7, 7), stride=(1, 1, 1), padding=(0, 0, 0))
      )
    )
    (6): ResNetBasicHead(
      (dropout): Dropout(p=0.5, inplace=False)
      (proj): Linear(in_features=2304, out_features=400, bias=True)
      (output_pool): AdaptiveAvgPool3d(output_size=1)
    )
  )
)

Now that we have loaded our pre-trained model, we will check inference with it. Since the model returns the detected class IDs, we download the ID to class label mapping for the Kinetics 400 dataset and load the mapping to a dict for later use.

CLASSNAMES_SOURCE = (
    "https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json"
)
CLASSNAMES_FILE = "kinetics_classnames.json"

download_file(url=CLASSNAMES_SOURCE, directory=DATA_DIR, show_progress=True)

# load from json
with open(DATA_DIR / CLASSNAMES_FILE, "r") as f:
    kinetics_classnames = json.load(f)

# load dict of id to class label mapping
kinetics_id_to_classname = {}
for k, v in kinetics_classnames.items():
    kinetics_id_to_classname[v] = str(k).replace('"', "")
data/kinetics_classnames.json: 0.00B [00:00, ?B/s]

Let us download a sample video to run inference on, and take a look at the downloaded video.

VIDEO_SOURCE = "https://dl.fbaipublicfiles.com/pytorchvideo/projects/archery.mp4"
VIDEO_NAME = "archery.mp4"
VIDEO_PATH = DATA_DIR / VIDEO_NAME

download_file(url=VIDEO_SOURCE, directory=DATA_DIR, show_progress=True)
Video(VIDEO_PATH, embed=True)
data/archery.mp4:   0%|          | 0.00/536k [00:00<?, ?B/s]