Path Signature Methodology for Landmark-based Human Action Recognition

This notebook is based on the following paper, which is available as an arXiv preprint:

Weixin Yang, Terry Lyons, Hao Ni, Cordelia Schmid, Lianwen Jin, “Developing the Path Signature Methodology and its Application to Landmark-based Human Action Recognition”, arXiv preprint arXiv:1707.03993v2, 2019.

Human action recognition is a challenging task in computer vision with many potential applications, including human-computer interaction, video understanding, video surveillance and behavioural analysis. We are given a short video clip showing one person (or possibly several) performing exactly one action, and the task is to output the action performed in the clip.

In this notebook we give an introduction to the methodology developed in the above paper. We begin by explaining the landmark-based approach and why it is desirable. We then show how to generate a feature set based on the paper and train a simple classifier on the Joint-annotated Human Motion Data Base (JHMDB) dataset.

First, we set up the coding environment.

Set up the Notebook

Install Dependencies

The dependencies for this notebook are installed in two steps: first the general requirements, then PyTorch.

import sys
!{sys.executable} -m pip install -r requirements.txt

For the learning we are going to use PyTorch. The correct version of PyTorch to install depends on your hardware, operating system and CUDA version, so we recommend installing PyTorch manually, following the official instructions. This notebook was developed and tested using PyTorch version 1.5.0; instructions for installing this version can be found on the PyTorch website. If you don’t want to install PyTorch manually, you can uncomment the second line in the following cell and run it, and the notebook will attempt to install PyTorch v1.5.0 for you. We cannot guarantee, however, that you will end up with a build that matches your system.

# Uncomment the following line to try to install torch via the extra requirements file (not recommended):
# !{sys.executable} -m pip install -r requirements_torch.txt
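Because the right PyTorch build depends on the system, it can be useful to check which version, if any, is already installed before running either install cell. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own and is not part of this notebook's code.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


torch_version = installed_version("torch")
if torch_version is None:
    print("PyTorch is not installed yet.")
elif not torch_version.startswith("1.5."):
    # The notebook was developed with 1.5.0; other versions may still work.
    print(f"Found PyTorch {torch_version}; this notebook was tested with 1.5.0.")
else:
    print(f"Found PyTorch {torch_version}.")
```

Querying the package metadata avoids importing `torch` itself, so the check also works when the installation is missing or broken.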

Import Packages

import os.path
import sys
import zipfile

from tqdm.notebook import tqdm, trange
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import torch

from datasetloader import JHMDB
from psfdataset import PSFDataset, transforms

import util

Download and Load the Dataset

DATASET_PATH = os.path.join('datasets', 'jhmdb_dataset')

ZIP_URLS = (('', 'videos'),
            ('', 'splits'),
            ('', 'sub_splits'),
            ('', 'joint_positions'))

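import urllib.request

for zip_url, target_dir in ZIP_URLS:
    zip_path = os.path.join(DATASET_PATH, zip_url.split('/')[-1])
    target_dir = os.path.join(DATASET_PATH, target_dir)

    if not os.path.exists(target_dir) and not os.path.exists(zip_path):
        try:
            os.makedirs(os.path.dirname(zip_path), exist_ok=True)
            # NOTE: the original download call is truncated in this export;
            # urllib.request.urlretrieve is used as a stand-in (assumption).
            urllib.request.urlretrieve(zip_url, zip_path)

            # NOTE: the extraction targets below are assumptions; the original
            # extractall calls are missing from this export.
            if os.path.basename(target_dir) != 'sub_splits':
                with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                    zip_ref.extractall(DATASET_PATH)
            else:
                with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                    zip_ref.extractall(target_dir)

            if os.path.basename(target_dir) == 'videos':
                os.rename(os.path.join(DATASET_PATH, 'ReCompress_Videos'), os.path.join(DATASET_PATH, 'videos'))
        finally:
            # remove the archive whether or not download and extraction succeeded
            if os.path.exists(zip_path):
                os.remove(zip_path)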
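# load the raw data
# NOTE: the line constructing the dataset is missing from this export; the
# constructor call below is an assumption about the datasetloader API.
dataset = JHMDB(DATASET_PATH)
dataset.set_cols("video-filename", "keypoints2D")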

Landmark-based Human Action Recognition

In landmark-based human action recognition, rather than using RGB video data directly, we use a set of points, the landmarks, which describe the positions of some of the major joints or parts of the human body, such as shoulders, hips and knees. One can connect these landmarks to form a skeleton figure, as shown in the following animation:

sample = dataset[200]
util.display_animation(sample["video-filename"], sample["keypoints2D"], include_source_video=False)

Note that the connections between landmarks are only for visualisation and we do not make use of them in our action recognition approach. Approaches which do make use of these connections are generally referred to as skeleton-based human action recognition. We choose a landmark-based approach as it ensures that the method is applicable to other landmark data where we may not understand the interactions between different landmarks as well as we do for the human body.

The landmark-based approach has multiple technical advantages:

  • Dimensionality reduction: We commonly use between 15 and 25 landmarks in 2 or 3 dimensions. This means a person in a single frame is described by a 30-75 dimensional vector. This is a much smaller input vector for our model compared to an RGB image.

  • Extract and separate human movement from other information: Notice how in the above animation you can easily tell what the person is doing. This suggests that the landmark data retains the information about human movement needed for action recognition, while discarding other information contained in the visual scene.
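To make the dimensionality argument above concrete, the following sketch compares the number of values per frame in a landmark representation with that of a raw RGB frame. The function names and the 320x240 frame resolution are assumptions chosen for illustration only.

```python
def landmark_vector_size(num_landmarks, num_coords):
    """Number of values describing one person in one frame."""
    return num_landmarks * num_coords


def rgb_frame_size(width, height, channels=3):
    """Number of values in one raw RGB frame."""
    return width * height * channels


# Range quoted in the text: 15 to 25 landmarks in 2 or 3 dimensions.
smallest = landmark_vector_size(15, 2)   # 30
largest = landmark_vector_size(25, 3)    # 75

# Assumed frame resolution, chosen only for illustration.
frame_values = rgb_frame_size(320, 240)  # 230400

print(f"Landmark vector: {smallest} to {largest} values per frame")
print(f"RGB frame:       {frame_values} values per frame")
print(f"Reduction factor: at least {frame_values // largest}x")
```

Even with the largest landmark configuration, the per-frame input in this example is more than three orders of magnitude smaller than the raw image.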

Moreover, using landmarks rather than RGB data has a practical advantage which is crucial for many real-world applications:

  • De-identification: Using landmarks avoids retaining personally identifiable information contained in the source video. In many applications, protecting people’s identities is either a regulatory requirement, or crucial for ethical use of the technology.

Landmark data has become increasingly available through commercial depth camera systems such as the Microsoft Kinect, as well as through advances in human pose estimation research, which have led to free pose estimation tools such as OpenPose and AlphaPose. While these tools have increased the availability of pose data and made applications outside a lab environment possible, their output often still suffers from noise, which makes the action recognition task harder. Most pose estimation software outputs a confidence score for its prediction of each landmark coordinate. The methodology presented in this notebook is general enough to include these confidence scores as extra dimensions of the data. Using the confidence scores, a model trained on landmarks obtained by pose estimation can reach almost the same accuracy as one trained on noise-free ground truth landmarks. The experimental results for this can be found in the above paper.
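As an illustration of treating confidence scores as extra dimensions, the sketch below appends one score per landmark as a third coordinate. The array shapes and the random data are hypothetical stand-ins for pose estimator output, and this is not necessarily how the paper encodes the scores.

```python
import numpy as np

# Hypothetical pose-estimator output for one clip:
# 4 frames, 15 landmarks, 2D coordinates plus one confidence score per landmark.
num_frames, num_landmarks = 4, 15
rng = np.random.default_rng(0)
keypoints = rng.random((num_frames, num_landmarks, 2))   # (x, y) per landmark
confidence = rng.random((num_frames, num_landmarks))     # score in [0, 1)

# Append the confidence as a third coordinate of every landmark.
keypoints_with_conf = np.concatenate(
    [keypoints, confidence[..., np.newaxis]], axis=-1
)

# Each frame is now described by 15 * 3 = 45 values instead of 15 * 2 = 30.
per_frame = keypoints_with_conf.reshape(num_frames, -1)
print(keypoints_with_conf.shape)  # (4, 15, 3)
print(per_frame.shape)            # (4, 45)
```

Keeping the score next to the coordinates it belongs to means any later processing of a landmark's trajectory sees the estimator's uncertainty alongside the position.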

The dataset used in this notebook, the Joint-annotated Human Motion Data Base (JHMDB), provides a collection of short video clips extracted from YouTube. Each clip represents a single human action and is labelled accordingly. Moreover, each clip in the dataset includes human-sourced ‘ground truth’ landmarks representing joint locations. Note that in the aforementioned paper by Yang et al., input features are constructed both from estimated poses and from ground truth landmark data. The authors observe that, provided that confidence scores are used, pose estimation techniques yield performance competitive with that of ground truth landmark data. For simplicity, we use ground truth landmark data as input features in this notebook.

Here are a few more examples with a side-by-side view of the original RGB video we see and the landmark representation the classifier sees:

sample = dataset[50]
util.display_animation(sample["video-filename"], sample["keypoints2D"])