New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Deb, Prajwal, Zisserman
In this paper, we present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model's performance is assessed using the BOBSL dataset, revealing that the keypoint-based approach surpasses its RGB-based counterpart in computational efficiency and memory usage. Furthermore, it offers expedited training times and demands fewer computational resources. To the best of our knowledge, this is the inaugural application of a keypoint-based model for BSL word classification, rendering direct comparisons with existing works unavailable.

Basic Information

  • Paper ID: 2412.09475
  • Title: New keypoint-based approach for recognising British Sign Language (BSL) from sequences
  • Authors: Oishi Deb, KR Prajwal, Andrew Zisserman (Visual Geometry Group, University of Oxford)
  • Classification: cs.CV cs.AI
  • Publication Time/Conference: International Conference on Computer Vision (ICCV) - HANDS Workshop, 2023
  • Paper Link: https://arxiv.org/abs/2412.09475

Abstract

This paper proposes a novel keypoint-based classification model for recognizing British Sign Language (BSL) words in continuous sign language sequences. The model is evaluated on the BOBSL dataset, demonstrating that the keypoint-based approach surpasses RGB-based counterparts in computational efficiency and memory usage, while providing faster training times and requiring fewer computational resources. To the authors' knowledge, this is the first application of keypoint-based models to BSL word classification, making direct comparison with existing work infeasible.

Research Background and Motivation

Problem Definition

Sign language recognition is an important computer vision task aimed at automatically recognizing sign language words or phrases from video sequences. Traditional methods primarily rely on RGB video but suffer from high computational complexity and sensitivity to environmental factors.

Significance

  1. Social Impact: Enhancing accessibility for the deaf community and promoting inclusive communication
  2. Technical Challenges: Co-articulation phenomena in continuous sign language make the recognition task highly challenging
  3. Real-time Requirements: Practical applications require efficient models capable of real-time processing

Limitations of Existing Methods

  1. RGB Methods: High computational complexity, large memory footprint, lengthy training times
  2. Environmental Sensitivity: Susceptible to lighting, clothing, and other external factors
  3. Poor Real-time Performance: Difficult to meet real-time application requirements

Research Motivation

The authors propose using 2D keypoint representations to address these issues, based on three primary reasons:

  1. Controllability: Flexible selection of keypoint subsets to control computational costs
  2. Compactness: Elimination of interference from lighting and clothing, providing more compact representations
  3. Real-time Capability: Keypoints can be computed in real-time, supporting real-time model execution

Core Contributions

  1. Novel Application: First application of keypoint-based methods to BSL word classification
  2. Efficient Architecture: Proposes a Transformer-based architecture for processing keypoint sequences
  3. Computational Efficiency: Significantly reduces computational costs, memory usage, and training time compared to RGB methods
  4. Practical Value: Provides a more efficient and practical solution for sign language recognition

Methodology Details

Task Definition

  • Input: 2D keypoint representations of continuous BSL sign language video sequences
  • Output: Classification results for 8,162 BSL word categories
  • Constraints: Handling co-articulation phenomena and supporting real-time processing

Keypoint Extraction

Keypoints are extracted using the MediaPipe library:

  • Pose Keypoints: 33 points
  • Hand Keypoints: 21 points for each of left and right hands
  • Facial Keypoints: 468 points (reduced to 128 in the 203kp model)
  • Total: 543 keypoints (or 203 keypoints in the simplified version)
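The keypoint totals above follow from simple concatenation, which the sketch below illustrates with NumPy on random placeholder arrays. The function `assemble_frame` and the evenly spaced `FACE_SUBSET` are hypothetical: the summary does not say which 128 facial landmarks the 203-keypoint model keeps, so the subset here is only a stand-in.

```python
import numpy as np

# Landmark counts per frame as listed above (pose, each hand, face).
N_POSE, N_HAND, N_FACE = 33, 21, 468

def assemble_frame(pose, left_hand, right_hand, face, face_idx=None):
    """Stack per-part (n, 2) arrays into one (K, 2) array of 2D keypoints."""
    if face_idx is not None:
        face = face[face_idx]  # keep only the chosen facial landmarks
    return np.concatenate([pose, left_hand, right_hand, face], axis=0)

# Hypothetical subset: 128 evenly spaced face indices (NOT the paper's selection).
FACE_SUBSET = np.linspace(0, N_FACE - 1, 128).round().astype(int)

# Random placeholders standing in for real extracted landmarks.
rng = np.random.default_rng(0)
pose = rng.random((N_POSE, 2))
left = rng.random((N_HAND, 2))
right = rng.random((N_HAND, 2))
face = rng.random((N_FACE, 2))

full = assemble_frame(pose, left, right, face)
reduced = assemble_frame(pose, left, right, face, FACE_SUBSET)
print(full.shape, reduced.shape)  # (543, 2) (203, 2)
```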

Model Architecture

Input Representation

  • Extract keypoint sequences from 16 consecutive frames (based on findings that co-articulation lasts 13-20 frames)
  • Form a tensor of shape 16 × K × 2, where K is the number of keypoints per frame and 2 holds the (x, y) coordinates
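One way to obtain clips of this shape is to slide a 16-frame window over a longer keypoint sequence. The sketch below is a minimal illustration under that assumption; the helper `sliding_windows` and its `stride` parameter are hypothetical, and the summary does not state how the paper positions its windows.

```python
import numpy as np

WINDOW = 16  # frames per clip, as stated above

def sliding_windows(keypoints, window=WINDOW, stride=1):
    """Cut a (T, K, 2) keypoint sequence into clips of shape (window, K, 2)."""
    T = keypoints.shape[0]
    if T < window:
        # Sequence too short: return an empty batch with the right trailing shape.
        return np.empty((0, window) + keypoints.shape[1:], dtype=keypoints.dtype)
    starts = range(0, T - window + 1, stride)
    return np.stack([keypoints[s:s + window] for s in starts])

video = np.zeros((40, 543, 2), dtype=np.float32)  # 40 frames of dummy keypoints
clips = sliding_windows(video, stride=8)          # stride of 8 is an assumption
print(clips.shape)  # (4, 16, 543, 2)
```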

Transformer Architecture

  1. Tokenizer: Converts the input keypoint sequence into tokens for the encoder
  2. Positional Encoding: Adds positional information to distinguish sequence order
  3. Encoder: 6-layer encoder, each layer containing:
    • Multi-head self-attention mechanism (8 attention heads)
    • Position-wise feed-forward network
    • Layer normalization
  4. Generator: Converts learned representations into classification outputs

Attention Mechanism

  • Frame-wise Attention: Frame-level attention modeling
  • Trajectory-wise Attention: Trajectory-level attention modeling
  • Uses scaled dot-product attention mechanism
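For reference, scaled dot-product attention computes softmax(QKᵀ / √d_k)·V. The NumPy sketch below applies it across 8 heads of a 512-dimensional embedding, matching the sizes listed in this summary. It assumes each of the 16 frames is a single token and omits the learned query, key, and value projections, so it shows the mechanism only, not the paper's exact layer.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return softmax(q k^T / sqrt(d_k)) v and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
frames, d_model, heads = 16, 512, 8  # one token per frame is an assumption
x = rng.standard_normal((frames, d_model))

# Split the 512-dim embedding into 8 heads of 64 dims each: (heads, frames, 64).
xh = x.reshape(frames, heads, d_model // heads).transpose(1, 0, 2)
out, w = scaled_dot_product_attention(xh, xh, xh)
print(out.shape, w.shape)  # (8, 16, 64) (8, 16, 16)
```

Each row of `w` sums to 1, so every output token is a weighted average of the value vectors in its head.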

Technical Innovations

  1. Direct Keypoint Input: Unlike graph neural network-based methods, directly uses keypoints as Transformer input
  2. Temporal Modeling: Leverages Transformer's self-attention mechanism to capture long-range dependencies
  3. Multi-scale Keypoints: Explores different keypoint configurations to balance performance and efficiency
  4. Data Augmentation: Augmentation strategies designed for keypoints (translation, scaling, rotation, flipping)

Experimental Setup

Dataset

BOBSL Dataset:

  • Scale: 1,467 hours of BBC programming
  • Resolution: 444×444 pixels, 25fps
  • Vocabulary: 8,162 sign language words
  • Signers: 39 sign language interpreters
  • Training Set: 8,162 unique words, 3,555,141 frames
  • Validation Set: 3,348 words, 53,768 frames
  • Partition Strategy: Divided by signers to ensure no signer overlap between training, validation, and test sets
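A signer-disjoint partition can be sketched as below. This is only an illustration of the idea: the function `split_by_signer` and the signer IDs are hypothetical, and the dataset's actual assignment of signers to splits is not given in this summary.

```python
def split_by_signer(samples, val_signers, test_signers):
    """Assign each (signer_id, clip) sample to a split according to its signer."""
    splits = {"train": [], "val": [], "test": []}
    for signer_id, clip in samples:
        if signer_id in test_signers:
            splits["test"].append((signer_id, clip))
        elif signer_id in val_signers:
            splits["val"].append((signer_id, clip))
        else:
            splits["train"].append((signer_id, clip))
    return splits

# Placeholder signer IDs and clip names, for illustration only.
samples = [("s1", "clipA"), ("s2", "clipB"), ("s3", "clipC"), ("s1", "clipD")]
splits = split_by_signer(samples, val_signers={"s2"}, test_signers={"s3"})
print({name: len(items) for name, items in splits.items()})
# {'train': 2, 'val': 1, 'test': 1}
```

Because the split key is the signer rather than the clip, no signer can appear in more than one partition.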

Evaluation Metrics

  • Top-5 accuracy
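Top-5 accuracy counts a prediction as correct when the true class is among the five highest-scoring classes. A minimal NumPy sketch follows; the function name `top_k_accuracy` and the toy scores are illustrative.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # argpartition on negated scores places the k largest scores in the first k slots.
    topk = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

logits = np.array([
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6],  # top 5 classes: 1, 3, 5, 7, 6
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4],  # top 5 classes: 0, 2, 4, 6, 7
])
labels = np.array([6, 1])  # first label is in its top 5, second is not
print(top_k_accuracy(logits, labels))  # 0.5
```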

Implementation Details

  • Optimizer: Adam optimizer, learning rate 1e-4
  • Batch Size: 128
  • Early Stopping: Stops when validation loss shows no improvement for 3 consecutive epochs
  • Model Dimension: 512-dimensional embeddings
  • Parameter Count: 23.9 million parameters (vs. 34.5 million for RGB models)
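The early-stopping rule above can be sketched as a small helper with a patience of 3 epochs. The class `EarlyStopping` and the loss values are illustrative, not taken from the paper's code.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [2.0, 1.5, 1.6, 1.55, 1.7, 1.4]  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 4
```

Training halts at epoch 4 because epochs 2, 3, and 4 all fail to beat the best loss of 1.5.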

Experimental Results

Main Results

  • Accuracy: Top-5 accuracy reaches 60%
  • Parameter Efficiency: 30.7% reduction in parameters compared to RGB methods (23.9M vs 34.5M)
  • Computational Efficiency: Significantly reduces computational costs, memory usage, and training time
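The 30.7% figure follows directly from the two parameter counts, as this short check shows:

```python
rgb_params, kp_params = 34.5e6, 23.9e6  # parameter counts quoted above
reduction = (rgb_params - kp_params) / rgb_params
print(f"{reduction:.1%}")  # 30.7%
```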

Keypoint Quantity Comparison

  • 543 Keypoint Model: Uses 468 facial keypoints
  • 203 Keypoint Model: Uses 128 facial keypoints
  • Finding: Increasing facial keypoint quantity improves performance

Data Augmentation Effects

Multiple augmentation techniques were tested:

  1. Translation Augmentation: Provides the largest performance improvement
  2. Scaling Augmentation: Scaling within 90-110% range
  3. Rotation Augmentation: Small-angle rotation
  4. Horizontal Flipping: Mirror flipping

Each augmentation method independently improves model performance, with translation augmentation being most effective.
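The four augmentations can be sketched on a clip of shape (frames, K, 2) as below. Only the 90-110% scaling range comes from the summary. The normalised [0, 1] coordinates, the ±0.05 translation range, the ±10° rotation range, the 0.5 flip probability, and the choice to apply one transform to all frames of a clip are assumptions for illustration.

```python
import numpy as np

def translate(kp, dx, dy):
    """Shift all keypoints by (dx, dy)."""
    return kp + np.array([dx, dy])

def scale(kp, factor, center=(0.5, 0.5)):
    """Scale keypoints about a centre point."""
    c = np.array(center)
    return (kp - c) * factor + c

def rotate(kp, degrees, center=(0.5, 0.5)):
    """Rotate keypoints about a centre point."""
    c = np.array(center)
    t = np.deg2rad(degrees)
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return (kp - c) @ rot.T + c

def hflip(kp):
    """Mirror x coordinates; assumes x is normalised to [0, 1]."""
    out = kp.copy()
    out[..., 0] = 1.0 - out[..., 0]
    return out

def augment(clip, rng):
    """Apply one random transform of each kind to a whole (frames, K, 2) clip."""
    clip = translate(clip, *rng.uniform(-0.05, 0.05, size=2))  # assumed range
    clip = scale(clip, rng.uniform(0.9, 1.1))                  # 90-110%, as above
    clip = rotate(clip, rng.uniform(-10, 10))                  # assumed small angle
    if rng.random() < 0.5:                                     # assumed probability
        clip = hflip(clip)
    return clip

rng = np.random.default_rng(0)
clip = rng.random((16, 543, 2))
print(augment(clip, rng).shape)  # (16, 543, 2)
```

This sketch does not swap the left-hand and right-hand keypoint slots after flipping; whether that is needed depends on how the model treats handedness.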

Experimental Findings

  1. Facial keypoints are crucial for BSL recognition
  2. Keypoint-based methods significantly reduce computational costs while maintaining reasonable accuracy
  3. Data augmentation techniques are also effective for keypoint models

Related Work

BSL Recognition Research

  • Previous work primarily uses RGB video for BSL recognition
  • Focus on co-articulation and lip pattern recognition
  • This paper presents the first pure keypoint-based method

Keypoint Representation Research

  • Evolution from hand-crafted feature engineering to deep learning methods (CNNs)
  • Application of graph neural networks (GNNs) in action and gesture recognition
  • Successful application of Transformer architecture in computer vision

Technical Comparison

This paper feeds keypoints directly into a Transformer, in contrast to methods that first construct a graph over the keypoints for a graph neural network.

Conclusions and Discussion

Main Conclusions

  1. Keypoint-based methods demonstrate significant computational advantages in BSL recognition
  2. Transformer architecture effectively processes keypoint sequences
  3. Facial keypoints are critical for BSL recognition performance
  4. Appropriate data augmentation further improves model performance

Limitations

  1. Accuracy: Top-5 accuracy of 60% leaves room for improvement
  2. Missing Comparisons: As the first keypoint-based method, lacks direct comparison baselines
  3. Dataset Limitations: Validated only on the BOBSL dataset
  4. Real-time Verification: Lacks actual real-time performance testing

Future Directions

  1. Multimodal Fusion: Combining keypoints and RGB images to improve accuracy
  2. 3D Pose Estimation: Exploring sequence-level 3D pose estimation techniques
  3. Skeleton Images: Attempting black-and-white skeleton image representations based on keypoints
  4. Larger-scale Validation: Validating the method on more sign language datasets

In-depth Evaluation

Strengths

  1. Strong Innovation: First application of pure keypoint-based methods to BSL recognition
  2. High Practical Value: Significantly reduces computational costs, suitable for resource-constrained environments
  3. Sound Methodology: Clear technical approach with complete implementation details
  4. Comprehensive Experiments: Includes comparative experiments with multiple configurations and augmentation strategies

Weaknesses

  1. Limited Performance: Top-5 accuracy of 60% is relatively low
  2. Lack of Comparisons: Cannot directly compare with other methods
  3. Insufficient Analysis: Lacks in-depth analysis of failure cases
  4. Unknown Generalization: Validated only on a single dataset

Impact

  1. Pioneering: Provides a new technical pathway for sign language recognition
  2. Practicality: Efficient methods facilitate real-world deployment
  3. Extensibility: Provides a solid foundation for subsequent research
  4. Social Value: Contributes to improving technological accessibility for the deaf community

Applicable Scenarios

  1. Resource-constrained Environments: Mobile devices, edge computing scenarios
  2. Real-time Applications: Interactive systems requiring rapid response
  3. Large-scale Deployment: Scenarios requiring processing large volumes of video data
  4. Research Prototypes: As a foundational component for more complex systems

References

The paper cites multiple important related works, including:

  • BOBSL dataset-related papers [3]
  • MediaPipe keypoint extraction framework [13]
  • Original Transformer architecture paper [18]
  • Sign language recognition research [1, 2, 6]
  • Application of graph neural networks in action recognition [21]

Overall Assessment: This is a pioneering paper that, to the authors' knowledge, is the first to apply keypoint-based methods to BSL word classification. While its 60% top-5 accuracy leaves room for improvement, its advantages in parameter count, memory usage, and training time give it practical value. The work opens a new research direction for sign language recognition, with particular relevance to resource-constrained and real-time application scenarios.