
New keypoint-based approach for recognising British Sign Language (BSL) from sequences

Deb, Prajwal, Zisserman

In this paper, we present a novel keypoint-based classification model designed to recognise British Sign Language (BSL) words within continuous signing sequences. Our model's performance is assessed using the BOBSL dataset, revealing that the keypoint-based approach surpasses its RGB-based counterpart in computational efficiency and memory usage. Furthermore, it offers expedited training times and demands fewer computational resources. To the best of our knowledge, this is the inaugural application of a keypoint-based model for BSL word classification, rendering direct comparisons with existing works unavailable.


Basic Information

Paper ID: 2412.09475
Title: New keypoint-based approach for recognising British Sign Language (BSL) from sequences
Authors: Oishi Deb, KR Prajwal, Andrew Zisserman (Visual Geometry Group, University of Oxford)
Classification: cs.CV cs.AI
Publication Time/Conference: International Conference on Computer Vision (ICCV) - HANDS Workshop, 2023
Paper Link: https://arxiv.org/abs/2412.09475

Abstract

This paper proposes a novel keypoint-based classification model for recognizing British Sign Language (BSL) words in continuous sign language sequences. The model is evaluated on the BOBSL dataset, demonstrating that the keypoint-based approach surpasses RGB-based counterparts in computational efficiency and memory usage, while providing faster training times and requiring fewer computational resources. To the authors' knowledge, this is the first application of keypoint-based models to BSL word classification, making direct comparison with existing work infeasible.

Research Background and Motivation

Problem Definition

Sign language recognition is an important computer vision task aimed at automatically recognizing sign language words or phrases from video sequences. Traditional methods primarily rely on RGB video but suffer from high computational complexity and sensitivity to environmental factors.

Significance

Social Impact: Enhancing accessibility for the deaf community and promoting inclusive communication
Technical Challenges: Co-articulation phenomena in continuous sign language make the recognition task highly challenging
Real-time Requirements: Practical applications require efficient models capable of real-time processing

Limitations of Existing Methods

RGB Methods: High computational complexity, large memory footprint, lengthy training times
Environmental Sensitivity: Susceptible to lighting, clothing, and other external factors
Poor Real-time Performance: Difficult to meet real-time application requirements

Research Motivation

The authors propose using 2D keypoint representations to address these issues, for three primary reasons:

Controllability: Flexible selection of keypoint subsets to control computational costs
Compactness: Elimination of interference from lighting and clothing, providing more compact representations
Real-time Capability: Keypoints can be computed in real-time, supporting real-time model execution

Core Contributions

Novel Application: First application of keypoint-based methods to BSL word classification
Efficient Architecture: Proposes a Transformer-based architecture for processing keypoint sequences
Computational Efficiency: Significantly reduces computational costs, memory usage, and training time compared to RGB methods
Practical Value: Provides a more efficient and practical solution for sign language recognition

Methodology Details

Task Definition

Input: 2D keypoint representations of continuous BSL sign language video sequences
Output: Classification results for 8,162 BSL word categories
Constraints: Handling co-articulation phenomena and supporting real-time processing

Keypoint Extraction

Keypoints are extracted using the MediaPipe library:

Pose Keypoints: 33 points
Hand Keypoints: 21 points for each of left and right hands
Facial Keypoints: 468 points (reduced to 128 in the 203kp model)
Total: 543 keypoints (or 203 keypoints in the simplified version)
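The two totals follow directly from the per-part counts listed above. A minimal sketch of that arithmetic; the dictionary and function names are illustrative and do not come from the paper:

```python
# Per-part landmark counts as listed above (MediaPipe pose, hands, face).
FULL_MODEL = {"pose": 33, "left_hand": 21, "right_hand": 21, "face": 468}

# The reduced configuration keeps only 128 of the facial landmarks.
REDUCED_MODEL = {**FULL_MODEL, "face": 128}


def total_keypoints(parts: dict[str, int]) -> int:
    """Sum the number of keypoints across all body parts."""
    return sum(parts.values())


print(total_keypoints(FULL_MODEL))     # 33 + 21 + 21 + 468 = 543
print(total_keypoints(REDUCED_MODEL))  # 33 + 21 + 21 + 128 = 203
```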

Model Architecture

Input Representation

Extract keypoint sequences from 16 consecutive frames (based on findings that co-articulation lasts 13-20 frames)
Form a three-dimensional tensor of shape 16 × K × 2, where K is the number of keypoints per frame and the last dimension holds the 2D (x, y) coordinates
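To make the 16 × K × 2 input shape concrete, the sketch below slices a longer keypoint sequence into fixed-length windows with NumPy. The function name and the `stride` parameter are assumptions made for illustration; the source does not say how densely windows are sampled.

```python
import numpy as np

WINDOW = 16  # frames per input window, as described above


def make_windows(keypoints: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slice a (T, K, 2) keypoint sequence into (N, 16, K, 2) windows.

    `stride` is a hypothetical parameter; the source does not state
    how densely windows are sampled.
    """
    num_frames = keypoints.shape[0]
    if num_frames < WINDOW:
        raise ValueError("sequence is shorter than one window")
    starts = range(0, num_frames - WINDOW + 1, stride)
    return np.stack([keypoints[s:s + WINDOW] for s in starts])


# Dummy sequence: 40 frames, 543 keypoints, (x, y) per keypoint.
sequence = np.zeros((40, 543, 2), dtype=np.float32)
windows = make_windows(sequence, stride=8)
print(windows.shape)  # (4, 16, 543, 2)
```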

Transformer Architecture

Tokenizer: Converts the input keypoint data into a sequence of tokens
Positional Encoding: Adds positional information to distinguish sequence order
Encoder: 6-layer encoder, each layer containing:
- Multi-head self-attention mechanism (8 attention heads)
- Position-wise feed-forward network
- Layer normalization
Generator: Converts learned representations into classification outputs
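As a rough feel for the size of such an encoder stack, the sketch below tallies the weights of a standard Transformer encoder layer and a linear classification head. The feed-forward width of 2048 is an assumption borrowed from the original Transformer and is not stated in this summary; the 512-dimensional embedding comes from the Implementation Details section. The tokenizer and any embedding parameters are left out, so the result is not expected to match the reported parameter count. Note that the number of attention heads does not affect the tally, since the heads share the same projection matrices.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Count weights and biases in one standard Transformer encoder layer."""
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                 # scale and shift, two norms
    return attention + feed_forward + layer_norms


D_MODEL = 512        # embedding size reported in the write-up
NUM_LAYERS = 6       # encoder depth reported in the write-up
NUM_CLASSES = 8162   # BSL word vocabulary
D_FF = 2048          # ASSUMPTION: not stated in the source

encoder = NUM_LAYERS * encoder_layer_params(D_MODEL, D_FF)
classifier = D_MODEL * NUM_CLASSES + NUM_CLASSES   # linear "generator" head
print(f"encoder: {encoder:,}  classifier: {classifier:,}")
```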

Attention Mechanism

Frame-wise Attention: Frame-level attention modeling
Trajectory-wise Attention: Trajectory-level attention modeling
Uses scaled dot-product attention mechanism
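Scaled dot-product attention computes softmax(Q K^T / sqrt(d_k)) V. The NumPy sketch below shows that operation for a single head; treating the 16 frames as 16 tokens and using 64 dimensions per head (512 / 8) are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Return softmax(q k^T / sqrt(d_k)) v and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))  # 16 frame tokens, 64 dims per head (assumed)
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape, weights.shape)  # (16, 64) (16, 16)
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors.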

Technical Innovations

Direct Keypoint Input: Unlike graph neural network-based methods, directly uses keypoints as Transformer input
Temporal Modeling: Leverages Transformer's self-attention mechanism to capture long-range dependencies
Multi-scale Keypoints: Explores different keypoint configurations to balance performance and efficiency
Data Augmentation: Augmentation strategies designed for keypoints (translation, scaling, rotation, flipping)

Experimental Setup

Dataset

BOBSL Dataset:

Scale: 1,467 hours of BBC programming
Resolution: 444×444 pixels, 25 fps
Vocabulary: 8,162 sign language words
Signers: 39 sign language interpreters
Training Set: 8,162 unique words, 3,555,141 frames
Validation Set: 3,348 words, 53,768 frames
Partition Strategy: Divided by signers to ensure no signer overlap between training, validation, and test sets
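A signer-disjoint partition can be sketched as follows: each clip is assigned to a split purely by who signs it, so no signer appears in more than one split. The identifiers and the function name are made up for illustration and say nothing about how BOBSL itself is organised.

```python
from collections import defaultdict


def split_by_signer(clips, val_signers, test_signers):
    """Assign each (signer_id, clip_id) pair to a split based on its signer only."""
    splits = defaultdict(list)
    for signer_id, clip_id in clips:
        if signer_id in test_signers:
            splits["test"].append(clip_id)
        elif signer_id in val_signers:
            splits["val"].append(clip_id)
        else:
            splits["train"].append(clip_id)
    return dict(splits)


# Made-up signer and clip identifiers, purely for illustration.
clips = [("s1", "a"), ("s1", "b"), ("s2", "c"), ("s3", "d"), ("s2", "e")]
splits = split_by_signer(clips, val_signers={"s2"}, test_signers={"s3"})
print(splits)  # {'train': ['a', 'b'], 'val': ['c', 'e'], 'test': ['d']}
```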

Evaluation Metrics

Top-5 accuracy
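Top-5 accuracy counts a prediction as correct when the true word is among the five highest-scoring classes. A small NumPy sketch of the metric, with toy scores over seven classes rather than the full vocabulary:

```python
import numpy as np


def top_k_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    # argpartition places the k largest scores in the last k columns (unordered).
    top_k = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


logits = np.array([
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6],   # label 0 has the lowest score: miss
    [0.9, 0.1, 0.3, 0.2, 0.8, 0.7, 0.6],   # label 0 has the highest score: hit
])
labels = np.array([0, 0])
print(top_k_accuracy(logits, labels, k=5))  # 0.5
```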

Implementation Details

Optimizer: Adam optimizer, learning rate 1e-4
Batch Size: 128
Early Stopping: Stops when validation loss shows no improvement for 3 consecutive epochs
Model Dimension: 512-dimensional embeddings
Parameter Count: 23.9 million parameters (vs. 34.5 million for RGB models)
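The early-stopping rule above (patience of 3 epochs on validation loss) can be sketched in a framework-independent way. The class name and the loss values are illustrative only:

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)
history = [2.10, 1.80, 1.75, 1.76, 1.78, 1.77, 1.60]  # made-up losses
stopped_at = next((epoch for epoch, loss in enumerate(history) if stopper.step(loss)), None)
print(stopped_at)  # 5
```

Training halts at epoch index 5, after three epochs without beating the best loss of 1.75; the later value of 1.60 is never reached.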

Experimental Results

Main Results

Accuracy: Top-5 accuracy reaches 60%
Parameter Efficiency: 30.7% reduction in parameters compared to RGB methods (23.9M vs 34.5M)
Computational Efficiency: Significantly reduces computational costs, memory usage, and training time
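The 30.7% figure is the relative difference between the two parameter counts. A short check of that arithmetic (the helper name is illustrative):

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


rgb_params_m = 34.5       # millions of parameters, RGB model
keypoint_params_m = 23.9  # millions of parameters, keypoint model
print(f"{relative_reduction(rgb_params_m, keypoint_params_m):.1f}%")  # 30.7%
```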

Keypoint Quantity Comparison

543 Keypoint Model: Uses 468 facial keypoints
203 Keypoint Model: Uses 128 facial keypoints
Finding: Increasing the number of facial keypoints improves performance

Data Augmentation Effects

Multiple augmentation techniques were tested:

Translation Augmentation: Provides the largest performance improvement
Scaling Augmentation: Scaling within 90-110% range
Rotation Augmentation: Small-angle rotation
Horizontal Flipping: Mirror flipping

Each augmentation method independently improves model performance, with translation augmentation being most effective.
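The four augmentations are simple geometric transforms on the (x, y) coordinates. The sketch below applies one random transform consistently to every frame in a window. It assumes coordinates normalised to [0, 1]; the shift range, rotation range, and flip probability are hypothetical values, and only the 90-110% scaling range comes from the text above.

```python
import numpy as np


def augment(window, rng, max_shift=0.05, max_angle_deg=10.0, flip_prob=0.5):
    """Apply random translation, scaling, rotation and flipping to (T, K, 2) keypoints.

    Coordinates are assumed normalised to [0, 1]. `max_shift`, `max_angle_deg`
    and `flip_prob` are hypothetical values; only the 90-110% scaling range
    comes from the text above.
    """
    centre = np.array([0.5, 0.5])
    scale = rng.uniform(0.9, 1.1)
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    shift = rng.uniform(-max_shift, max_shift, size=2)

    # Scale and rotate about the frame centre, then translate.
    out = (window - centre) @ rotation.T * scale + centre + shift
    if rng.random() < flip_prob:
        out[..., 0] = 1.0 - out[..., 0]  # mirror the x coordinate
    return out


rng = np.random.default_rng(0)
window = rng.uniform(0.0, 1.0, size=(16, 203, 2))
augmented = augment(window, rng)
print(augmented.shape)  # (16, 203, 2)
```

A fuller implementation of flipping would probably also swap the left-hand and right-hand keypoint indices so that the mirrored signer stays anatomically consistent; that step is omitted here.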

Experimental Findings

Facial keypoints are crucial for BSL recognition
Keypoint-based methods significantly reduce computational costs while maintaining reasonable accuracy
Data augmentation techniques are also effective for keypoint models

Related Work

BSL Recognition Research

Previous work primarily uses RGB video for BSL recognition
Focus on co-articulation and lip pattern recognition
This paper presents the first pure keypoint-based method

Keypoint Representation Research

Evolution from hand-crafted feature engineering to deep learning methods (CNNs)
Application of graph neural networks (GNNs) in action and gesture recognition
Successful application of Transformer architecture in computer vision

Technical Comparison

This paper feeds keypoints directly into a Transformer, which distinguishes it from methods that first construct a graph over the keypoints for a graph neural network.

Conclusions and Discussion

Main Conclusions

Keypoint-based methods demonstrate significant computational advantages in BSL recognition
Transformer architecture effectively processes keypoint sequences
Facial keypoints are critical for BSL recognition performance
Appropriate data augmentation further improves model performance

Limitations

Accuracy: 60% top-5 accuracy leaves room for improvement
Missing Comparisons: As the first keypoint-based method, lacks direct comparison baselines
Dataset Limitations: Validated only on the BOBSL dataset
Real-time Verification: Lacks actual real-time performance testing

Future Directions

Multimodal Fusion: Combining keypoints and RGB images to improve accuracy
3D Pose Estimation: Exploring sequence-level 3D pose estimation techniques
Skeleton Images: Attempting black-and-white skeleton image representations based on keypoints
Larger-scale Validation: Validating the method on more sign language datasets

In-depth Evaluation

Strengths

Strong Innovation: First application of pure keypoint-based methods to BSL recognition
High Practical Value: Significantly reduces computational costs, suitable for resource-constrained environments
Sound Methodology: Clear technical approach with complete implementation details
Comprehensive Experiments: Includes comparative experiments with multiple configurations and augmentation strategies

Weaknesses

Limited Performance: 60% top-5 accuracy is relatively low
Lack of Comparisons: Cannot directly compare with other methods
Insufficient Analysis: Lacks in-depth analysis of failure cases
Unknown Generalization: Validated only on a single dataset

Impact

Pioneering: Provides a new technical pathway for sign language recognition
Practicality: Efficient methods facilitate real-world deployment
Extensibility: Provides a solid foundation for subsequent research
Social Value: Contributes to improving technological accessibility for the deaf community

Applicable Scenarios

Resource-constrained Environments: Mobile devices, edge computing scenarios
Real-time Applications: Interactive systems requiring rapid response
Large-scale Deployment: Scenarios requiring processing large volumes of video data
Research Prototypes: As a foundational component for more complex systems

References

The paper cites multiple important related works, including:

BOBSL dataset-related papers [3]
MediaPipe keypoint extraction framework [13]
Original Transformer architecture paper [18]
Sign language recognition research [1, 2, 6]
Application of graph neural networks in action recognition [21]

Overall Assessment: This is a pioneering paper that applies keypoint-based methods to BSL recognition for the first time. While there is room for improvement in accuracy, its significant advantages in computational efficiency provide important practical value. This work provides a new research direction for the sign language recognition field, with particular significance in resource-constrained and real-time application scenarios.