American Sign Language Translation Approach Using Machine Learning
Encoder-Decoder Seq2Seq Model & Convolution 3D Classification Model
Table of Contents
- Models & Results
- Discussion & Conclusions
Why & What
Although there are many language translation applications on the market, very few involve sign language. People who use sign language may be deaf or have hearing or speech difficulties. They may seem like a minority group, but roughly half a billion people worldwide suffer from hearing disability, not to mention their relatives, friends, and coworkers. It is quite a potential market, yet few people pay attention to it.
Some sign language translation applications are on the market, but they are limited to word-to-word translation. Seeing this need, we want to offer a more straightforward solution for sign language translation using machine learning algorithms.
Sign languages are languages that use the visual-gestural modality of a person's hands, face, and body to deliver a message. Since there is no universal sign language, we will focus on the one English speakers use most — American Sign Language (ASL). ASL is rooted in French Sign Language and has five features. They are:
(1) Hand shape,
(2) Palm orientation,
(3) Hand movement,
(4) Hand location, and
(5) Gesture features like facial expression and posture.
The lack of a written form has led people to use different representation systems to record sign language. Sign language notation, developed after the 1990s, uses drawn symbols as a "written form" of signs. A gloss is the written translation of a sign language motion; it could be a word or the meaning of a sentence. Given these features of sign language, most translators on the market focus on Pose-to-Gloss. What we would like to do is focus on Video-to-Text, which is a very challenging task.
How2Sign is an open-source multimodal and multiview continuous American Sign Language dataset with annotations. All the videos have sentence-level alignment.
- 80 hours of sign language videos and corresponding English transcriptions
- 31 GB training, 1.7 GB validation, 2 GB test
Tools we used
- Mediapipe & OpenCV — to process the video data
- PyTorch — to build the model
- EC2, AWS — to train the model
- Recurrent Neural Networks (RNN) are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence.
- Long Short-Term Memory (LSTM) networks are a kind of RNN designed to avoid the long-term dependency problem. An LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
- The Sequence-to-Sequence (Seq2Seq) model is typically used to solve complex language problems and is based on the encoder-decoder architecture.
- In the Encoder-Decoder architecture, both the input and output are variable-length sequences. The encoder reads the input sequence and summarizes the information into internal state vectors, or context vectors. The decoder's initial states are initialized with the final states of the encoder LSTM.
- Convolution 3D (Conv3D) neural networks have been used broadly in video classification, action recognition, and gesture recognition problems. Previous studies show that Conv3D effectively detects spatiotemporal features in sequences of medical images, short clips, and videos. Conv3D is based on Conv2D but adds one dimension to represent sequence or time. At each convolution step, the Conv3D CNN creates a 3D feature map whose output shape is a 3D volume.
After convolution, a pooling layer is added as a downsampling technique. A max-pooling layer transforms a tensor by replacing each of its subtensors with that subtensor's maximum element value.
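To make the shapes concrete, here is a minimal PyTorch sketch of a Conv3D layer followed by max pooling; the channel counts and kernel sizes are illustrative, not the ones used in our model:

```python
import torch
import torch.nn as nn

# A batch of one 16-frame RGB clip: (batch, channels, frames, height, width)
x = torch.randn(1, 3, 16, 64, 64)

conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool3d(kernel_size=2)  # halves the time and spatial dimensions

y = pool(conv(x))
print(y.shape)  # torch.Size([1, 8, 8, 32, 32]) -- a 3D volume of feature maps
```

Padding of 1 with a kernel of 3 preserves the input dimensions, so only the pooling layer shrinks the volume here.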
- Data Analysis
- Text Data Transformation
- Video-Processing and Transformation
- Modeling & Training
Exploratory Data Analysis
The original text data was labeled as shown in the following table.
After text cleaning, removing stop words, and lemmatization, we ranked the most frequently used words in the dataset: going, want, get, like, and one. In total, the dataset contains 9,546 words.
Our simple sentiment analysis shows that most of the content is positive or neutral, and very little of it has a negative tone. We then analyzed the length of each sentence. The following image shows the distribution of sentence lengths. The average sentence length is 7.9 words, with a standard deviation of 5.6 words. Thus, 75% of the 31,165 videos have fewer than 10 words, and very few have a large number of words in one sentence.
We used MediaPipe to capture frames from each video and extract landmarks as keypoints: 554 landmarks × 3 coordinates (x, y, z) = 1,662 keypoint values per frame.
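A minimal sketch of the per-frame flattening, assuming MediaPipe Holistic results in which pose landmarks also carry a visibility value (one common breakdown that likewise arrives at 1,662 values per frame); the helper names here are ours:

```python
import numpy as np

def flatten_landmarks(landmarks, count, dims=3):
    """Flatten a MediaPipe landmark list into a 1-D array; zero-fill when the
    body part was not detected in this frame."""
    if landmarks is None:
        return np.zeros(count * dims)
    vals = []
    for lm in landmarks.landmark:
        vals.extend([lm.x, lm.y, lm.z] + ([lm.visibility] if dims == 4 else []))
    return np.array(vals)

def extract_keypoints(results):
    """results is the output of mediapipe.solutions.holistic.Holistic.process()."""
    pose = flatten_landmarks(results.pose_landmarks, 33, dims=4)  # 132 values
    face = flatten_landmarks(results.face_landmarks, 468)         # 1,404 values
    lh = flatten_landmarks(results.left_hand_landmarks, 21)       # 63 values
    rh = flatten_landmarks(results.right_hand_landmarks, 21)      # 63 values
    return np.concatenate([pose, face, lh, rh])                   # 1,662 values
```

Zero-filling missing parts keeps every frame's vector the same length even when, say, a hand leaves the camera's view.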
After that, we conducted simple text processing and labeled each video with its sentence. We tokenized the text with spaCy and added a start (<SOS>) and end (<EOS>) token to every sentence.
def tokenize_en(text):
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]
Then we built our vocabulary (bag of words) from the dataset, which includes around 9,000 words.
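The vocabulary build can be sketched with Python's standard library; the special tokens and the `min_freq` cutoff here are our assumptions:

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=1):
    """Map each token to an integer id, reserving low ids for special tokens."""
    counter = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    for tok, freq in counter.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab([["i", "love", "you"], ["she", "loves", "me"]])
```

Unknown words at inference time map to `<UNK>`, and `<PAD>` lets variable-length sentences share one batch tensor.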
3. Models & Results
There are two models in this project: an Encoder-Decoder Seq2Seq model and a Conv3D classification model.
Encoder-Decoder Seq2Seq Model
In our original design, we extracted 1,662 keypoints from videos as our input and used the Encoder-Decoder model with the layers of LSTM recurrent units to translate the sign language from motion sequence to English sequence.
Our architecture adds an attention layer between the encoder and decoder. The attention layer uses mathematical methods to let the model focus on the relevant features of the input sequence as needed and filter out the rest.
We built the model in PyTorch following this architecture.
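As a minimal illustrative sketch of the encoder and decoder (hypothetical layer sizes and simple dot-product attention, not our exact implementation):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim=1662, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):  # x: (batch, frames, 1662) keypoint sequences
        outputs, (h, c) = self.lstm(x)
        return outputs, h, c  # h, c become the decoder's initial states

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, h, c, enc_outputs):
        emb = self.embed(token)  # token: (batch, 1) previous word id
        # Dot-product attention over the encoder outputs.
        scores = torch.bmm(enc_outputs, h[-1].unsqueeze(2))        # (batch, frames, 1)
        weights = torch.softmax(scores, dim=1)
        context = torch.bmm(weights.transpose(1, 2), enc_outputs)  # (batch, 1, hidden)
        out, (h, c) = self.lstm(torch.cat([emb, context], dim=2), (h, c))
        return self.out(out.squeeze(1)), h, c  # logits over the vocabulary
```

At inference time the decoder starts from the `<SOS>` token and feeds its own prediction back in, one word at a time, until it emits `<EOS>`.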
Based on the loss on the training set, we plotted a learning curve of the Encoder-Decoder Seq2Seq model, which showed that the model improved during training.
However, results indicate that the model is unable to translate accurately. The following is an example of the translation.
Since the results of the Seq2Seq model were less than ideal, we also tried a different approach and applied a classification model. Following our literature review, we proposed using a Convolution 3D architecture to recognize the spatiotemporal features that best describe sign language motion in videos.
Convolution 3D Classification Model
Our proposed Conv3D architecture is designed based on the architecture proposed by Vrskova et al.
The architecture has the following structure:
Conv3D: 6 Convolution 3D Layers
BatchNorm3D: 4 Batch normalization 3D layers
MaxPool3D: 5 MaxPool3D layers
Dropout: 5 Dropout layers with 0.5 value
Dense: 4 Linear layers
ReLU activation function after every convolution and dense layer
Sigmoid function at the last layer
Loss function: Binary Cross-Entropy (BCE) loss
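A condensed PyTorch sketch of this design (fewer layers and smaller channel counts than the full architecture listed above, for brevity):

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """Condensed Conv3D binary classifier; input is (batch, 3, frames, H, W)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.BatchNorm3d(16), nn.MaxPool3d(2), nn.Dropout3d(0.5),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.BatchNorm3d(32), nn.MaxPool3d(2), nn.Dropout3d(0.5),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),  # probability of "uses sign language"
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SignClassifier()
loss_fn = nn.BCELoss()  # binary cross-entropy, matching the sigmoid output
```

Note how the sigmoid output pairs with the binary cross-entropy loss: the model emits a single probability per clip rather than a softmax over classes.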
For our Conv3D Classification Model, we used the How2Sign dataset to train the model and added TikTok videos retrieved via the TikTok API to evaluate it. We extracted 981 no-sign videos and 222 sign videos, then complemented those 222 videos with 1,000 sign language videos from How2Sign. The TikTok videos have a resolution of 1024×576 and were transformed into 3 channels. The number of frames depends on each video's length, so we had to compile all the videos and count their frames. After transforming the videos, we classified each one as using or not using sign language and tested the model.
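Since clip lengths vary, one simple way to feed fixed-size tensors into the model is to trim long clips and zero-pad short ones; this NumPy helper is our illustrative assumption rather than the exact pipeline:

```python
import numpy as np

def pad_or_trim(frames, target_len=64):
    """Force every clip to target_len frames: trim the tail of long clips and
    zero-pad short ones so all inputs share a single tensor shape."""
    frames = frames[:target_len]
    if len(frames) < target_len:
        pad = np.zeros((target_len - len(frames),) + frames.shape[1:],
                       dtype=frames.dtype)
        frames = np.concatenate([frames, pad])
    return frames
```

Zero-padding at the tail is the simplest choice; uniform frame sampling across the clip is a common alternative when videos are much longer than the target length.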
In the Conv3D Classification Model, the loss during training shows the expected behavior with mini-batches: with a small batch size, the loss curve has a non-smooth shape. Despite this, the model achieved 95% accuracy at epoch 17. Due to limited computational capacity, the maximum batch size we could use was 5.
4. Discussion & Conclusions
Why is the translation poor in the Encoder-Decoder Seq2Seq model?
1. Computational Power
Due to restrictions in computational power, we were only able to train this model on 1 GPU and an 8-core CPU. More computing capacity would allow us to use 2 or 4 LSTM layers and achieve a better hidden representation of the spatiotemporal features of hands, face, and pose. Likewise, more computational power would allow us to input the complete number of frames per video, feeding the models with more training data.
2. Data Selection
How2Sign provides both frontal and side views. Due to hardware limitations, we only selected the frontal one. A more complete data input may yield better accuracy.
3. Linguistics Issue
Since we would like to translate whole sentences, we cannot remove stop words. For example, "I love you" has a totally different meaning from "She loves me," but if we removed all stop words, the translation of both sentences would turn out to be simply "love." Overcoming this issue requires more complicated model building.
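A tiny illustration of the problem (the stop-word list here is just a subset for demonstration):

```python
stop_words = {"i", "you", "she", "me"}  # illustrative subset only

def strip_stops(sentence):
    return [w for w in sentence.lower().split() if w not in stop_words]

a = strip_stops("I love you")    # ['love']
b = strip_stops("She loves me")  # ['loves']
# After lemmatization both collapse to "love": who loves whom is lost.
```

The subject and object live entirely in the stop words, so removing them destroys exactly the information a full-sentence translation needs.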
Moreover, features of sign language such as the speed of motion and facial expressions should all be included. Although we generate landmarks for the whole body, including the face, our model seems to lack the ability to predict the interpreter's next gesture position.
4. Model Complexity
Our previous literature review made us aware that many studies separate gesture, pose, and facial expression into different channels with a complicated CNN model, which our model lacks.
What could we do to improve for the future?
- A possible solution for the translation model would be a combination of a Conv+LSTM Encoder and LSTM Decoder.
- Apply more appropriate text processing and relabel the sentences.
- Apply relative coordinate encoding of the landmarks.
- Train the model according to the topic of the sentence.
How2Sign categorizes its dataset into 10 topics. Narrowing the subject down first may improve the accuracy of our models.
- Include both the frontal view and side view video of the dataset.
- Find an appropriate loss function to feed the models with feedback and improve their performance.
- Try a Transformer instead of LSTM, since Transformers can process a whole sentence at once rather than word by word.
- Introduce BLEU (bilingual evaluation understudy) to evaluate the quality of our machine translation.
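As a simplified, single-reference BLEU sketch in plain Python (no smoothing; real evaluations typically use a library implementation):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, with brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: an empty n-gram match zeroes the score
        precisions.append(overlap / total)
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Because it counts overlapping n-grams rather than exact word-by-word matches, BLEU rewards translations that get the phrasing mostly right even when individual words differ.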
- How2Sign Dataset
- Duarte A., Palaskar S., Ventura L., Ghadiyaram D., DeHaan K., Metze F., Torres J., Giró-i-Nieto X. How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language. Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2735–2744. [Link]
- GloVe: Global Vectors for Word Representation
- Complete Guide to Spacy Tokenizer with Examples
- MediaPipe Holistic
- Sign Language Processing in GitHub
- Understanding LSTM Networks by Christopher Olah.
- Vrskova R., Hudec R., Kamencay P. Human Activity Classification Using the 3D CNN Architecture. 2022. [Link]
- Sutskever I., Vinyals O., Le Q. V. Sequence to Sequence Learning with Neural Networks. 2014. [Link]
- Bahdanau D., Cho K., Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate. 2016. [Link]
- Teather D., TikTok Unofficial API.[Link]