GSoC 2018 - Audio-Visual Speech Recognition using Deep Learning


Link to Repository of Code

Brief description of my work

  • Study the previous work done by the GSoC 2017 candidate and propose improvements
  • Implement speaker recognition in a video using audio-visual modalities, without any labelled training data (SyncNet)
  • Implement a completely end-to-end audio-visual speech recognition pipeline using the model described in the paper Lip Reading Sentences in the Wild

What has been done

  • Literature review to identify state-of-the-art approaches to audio-visual speech recognition
  • Speaker recognition in a video using the model from Out of Time: Automated Lip Sync in the Wild (SyncNet)
  • LRW-Sentences model architecture defined in TensorFlow
  • Data processing pipeline that processes the visual data and assembles batches of the visual "cube" tensors described in the paper, ready to be fed into the convolutional neural network (a sketch is shown below)
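
To make the "cube" idea concrete, here is a minimal sketch of how such a preprocessing step could look. It is an illustration only, not the project's actual code: the function name `make_visual_cubes`, the 5-frame window, and the 112x112 greyscale crop are all assumptions.

```python
import cv2
import numpy as np

def make_visual_cubes(video_path, num_frames=5, size=112):
    """Read a video, resize each frame to a greyscale patch, and stack
    consecutive frames into (num_frames, size, size, 1) "cubes"."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        grey = cv2.resize(grey, (size, size))
        frames.append(grey[..., np.newaxis])  # add a channel dimension
    cap.release()

    # Slide a window of num_frames over the sequence to form the cubes
    cubes = [np.stack(frames[i:i + num_frames])
             for i in range(len(frames) - num_frames + 1)]
    # Resulting batch shape: (batch, T, H, W, C), normalised to [0, 1]
    return np.asarray(cubes, dtype=np.float32) / 255.0
```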

TODO

  • Integrate the visual processing pipeline as an input to the model using the new tf.data API (see the sketch after this list)
  • Perform training and validation using only the visual data, as described in the "Watch-Attend-Spell" section of the paper
  • Add support for the audio modality, as described in the "Listen-Attend-Spell" section of the paper
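
For the first item, a minimal sketch of how the visual cubes could be wired into the model through tf.data is shown below, in the TensorFlow 1.x style that was current at the time. The generator and the helper `load_preprocessed_clips` are hypothetical placeholders, and the shapes and batch size are assumptions.

```python
import tensorflow as tf

def cube_generator():
    # Hypothetical helper yielding (visual_cube, transcript) pairs,
    # e.g. the output of the preprocessing sketch above.
    for cube, text in load_preprocessed_clips():
        yield cube, text

dataset = (tf.data.Dataset.from_generator(
               cube_generator,
               output_types=(tf.float32, tf.string),
               output_shapes=((5, 112, 112, 1), ()))
           .shuffle(buffer_size=256)
           .batch(16)
           .prefetch(1))

iterator = dataset.make_one_shot_iterator()
cubes, transcripts = iterator.get_next()  # tensors fed into the model
```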

Future directions

  • Training of LRW Sentences is currently performed on the GRID audiovisual sentence corpus, which has a nearly identical recording environment for every speaker. This could be improved by training on data with more varied backgrounds, such as news videos.
  • Currently, speaker recognition in a video frame (SyncNet) and LRW Sentences are two separate implementations; these could be unified so that the AVSR model also works when there are multiple speakers in the video frame.
  • The CNN architecture used for extracting visual features is VGG-M, but recent research suggests using a ResNet for better performance, as described in the paper Combining Residual Networks with LSTMs for Lipreading (a minimal residual-block sketch follows this list).
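
As a rough illustration of that last point, the following is a generic basic residual block written with tf.keras. It is not the architecture from the cited paper, just a sketch of the building block a ResNet front-end would be composed of; the filter sizes and layer choices are assumptions.

```python
import tensorflow as tf

def residual_block(x, filters, stride=1):
    """Basic 2-D residual block: two conv-BN layers plus a skip connection."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride)(shortcut)
    out = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation("relu")(out)
```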
