GSoC 2018 - Audio-Visual Speech Recognition using Deep Learning
Link to Repository of Code
Describe my work briefly
- Study the previous work done by the GSoC 2017 candidate and propose improvements
- Implement speaker recognition in a video using audio-visual modalities, without any labelled training data (SyncNet)
- Implement a complete end-to-end audio-visual speech recognition pipeline using the model described in the paper Lip Reading Sentences in the Wild
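As a rough illustration of the SyncNet idea in the second item, the sketch below picks the speaking face by checking how well each face track's embeddings line up with the audio embeddings over a small range of time offsets. It is only a sketch under assumptions: the embeddings here are synthetic random arrays standing in for the outputs of a trained two-stream network, and the function names (`sync_confidence`, `pick_speaker`) and the `max_offset` parameter are hypothetical, not taken from the project code.

```python
import numpy as np


def sync_confidence(video_emb, audio_emb, max_offset=3):
    """Score how well one face track is synchronised with the audio.

    video_emb, audio_emb: arrays of shape (T, D), one embedding per time step.
    Returns median minus minimum of the mean distances over the tested offsets;
    a track with one clearly best offset scores high.
    """
    num_steps = video_emb.shape[0]
    mean_dists = []
    for offset in range(-max_offset, max_offset + 1):
        # Pair video step t with audio step t + offset where both exist.
        v_start = max(0, -offset)
        v_end = min(num_steps, num_steps - offset)
        v = video_emb[v_start:v_end]
        a = audio_emb[v_start + offset:v_end + offset]
        mean_dists.append(np.linalg.norm(v - a, axis=1).mean())
    mean_dists = np.array(mean_dists)
    return float(np.median(mean_dists) - mean_dists.min())


def pick_speaker(face_embs, audio_emb, max_offset=3):
    """Return the index of the face track with the highest sync confidence."""
    scores = [sync_confidence(f, audio_emb, max_offset) for f in face_embs]
    return int(np.argmax(scores)), scores


# Synthetic stand-ins for network outputs (assumption, for illustration only).
rng = np.random.default_rng(0)
audio = rng.normal(size=(20, 8))
speaker = audio + 0.05 * rng.normal(size=(20, 8))  # closely tracks the audio
bystander = rng.normal(size=(20, 8))               # unrelated to the audio

best, scores = pick_speaker([bystander, speaker], audio)
print("speaking face index:", best)
```

No speaker labels are needed at any point: the only signal is whether a face moves in step with the sound.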
What is done
- Literature review to identify state-of-the-art implementations of audio-visual speech recognition
- Speaker recognition in a video using the model from the paper Out of time: automated lip sync in the wild (SyncNet)
- LRW-Sentences model architecture defined in TensorFlow
- Data processing pipeline that processes the visual data and builds batches of the visual cube tensors described in the paper, ready to be passed into the convolutional neural network
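The batching step in the last item can be sketched roughly as follows. This is a minimal NumPy illustration, not the project's actual pipeline: it assumes the mouth region has already been cropped into grayscale frames, and the window length of 5 frames, the 120x120 crop size, and the helper names `make_visual_cubes` and `batch_cubes` are assumptions chosen for the example.

```python
import numpy as np


def make_visual_cubes(frames, window=5, stride=1):
    """Stack consecutive frames into overlapping cubes.

    frames: array of shape (T, H, W) holding grayscale mouth crops.
    Returns an array of shape (N, window, H, W).
    """
    num_frames = frames.shape[0]
    if num_frames < window:
        raise ValueError("need at least `window` frames to build one cube")
    starts = range(0, num_frames - window + 1, stride)
    return np.stack([frames[s:s + window] for s in starts])


def batch_cubes(cubes, batch_size):
    """Yield consecutive batches of cubes; the last batch may be smaller."""
    for i in range(0, len(cubes), batch_size):
        yield cubes[i:i + batch_size]


# Dummy clip: 12 blank frames of 120x120 pixels (placeholder data).
frames = np.zeros((12, 120, 120), dtype=np.float32)
cubes = make_visual_cubes(frames)
batches = list(batch_cubes(cubes, batch_size=3))
print(cubes.shape, [b.shape[0] for b in batches])
```

Each batch can then be fed to the convolutional front end, with the window axis treated as the input channels or as a temporal axis depending on how the network is defined.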
TODO
- Integrate the visual processing pipeline as an input to the model using the new tf.data API
- Perform training and validation using only the visual data, as described in the “Watch-Attend-Spell” section of the paper
- Add support for the audio part described in the “Listen-Attend-Spell” section of the paper
Future directions
- Training of the LRW-Sentences model is currently performed on the GRID audiovisual sentence corpus, which has nearly the same recording environment for every speaker; this could be improved by using data with varied backgrounds and locations, such as news videos
- Currently, speaker recognition in a video frame (SyncNet) and LRW-Sentences are two separate implementations; they could be brought under a single roof so that the AVSR model also works when there are multiple speakers in the video frame
- The CNN architecture used for extracting visual features is VGG-M, but recent research suggests that ResNet performs better, as described in the paper Combining Residual Networks with LSTMs for Lipreading