This paper tackles the challenging task of hand shape recognition and continuous Sign Language Recognition (SLR) directly from images captured by an ordinary RGB camera (rather than relying on motion sensors such as Kinect). The basic idea is to build a network that is end-to-end trainable from input sequences (i.e. images) to output sequences (i.e. hand shape labels, word labels).

The network is composed of three parts:

- A CNN as the feature extractor
- Bidirectional LSTMs for temporal modeling
- Connectionist Temporal Classification (CTC) as the loss layer

![Network structure](https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/3269d3541f0eec006aee6ce086db2665b7ded92d/1-Figure1-1.png)

Results:

- State-of-the-art results (at the time of publication) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Using full images rather than hand patches gives better performance for continuous SLR.
- A network that recognizes hand shapes and a network that recognizes word sequences can be combined and trained together to recognize word sequences. Fine-tuning all layers of the combined system works better than freezing the "feature extraction" layers.
- Combining two networks that were each trained on a separate task performs slightly better than training each network on word sequences.
- Different decoding and post-processing techniques for sequence-to-sequence prediction make only a marginal difference in performance.
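
To make the role of the CTC output layer more concrete, here is a minimal sketch of greedy (best-path) CTC decoding, one common way to turn per-frame scores into a label sequence. It is not taken from the paper: the function name `ctc_greedy_decode`, the blank index `0`, and the toy scores are all assumptions made for illustration, and the summary does not say which decoding variants the authors compared.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (a convention, not from the paper)


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of consecutive identical labels into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]


# Toy input: 6 frames, 3 classes (0 = blank, 1 and 2 = hypothetical sign labels).
scores = [
    [0.10, 0.80, 0.10],  # label 1
    [0.20, 0.70, 0.10],  # label 1 (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # label 1 (new occurrence, separated by the blank)
    [0.10, 0.20, 0.70],  # label 2
    [0.60, 0.20, 0.20],  # blank
]
decoded = ctc_greedy_decode(scores)
print(decoded)
```

In a full system, the per-frame scores would come from the bidirectional LSTM outputs on top of the CNN features. The blank symbol is what lets the model emit the same label twice in a row, which is why repeats are merged before blanks are removed rather than after.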