Speech-to-Text Transcription Model


In this project we worked with the Wall Street Journal (WSJ) dataset. The task was to transcribe a given speech utterance into its corresponding character-level transcript.

The baseline model for this project is described in the Listen, Attend and Spell (LAS) paper. The idea is to learn all components of a speech recognizer jointly, rather than training the acoustic, pronunciation, and language models separately. The paper describes an encoder-decoder architecture whose two halves are called the Listener and the Speller.

The Listener is a pyramidal bidirectional LSTM (pBLSTM) network that takes in the given utterance and compresses it in time, producing a shorter sequence of high-level representations for the Speller. The Speller is an attention-based LSTM decoder: at each step it attends over the Listener's outputs, combines the resulting context with its own state, and computes a probability distribution over the next character. Intuitively, attention learns a soft alignment from the decoder's current state to the regions of the encoded utterance that are relevant for predicting the next character.
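As a rough illustration, here is a minimal PyTorch sketch of the two building blocks described above: a single pBLSTM layer that halves the time resolution by concatenating adjacent frames, and a dot-product attention step. This is a hypothetical sketch of the general technique, not the project's actual code; the class names are illustrative, and the LAS paper itself computes attention energies with an MLP rather than a plain dot product.

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """One pyramidal BiLSTM layer: adjacent timesteps are concatenated,
    halving the time resolution before the bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                  # x: (batch, time, input_dim)
        b, t, d = x.shape
        x = x[:, : t - (t % 2), :]         # drop the last frame if time is odd
        x = x.reshape(b, t // 2, d * 2)    # merge each pair of adjacent frames
        out, _ = self.blstm(x)             # out: (batch, time // 2, 2 * hidden_dim)
        return out

class DotProductAttention(nn.Module):
    """One attention step: score every encoder timestep against the
    current decoder state and return the weighted sum of the values."""
    def forward(self, query, keys, values):
        # query: (batch, d); keys, values: (batch, time, d)
        energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)       # (batch, time)
        weights = torch.softmax(energy, dim=1)                        # soft alignment
        context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # (batch, d)
        return context, weights
```

Stacking three such pBLSTM layers on top of an initial BLSTM reduces the time resolution by a factor of 2^3 = 8, as in the LAS paper, which keeps the attention computation in the Speller tractable.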

One problem we faced was that decoding was deterministic: once trained, the model produces a fixed set of outputs for a given input state, so given a particular state it always generated the same next output. To introduce randomness into our predictions (only at generation time), we added Gumbel noise to the output logits before selecting the next character. With this setup we obtained a Levenshtein distance of 8.9.
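A minimal sketch of that decoding step is below. Adding Gumbel noise to the logits and taking the argmax is equivalent to sampling from the softmax distribution (the Gumbel-max trick), which is what turns deterministic greedy decoding into sampling. The function name and the temperature parameter are illustrative assumptions, not taken from the project's code.

```python
import torch

def gumbel_decode_step(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Pick the next character by perturbing the logits with Gumbel noise.

    argmax(logits / T + Gumbel noise) is a sample from softmax(logits / T),
    so each decoding step draws from the model's predictive distribution
    instead of always returning the same greedy choice.
    """
    uniform = torch.rand_like(logits)
    # Gumbel(0, 1) noise via the inverse CDF; epsilons guard against log(0).
    gumbel = -torch.log(-torch.log(uniform + 1e-20) + 1e-20)
    return torch.argmax(logits / temperature + gumbel, dim=-1)
```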