Teacher forcing with pytorch RNN

Amit

The pytorch tutorials do a great job of illustrating a bare-bones RNN by defining the input and hidden layers, and manually feeding the hidden layers back into the network to remember the state. This flexibility then allows you to very easily perform teacher forcing.

Question 1: How do you perform teacher forcing when using the native nn.RNN() module (since the entire sequence is fed at once)? An example of a simple RNN network would be:

import torch
import torch.nn.functional as F
from torch import autograd, nn

class SimpleRNN(nn.Module):

    # NOTE: the signature was cut off in the original post. It is reconstructed
    # from the call below; the hidden_size and nlayers defaults are assumptions
    # inferred from the printed hidden shape [1, 32, 128].
    def __init__(self, vocab_size, embedding_dim, batch_sz, hidden_size=128, nlayers=1):

        super(SimpleRNN, self).__init__()

        self.batch_sz = batch_sz
        self.embedding_dim = embedding_dim
        self.hidden_size = hidden_size
        self.nlayers = nlayers

        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_size, nlayers, dropout=0.5)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def init_hidden(self):
        return autograd.Variable(torch.zeros(self.nlayers, self.batch_sz, self.hidden_size)).cuda()

    def forward(self, inputs, hidden):

        # -- encoder returns:
        # -- [batch_sz, seq_len, embed_dim]
        encoded = self.encoder(inputs)
        _, seq_len, _ = encoded.size()

        # -- rnn returns:
        # -- output.size() = [seq_len, batch_sz, hidden_sz]
        # -- hidden.size() = [nlayers, batch_sz, hidden_sz]
        output, hidden = self.rnn(encoded.view(seq_len, self.batch_sz, self.embedding_dim), hidden)

        # -- decoder returns:
        # -- output.size() = [batch_sz, seq_len, vocab_size]
        output = F.log_softmax(self.decoder(output.view(self.batch_sz, seq_len, self.hidden_size)))

        return output, hidden

I can call the network with:

model = SimpleRNN(vocab_size, embedding_dim, batch_sz).cuda()
x_data, y_data = get_sequence_data(train_batches[0])
output, hidden = model(x_data, model.init_hidden())

Just for completeness, here are my shapes of x_data, output, and hidden:

print(x_data.size(), output.size(), hidden.size())
torch.Size([32, 80]) torch.Size([32, 80, 4773]) torch.Size([1, 32, 128])

Question 2: would it be possible to use this SimpleRNN network to then generate a sequence word-by-word, by first feeding it a <GO_TOKEN> and iterating until an <END_TOKEN> is reached? I ask because when I run this:

x_data = autograd.Variable(torch.LongTensor([[word2idx['<GO>']]]), volatile=True).cuda()
output, hidden = model(x_data, model.init_hidden(1))

print(output, output.sum())

I get an output of all 0s, and the output.sum() = 0. I get this even after training the network and backpropagating the loss. Any ideas why?

Question 3: If not terribly inefficient, is it possible to train the SimpleRNN network above word-by-word, analogous to the pytorch tutorial shown [here](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) (albeit there they're training character-by-character)?



answered 3 months ago Oliver #1

Question 1.

Teacher forcing is performed implicitly in this case. Because your x_data holds the entire ground-truth sequence for every example (here [batch_sz, seq_len]), nn.RNN is fed the ground-truth item at each time step and never uses its own output as the next input.
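To make the "implicit" part concrete, here is a minimal sketch of how teacher forcing falls out of feeding a whole sequence at once. It is not the asker's model: the `TinyLM` name, the small sizes, the random token batch, and the use of `batch_first=True` are all assumptions made for illustration. The input is the ground-truth sequence without its last token, and the target is the same sequence shifted left by one.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
vocab_size, embedding_dim, hidden_size = 20, 8, 16
batch_sz, seq_len = 4, 6


class TinyLM(nn.Module):
    """Illustrative stand-in for SimpleRNN (batch_first=True avoids reshaping)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs, hidden=None):
        output, hidden = self.rnn(self.encoder(inputs), hidden)
        # Normalize over the vocabulary dimension explicitly.
        return F.log_softmax(self.decoder(output), dim=-1), hidden


model = TinyLM()

# A random batch of "ground-truth" token ids: [batch_sz, seq_len + 1].
tokens = torch.randint(0, vocab_size, (batch_sz, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

# One call over the whole sequence: at step t the RNN sees the true token t,
# never its own prediction for that position. That is teacher forcing.
log_probs, _ = model(inputs)
loss = F.nll_loss(log_probs.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```

`batch_first=True` is used here so the `[batch_sz, seq_len, ...]` layout can be passed straight through. Note that `view` only reinterprets memory and does not swap axes, so when a `[seq_len, batch_sz, ...]` layout is needed, `transpose(0, 1)` is the safer choice.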

Question 2.

Your model.init_hidden() does not take any arguments, yet you are calling it with a batch size (model.init_hidden(1)), so check that first. Everything else seems fine, though you will need to apply max() or multinomial() to the output to pick a token before you can feed it back through.
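A generation loop along those lines might look like the sketch below. The tiny vocabulary, the layer sizes, and the `generate` helper are hypothetical, and the weights are untrained, so the output is random; the point is only the mechanics of carrying the hidden state forward with a batch size of 1 and choosing the next token.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical vocabulary; indices 0 and 1 play the roles of <GO> and <END>.
word2idx = {"<GO>": 0, "<END>": 1, "a": 2, "b": 3, "c": 4}
vocab_size, embedding_dim, hidden_size, nlayers = len(word2idx), 8, 16, 1

encoder = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, nlayers)  # expects [seq_len, batch, features]
decoder = nn.Linear(hidden_size, vocab_size)


def generate(max_len=10, sample=True):
    """Generate token ids one at a time, starting from <GO>."""
    token = torch.tensor([[word2idx["<GO>"]]])     # [seq_len=1, batch=1]
    hidden = torch.zeros(nlayers, 1, hidden_size)  # batch size 1 for generation
    generated = []
    with torch.no_grad():                          # modern replacement for volatile=True
        for _ in range(max_len):
            output, hidden = rnn(encoder(token), hidden)
            log_probs = F.log_softmax(decoder(output[-1]), dim=-1)  # [1, vocab_size]
            if sample:
                # multinomial expects probabilities, so undo the log first.
                token = torch.multinomial(log_probs.exp(), 1)
            else:
                token = log_probs.argmax(dim=-1, keepdim=True)
            if token.item() == word2idx["<END>"]:
                break
            generated.append(token.item())
    return generated


print(generate())
```

One possible explanation for the all-zero output in Question 2, offered as a guess rather than a diagnosis: a log-softmax taken over a dimension of size 1 is always 0. If `F.log_softmax` is called without `dim` and ends up normalizing over the batch dimension of a `[1, 1, vocab_size]` tensor, every entry will be 0. Passing `dim=-1` as above rules that out.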

Question 3.

Yes, you can do this, and yes, it is terribly inefficient. This is a limitation of the cuDNN LSTM kernel, which is built to process a whole sequence in a single call.
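As a rough sketch of what word-by-word training could look like with nn.RNN: each call receives a sequence of length 1 and the hidden state is carried forward by hand. The sizes, the random token batch, the `step_by_step` helper, and the `teacher_forcing_ratio` parameter are assumptions for illustration, not part of the original code. With the ratio at 1.0 the loop reproduces the single whole-sequence call; lower values occasionally feed the model's own prediction back in.

```python
import random

import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
vocab_size, embedding_dim, hidden_size = 20, 8, 16
batch_sz, seq_len = 4, 6

encoder = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_sz, seq_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]


def step_by_step(teacher_forcing_ratio=1.0):
    """Run the RNN one time step at a time; return log-probs [batch, seq, vocab]."""
    hidden = torch.zeros(1, batch_sz, hidden_size)
    step_input = inputs[:, :1]                       # first true token, [batch, 1]
    all_log_probs = []
    for t in range(seq_len):
        output, hidden = rnn(encoder(step_input), hidden)
        log_probs = F.log_softmax(decoder(output), dim=-1)  # [batch, 1, vocab]
        all_log_probs.append(log_probs)
        if t + 1 < seq_len:
            if random.random() < teacher_forcing_ratio:
                step_input = inputs[:, t + 1 : t + 2]       # feed the true next token
            else:
                step_input = log_probs.argmax(dim=-1)       # feed the model's own guess
    return torch.cat(all_log_probs, dim=1)


stepwise = step_by_step(teacher_forcing_ratio=1.0)
loss = F.nll_loss(stepwise.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()

# With full teacher forcing the loop matches one call over the whole sequence.
full_output, _ = rnn(encoder(inputs))
full = F.log_softmax(decoder(full_output), dim=-1)
```

The loop issues `seq_len` small calls where the whole-sequence version issues one, which is where the inefficiency comes from. It is mainly worth the cost when you actually want to mix in the model's own predictions.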
