The PyTorch tutorials do a great job of illustrating a bare-bones RNN by defining the input and hidden layers and manually feeding the hidden state back into the network at each step. That flexibility makes it very easy to perform teacher forcing.
Question 1: How do you perform teacher forcing when using the native `nn.RNN()` module (since the entire sequence is fed at once)? An example of a simple RNN network would be:

    class SimpleRNN(nn.Module):
        def __init__(self, vocab_size, embedding_dim, batch_sz, hidden_size=128, nlayers=1, num_directions=1, dropout=0.1):
            super(SimpleRNN, self).__init__()
            self.batch_sz = batch_sz
            self.hidden_size = hidden_size
            self.encoder = nn.Embedding(vocab_size, embedding_dim)
            self.rnn = nn.RNN(embedding_dim, hidden_size, nlayers, dropout=0.5)
            self.decoder = nn.Linear(hidden_size, vocab_size)

        def init_hidden(self):
            return autograd.Variable(torch.zeros(nlayers, batch_sz, hidden_size)).cuda()

        def forward(self, inputs, hidden):
            # -- encoder returns:
            # -- [batch_sz, seq_len, embed_dim]
            encoded = self.encoder(inputs)
            _, seq_len, _ = encoded.size()
            # -- rnn returns:
            # -- output.size() = [seq_len, batch_sz, hidden_sz]
            # -- hidden.size() = [nlayers, batch_sz, hidden_sz]
            output, hidden = self.rnn(encoded.view(seq_len, batch_sz, embedding_dim), hidden)
            # -- decoder returns:
            # -- output.size() = [batch_sz, seq_len, vocab_size]
            output = F.log_softmax(decoder(output.view(batch_sz, seq_len, self.hidden_size)))
            return output, hidden

I can call the network with:

    model = SimpleRNN(vocab_size, embedding_dim, batch_sz).cuda()
    x_data, y_data = get_sequence_data(train_batches)
    output, hidden = model(x_data, model.init_hidden())

Just for completeness, here are the shapes of `x_data`, `output` and `hidden`:

    print(x_data.size(), output.size(), hidden.size())
    torch.Size([32, 80]) torch.Size([32, 80, 4773]) torch.Size([1, 32, 128])

Question 2: Would it be possible to use this `SimpleRNN` network to then generate a sequence word by word, by first feeding it a `<GO_TOKEN>` and iterating until an `<END_TOKEN>` is reached? I ask because when I run this:

    x_data = autograd.Variable(torch.LongTensor([[word2idx['<GO>']]]), volatile=True).cuda()
    output, hidden = model(x_data, model.init_hidden(1))
    print(output, output.sum())

I get an `output` of all 0s, and `output.sum()` is 0. I get this even after training the network and backpropagating the loss. Any ideas why?
Question 3: If it is not terribly inefficient, is it possible to train the `SimpleRNN` network above word by word, analogous to the PyTorch tutorial shown [here](http://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) (albeit there they are training character by character)?
Teacher forcing is performed implicitly in this case. Because `x_data` holds the whole ground-truth sequence ([seq_len, batch_size] as the RNN sees it), `nn.RNN` feeds in each item along seq_len as the input for that step and never uses its own output as the next input.
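To make the implicit teacher forcing concrete, here is a minimal sketch of one training step. It is not the asker's model: the sizes, the random `tokens` batch, and the use of `batch_first=True` are assumptions made for illustration. The ground-truth sequence is shifted by one position so that the input at step t is the true token t and the target is the true token t + 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_size = 50, 16, 32
batch_sz, seq_len = 4, 10

embed = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

# Random token ids standing in for a real batch of training sequences.
tokens = torch.randint(0, vocab_size, (batch_sz, seq_len + 1))

# Teacher forcing: inputs are the ground-truth tokens 0..T-1,
# targets are the ground-truth tokens 1..T (the same sequence shifted by one).
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden = torch.zeros(1, batch_sz, hidden_size)
output, hidden = rnn(embed(inputs), hidden)          # [batch, seq_len, hidden]
log_probs = F.log_softmax(decoder(output), dim=-1)   # [batch, seq_len, vocab]

# The model's own predictions are only scored here; they are never fed back in.
loss = F.nll_loss(log_probs.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```

The single `rnn(...)` call sees the true token at every position, which is exactly what teacher forcing means; no manual loop is needed for it.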
Your `model.init_hidden` does not take any arguments, but it looks like you are trying to pass the batch size in (`model.init_hidden(1)`), so that may be worth checking. Everything else seems fine, though you will need to apply `max()` or `multinomial()` to the output before you can feed it back through the network.
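A generation loop along those lines might look like the sketch below. The toy vocabulary, the sizes, and the helper names `init_hidden` and `generate` are hypothetical and the model is untrained, so the tokens it emits are meaningless. The sketch also passes `dim=-1` to `log_softmax` explicitly: for a `[1, 1, vocab_size]` output, normalizing over either of the size-1 dimensions gives log(1) = 0 everywhere, which may be one reason for the all-zero output described in the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy vocabulary and sizes, for illustration only.
word2idx = {'<GO>': 0, '<END>': 1, 'a': 2, 'b': 3, 'c': 4}
vocab_size, embedding_dim, hidden_size = len(word2idx), 8, 16

embed = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
decoder = nn.Linear(hidden_size, vocab_size)

def init_hidden(batch_sz, nlayers=1):
    # Takes the batch size as an argument so it can be 1 at generation time.
    return torch.zeros(nlayers, batch_sz, hidden_size)

def generate(max_len=20, sample=True):
    token = torch.tensor([[word2idx['<GO>']]])        # [batch=1, seq_len=1]
    hidden = init_hidden(batch_sz=1)
    generated = []
    with torch.no_grad():                             # modern replacement for volatile=True
        for _ in range(max_len):
            output, hidden = rnn(embed(token), hidden)
            # Normalize over the vocabulary dimension.
            log_probs = F.log_softmax(decoder(output[:, -1]), dim=-1)  # [1, vocab]
            if sample:
                token = torch.multinomial(log_probs.exp(), 1)          # [1, 1]
            else:
                token = log_probs.argmax(dim=-1, keepdim=True)         # [1, 1]
            idx = token.item()
            if idx == word2idx['<END>']:
                break
            generated.append(idx)
    return generated

# Normalizing over a size-1 dimension always yields zeros.
zeros = F.log_softmax(torch.randn(1, 1, vocab_size), dim=0)
sequence = generate()
```

Calling `generate(sample=False)` switches from multinomial sampling to greedy decoding; in both cases the chosen token and the returned hidden state are fed back in on the next iteration.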
Yes, you can do this, and yes, it is terribly inefficient. This is a limitation of the cuDNN LSTM kernel.
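As a rough illustration of the word-by-word variant, the sketch below steps an `nn.RNN` through a sequence one time step at a time, carrying the hidden state forward, and compares the result with a single call over the whole sequence. The sizes and the random input are assumptions; on CPU the two results should agree up to floating-point tolerance, while the loop issues one call per step instead of one per sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
embedding_dim, hidden_size, batch_sz, seq_len = 8, 16, 4, 6

rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
x = torch.randn(batch_sz, seq_len, embedding_dim)    # stands in for embedded tokens
h0 = torch.zeros(1, batch_sz, hidden_size)

# One call over the whole sequence.
full_out, full_hidden = rnn(x, h0)

# The same computation, one time step per call.
hidden = h0
step_outs = []
for t in range(seq_len):
    out_t, hidden = rnn(x[:, t:t + 1], hidden)       # input slice is [batch, 1, embedding_dim]
    step_outs.append(out_t)
step_out = torch.cat(step_outs, dim=1)               # [batch, seq_len, hidden]
```

Because the per-step outputs remain part of the autograd graph, a loss computed on `step_out` can be backpropagated just as with `full_out`; the stepwise form is mainly useful when the next input has to depend on the previous output.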