Use the subwords8k pre-tokenized IMDB Reviews dataset. You will load it via TensorFlow Datasets:

import tensorflow_datasets as tfds

# Download the subword encoded pretokenized dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)

# Get the tokenizer
tokenizer = info.features['text'].encoder
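
The encoder is a subword text encoder, so a quick way to get a feel for it is to check the vocabulary size and round-trip a sample string. This check is optional and the sample string below is arbitrary:

# Inspect the subword encoder (optional sanity check)
print(tokenizer.vocab_size)

sample_string = 'TensorFlow is fun and exciting'  # arbitrary example text
tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))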

Prepare the dataset

You can then get the train and test splits and generate padded batches.

Note: To make the training go faster in this lab, you will increase the batch size that Laurence used in the lecture. In particular, you will use a batch size of 256, which takes roughly a minute per epoch to train. In the video, Laurence used 16, which takes around 4 minutes per epoch.

BUFFER_SIZE = 10000
BATCH_SIZE = 256

# Get the train and test splits
train_data, test_data = dataset['train'], dataset['test']

# Shuffle the training data
train_dataset = train_data.shuffle(BUFFER_SIZE)

# Batch and pad the datasets to the maximum length of the sequences
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_data.padded_batch(BATCH_SIZE)
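
As an optional sanity check, you can take one batch and look at its shape. padded_batch() pads each batch to the length of its longest sequence, so the second dimension will vary from batch to batch:

# Peek at one padded batch to verify the shapes (optional)
for padded_reviews, labels in train_dataset.take(1):
  print(padded_reviews.shape)  # (256, <longest sequence in this batch>)
  print(labels.shape)          # (256,)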

Build and compile the model

Now you will build the model. You will simply swap the Flatten or GlobalAveragePooling1D layer from before with an LSTM layer. Moreover, you will nest it inside a Bidirectional layer so the sequence information is passed both forwards and backwards. These additional computations will naturally make training slower than the models you built last week. You should take this into account when using RNNs in your own applications.

import tensorflow as tf

# Hyperparameters
embedding_dim = 64
lstm_dim = 64 # the output dimension is doubled when wrapped in a Bidirectional layer
dense_dim = 64

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_dim)),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Print the model summary
model.summary()
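
You can see the doubling mentioned in the comment above in the model summary: the Bidirectional layer outputs 2 * lstm_dim features because it concatenates the forward and backward LSTM outputs. Here is a standalone sketch of that behavior using made-up input shapes (illustration only, not part of the lab itself):

# Illustration only: a Bidirectional wrapper doubles the output width of the LSTM.
# The (8, 20, 16) shape is arbitrary: 8 sequences, 20 timesteps, 16 features each.
dummy_inputs = tf.random.uniform((8, 20, 16))

print(tf.keras.layers.LSTM(lstm_dim)(dummy_inputs).shape)                                  # (8, 64)
print(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_dim))(dummy_inputs).shape)   # (8, 128)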

Alternatively, you can stack a second bidirectional LSTM. For stacking to work, the first LSTM needs return_sequences=True so that it outputs a value for every timestep rather than just the final one, giving the second LSTM a sequence to process:

import tensorflow as tf

# Hyperparameters
embedding_dim = 64
lstm1_dim = 64
lstm2_dim = 32
dense_dim = 64

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm2_dim)),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Print the model summary
model.summary()
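
To see concretely what return_sequences=True changes, here is a small sketch with arbitrary input shapes (illustration only, not part of the lab): the first LSTM returns an output for every timestep, which is what allows the second LSTM to receive a sequence.

# Illustration only: return_sequences keeps the timestep dimension.
# The (8, 20, 16) shape is arbitrary: 8 sequences, 20 timesteps, 16 features each.
dummy_inputs = tf.random.uniform((8, 20, 16))

print(tf.keras.layers.LSTM(lstm1_dim)(dummy_inputs).shape)                         # (8, 64) - final output only
print(tf.keras.layers.LSTM(lstm1_dim, return_sequences=True)(dummy_inputs).shape)  # (8, 20, 64) - one output per timestep
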
# Set the training parameters
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Train the model

NUM_EPOCHS = 10

history = model.fit(train_dataset, epochs=NUM_EPOCHS, validation_data=test_dataset)
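
The history object tracks validation metrics per epoch, but if you also want a single final number for the test set you can call model.evaluate. This step is optional and not part of the original lab:

# Optional: compute the final loss and accuracy on the test set
final_loss, final_accuracy = model.evaluate(test_dataset)
print('Test loss: {:.4f}, test accuracy: {:.4f}'.format(final_loss, final_accuracy))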
Plot the results

import matplotlib.pyplot as plt

# Plot utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

# Plot the accuracy and results 
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

(The accuracy and loss plots produced by plot_graphs appeared here.)

Week 3 Wrap up

You’ve been experimenting with NLP for text classification over the last few weeks. Next week, you’ll switch gears and use the tools you’ve learned to predict text, which ultimately means you can create text. By learning from sequences of words, a model can predict the most likely word that comes next, so starting from a new sequence of words it can build on them to generate more. You’ll work with different training sets, like traditional Irish songs or Shakespeare’s poetry, and learn how to create new sets of words using their embeddings!