Generating Sequences and Padding

You will now look at converting your input sentences into sequences of tokens. Similar to images in the previous course, you need to prepare text data with a uniform size before feeding it to your model. You will see how to do this in the next sections.

Text to sequence

<aside> 💡 Converting text to sequences is done using the texts_to_sequences() method, as shown below.

</aside>

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")

# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\\nWord Index = " , word_index)
print("\\nSequences = " , sequences)
Word Index =  {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

Padding

You will usually need to pad the sequences to a uniform length because that is what your model expects. You can use the pad_sequences() method for that. By default, it will pad according to the length of the longest sequence. You can override this with the maxlen argument to set a specific length. Feel free to play with the other arguments shown in class (such as padding and truncating) and compare the results; a short sketch follows the output below.

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5, truncating='post')

# Print the result
print("\\nPadded Sequences:")
print(padded)

Padded Sequences:
[[0 5 3 2 4]
 [0 5 3 2 7]
 [0 6 3 2 4]
 [8 6 9 2 4]]
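
To experiment with the other arguments, here is a minimal sketch that reuses the sequences generated above and pads at the end instead of the front, letting the longest sequence determine the padded length. The expected result in the comments follows from the sequences shown earlier.

# Pad after each sequence instead of before; with no maxlen,
# the longest sequence (7 tokens) determines the padded length
padded_post = pad_sequences(sequences, padding='post')

# Print the result
print("\nPadded Sequences (padding='post'):")
print(padded_post)

# Expected result:
# [[ 5  3  2  4  0  0  0]
#  [ 5  3  2  7  0  0  0]
#  [ 6  3  2  4  0  0  0]
#  [ 8  6  9  2  4 10 11]]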

Out-of-vocabulary tokens
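
Because the Tokenizer was initialized with oov_token="<OOV>", any word not seen during fit_on_texts() is mapped to the <OOV> index (1 in the word index above) when you call texts_to_sequences() on new text. Below is a minimal sketch continuing from the tokenizer above; the test sentences are illustrative, and the expected result in the comments follows from the word index shown earlier.

# Try the tokenizer on sentences containing words it has not seen before
test_sentences = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Unseen words ('really', 'loves', 'manatee') map to the <OOV> token (index 1)
test_sequences = tokenizer.texts_to_sequences(test_sentences)

# Print the result
print("\nTest Sequences = ", test_sequences)

# Expected result:
# Test Sequences =  [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]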