Text Generation before Transformers

It's important to note that generative algorithms are not new. Previous generations of language models used an architecture called recurrent neural networks, or RNNs.

RNNs

RNNs, while powerful for their time, were limited by the amount of compute and memory needed to perform well at generative tasks.

With only a single previous word visible to the model, the quality of the prediction is limited. Increasing the RNN's ability to consider more preceding words requires a substantial expansion of resources, and even a scaled-up model often predicts poorly because it still sees too little of the input. Effective next-word prediction demands a broader context than a handful of prior words, ideally the entire sentence or even the whole document.
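
To make the bottleneck concrete, here is a minimal next-word-prediction sketch in PyTorch (an illustration written for these notes, not code from the course): the RNN has to compress everything it has seen into a single fixed-size hidden state, which is exactly what limits how much context it can use.

```python
import torch
import torch.nn as nn

# Toy RNN language model: the entire preceding context is squeezed into one
# fixed-size hidden state, which is the bottleneck described above.
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        outputs, _ = self.rnn(x)             # hidden state carries all prior context
        return self.head(outputs[:, -1, :])  # logits for the next token only

# Hypothetical token IDs for a short prefix; the model scores every word in
# the vocabulary as a candidate for the next position.
model = RNNLanguageModel(vocab_size=1000)
prefix = torch.tensor([[21, 57, 3, 842]])
next_token_logits = model(prefix)            # shape: (1, 1000)
```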

<aside> 💡 The challenge lies in the intricacies of language. Many languages contain words with multiple meanings, known as homonyms, and it is the context within the sentence that clarifies the intended meaning, such as distinguishing between the different senses of "bank." Word order and sentence structure can also introduce what we call syntactic ambiguity.

</aside>

Transformers

<aside> 💡 In 2017, the paper "Attention Is All You Need" by Google and the University of Toronto introduced the transformer architecture. This approach revolutionized generative AI: it scales efficiently across multi-core GPUs, it can process input data in parallel and so train on much larger datasets, and it learns to pay attention to the meaning of the words it is processing. The title says it all: "attention is all you need."

</aside>

<aside> 💡 This paper is widely regarded as the starting point of modern generative AI:

</aside>

Attention is All You Need – Google Research

Transformer Architecture

The transformer architecture excels at learning the meaning and context of all the words in a sentence, not just the neighboring words but every word. Through attention weights, the model captures how each word relates to every other word, regardless of position. In an example sentence such as "The teacher taught the student with the book," this lets the model work out who has the book, who could possibly have the book, and whether the book is even relevant to the wider document context.

Attention weights are learned during large language model (LLM) training. The resulting visualization, known as an attention map, shows the weights between each word and every other word. In this simplified example, "book" is strongly connected to "teacher" and "student." This self-attention mechanism, which builds understanding across the entire input, significantly improves the model's ability to encode language.
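
As a rough sketch of what those weights look like in code, the following computes single-head scaled dot-product self-attention over one sequence. This is a simplification written for these notes: a real LLM learns separate query, key, and value projections and uses many attention heads.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor):
    """Minimal single-head self-attention sketch.

    x: (seq_len, d_model) embeddings for one sequence.
    Returns the attended values and the (seq_len, seq_len) attention map,
    i.e. how strongly each word attends to every other word.
    """
    d_model = x.size(-1)
    # For clarity the learned projections are omitted; real transformers learn
    # separate W_q, W_k, W_v matrices during training.
    q, k, v = x, x, x
    scores = q @ k.T / d_model ** 0.5      # pairwise relevance scores
    attn_map = F.softmax(scores, dim=-1)   # each row sums to 1: the attention weights
    return attn_map @ v, attn_map

# Hypothetical 8-token sentence, e.g. "the teacher taught the student with the book"
# (the embeddings are random here, just to show the shapes involved).
x = torch.randn(8, 16)
output, attn_map = self_attention(x)
print(attn_map.shape)   # torch.Size([8, 8]): the attention map described above
```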

The transformer architecture comprises two distinct parts: the encoder and the decoder. These components work together and share several similarities. In the standard diagram of the architecture, the model's inputs enter at the bottom and the outputs emerge at the top.
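
PyTorch ships a reference implementation of this encoder-decoder stack, which makes the two halves easy to see. The sketch below uses the dimensions from the original paper (a model dimension of 512, 8 attention heads, 6 layers per side) and feeds random vectors only to show the data flow; it omits the token embeddings, positional encodings, and output projection a real model would need.

```python
import torch
import torch.nn as nn

# Encoder-decoder transformer with the original paper's dimensions.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 1, 512)   # encoder input: 10 source tokens, batch of 1
tgt = torch.randn(7, 1, 512)    # decoder input: 7 target tokens generated so far
out = model(src, tgt)           # (7, 1, 512): one output vector per target position
```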

Machine-learning models are essentially big statistical calculators: they work with numbers, not words. Text must therefore be tokenized before it is passed to the model, converting each word into a number that corresponds to a position in a dictionary of all the words the model can handle. Tokenization methods vary; some assign an ID to each complete word, while others represent parts of words. What matters is that the same tokenizer used to train the model is also used when generating text.
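
As an illustration only, here is a toy word-level tokenizer with a made-up vocabulary (real models use subword schemes such as byte-pair encoding or WordPiece). It shows why the same tokenizer must be used at training and generation time: the IDs only mean anything relative to this particular dictionary.

```python
# Toy word-level tokenizer: each word maps to its position in a small dictionary.
vocab = {"<unk>": 0, "the": 1, "teacher": 2, "taught": 3,
         "student": 4, "with": 5, "book": 6}

def tokenize(text: str) -> list[int]:
    # Map each word to its ID; unseen words fall back to the <unk> token.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def detokenize(ids: list[int]) -> str:
    # Reverse lookup, used when turning the model's output back into text.
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

ids = tokenize("The teacher taught the student with the book")
print(ids)               # [1, 2, 3, 1, 4, 5, 1, 6]
print(detokenize(ids))   # "the teacher taught the student with the book"
```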

Once the text is represented as numbers, it is passed to the embedding layer, a trainable vector space in which each token becomes a distinct vector that encodes its meaning and context within the input sequence.
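
Continuing the toy example above, the embedding layer is just a trainable lookup table: each token ID selects one row, giving one vector per token. The 512-dimensional size matches the original transformer paper; the vocabulary size here is an arbitrary placeholder.

```python
import torch
import torch.nn as nn

# Trainable embedding table: one 512-dimensional vector per token ID.
vocab_size, d_model = 1000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[1, 2, 3, 1, 4, 5, 1, 6]])  # IDs from the tokenizer above
vectors = embedding(token_ids)                         # shape: (1, 8, 512)
```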
