1. Instruction Fine-Tuning

1.1 What is the need?

This week covers methods to enhance the performance of an existing model for a specific use case. It also delves into important metrics for evaluating the fine-tuned LLM's performance and quantifying its improvement over the initial base model.

Let's start by discussing how to fine tune an LLM with instruction prompts.

Last week, you saw that some models are capable of identifying the instructions contained in a prompt and correctly carrying out zero-shot inference.

Smaller LLMs, however, may struggle with zero-shot inference. In that case, including one or more examples of the task in the prompt, known as one-shot or few-shot inference, can help the model recognize the task and generate an appropriate completion.

Yet this method has a couple of downsides. First, it doesn't always work for smaller models, even with five or six examples. Second, any examples in your prompt use up space in the context window, reducing the room available for other important information.
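
To make the distinction concrete, here is a minimal sketch of the two prompt styles (the review texts are made-up placeholders):

```python
# Zero-shot: the prompt contains only the instruction and the input.
zero_shot_prompt = """Classify this review:
I loved this DVD!

Sentiment:"""

# One-shot: the prompt first shows one fully worked example, then poses
# the real task; few-shot inference simply adds more worked examples.
one_shot_prompt = """Classify this review:
I loved this DVD!

Sentiment: positive

Classify this review:
The plot was dull and the acting was worse.

Sentiment:"""
```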

1.2 What is Fine-Tuning?

You can use a method called fine-tuning to improve a base model further. Unlike pre-training, where an LLM learns from vast amounts of unstructured text, fine-tuning is a supervised learning process: it uses a dataset of labeled examples, each a pair of a prompt and its expected completion, to update the model's weights. This process helps the model get better at generating completions for a specific task.
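
As a sketch of what those labeled examples look like (the field names and review texts are illustrative, not from any particular library), each record pairs a prompt with its expected completion, and training concatenates the two so the usual next-token loss pushes the model toward the expected completion:

```python
# Each labeled example is a prompt-completion pair.
training_examples = [
    {"prompt": "Classify this review:\nI loved this DVD!\n\nSentiment:",
     "completion": " positive"},
    {"prompt": "Classify this review:\nTotal waste of money.\n\nSentiment:",
     "completion": " negative"},
]

# During fine-tuning, each pair is concatenated into a single training
# sequence; the cross-entropy loss on the completion tokens is what
# drives the weight updates.
training_texts = [ex["prompt"] + ex["completion"] for ex in training_examples]
```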

One useful strategy is called instruction fine-tuning, which can improve the model's performance on different tasks.

Instruction fine-tuning involves training the model with examples that show how it should respond to a particular instruction. For a sentiment-classification task, for example, the instruction is "classify this review," and the expected completion is a text string that begins with "Sentiment:" followed by either "positive" or "negative."

Your training dataset contains prompt-completion pairs for your specific task, each of which includes an instruction. For instance, to improve summarization, you'd use examples whose prompts start with "summarize." These examples teach the model to generate responses that follow the given instructions.

For translation, instructions like "translate this sentence" would be included.
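
The same pattern extends across tasks; here is a sketch of helper functions for building such examples (the instruction wording and target language are illustrative choices):

```python
def make_summarization_example(text: str, summary: str) -> dict:
    # Prompts for summarization begin with a "summarize" instruction.
    return {"prompt": f"Summarize the following text:\n{text}\n\nSummary:",
            "completion": f" {summary}"}

def make_translation_example(sentence: str, french: str) -> dict:
    # Prompts for translation begin with a "translate" instruction.
    return {"prompt": f"Translate this sentence to French:\n{sentence}\n\nTranslation:",
            "completion": f" {french}"}
```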

The memory optimization and parallel computing strategies covered last week can prove advantageous in this context.

Instruction fine-tuning, involving updates to all model weights, is called full fine-tuning. This process produces an updated model version with new weights.

Similar to pre-training, full fine-tuning demands sufficient memory and computational resources for storing and processing gradients, optimizers, and other training components.
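
As a rough back-of-the-envelope calculation (the per-parameter byte counts are approximate figures for 32-bit training with an Adam-style optimizer, and the activation overhead is an assumption):

```python
# Approximate GPU memory needed to fully fine-tune a 1B-parameter model in FP32.
num_params = 1e9         # illustrative model size

bytes_weights   = 4      # one FP32 weight per parameter
bytes_gradients = 4      # one FP32 gradient per parameter
bytes_optimizer = 8      # Adam keeps two FP32 moment estimates per parameter
bytes_activations = 8    # rough per-parameter allowance for activations (assumption)

total_gb = num_params * (bytes_weights + bytes_gradients
                         + bytes_optimizer + bytes_activations) / 1e9
print(f"~{total_gb:.0f} GB")  # ~24 GB, far more than the 4 GB the weights alone occupy
```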

1.3 But What About the Dataset?

First, get your training data ready. Many datasets were used to train earlier generations of language models, but most of them are not formatted as instructions. Luckily, there are prompt template libraries that can help. These templates can turn existing datasets, like Amazon product reviews, into instruction prompts for fine-tuning.

These libraries have templates for different tasks and datasets.

Here are three prompts designed for the Amazon reviews dataset, suitable for fine-tuning models in classification, text generation, and text summarization tasks. In each case, the original review (referred to as review_body) is fed into the template. The template starts with an instruction like "predict the associated rating," "generate a star review," or "give a short sentence describing the following product review."

This creates a prompt that combines the instruction with an example from the dataset.
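
Here is a minimal sketch of how such a template is applied (the template wording mirrors the first example above; review_body and star_rating are field names from the Amazon reviews dataset, while the dictionary format is illustrative rather than a specific library's API):

```python
# The template combines an instruction with a field from each record.
rating_template = ("Predict the associated rating from the following review:\n"
                   "{review_body}\n\n"
                   "Rating:")

record = {"review_body": "Arrived quickly, but the strap broke after two days.",
          "star_rating": 2}

prompt = rating_template.format(review_body=record["review_body"])
completion = f" {record['star_rating']}"  # the label becomes the expected completion
```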

1.3.1 Divide the Dataset

With your instruction dataset ready, divide it into training, validation, and test splits. During fine-tuning, you select prompts from the training set, input them into the LLM, and compare the generated completions with the expected responses from the data. In the example from the course, the model classifies a clearly positive review as merely neutral, which is a bit of an understatement.
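
One common way to make this split is with the Hugging Face datasets library, sketched below (the 80/10/10 ratio and the toy records are illustrative choices):

```python
from datasets import Dataset

# A toy instruction dataset; in practice this would hold thousands of examples.
instruction_data = [
    {"prompt": "Classify this review:\nGreat value.\n\nSentiment:", "completion": " positive"},
    {"prompt": "Classify this review:\nNever again.\n\nSentiment:", "completion": " negative"},
] * 50  # repeated only so the splits below are non-empty

dataset = Dataset.from_list(instruction_data)

# Carve off 20% for evaluation, then split it in half: 80/10/10 overall.
splits = dataset.train_test_split(test_size=0.2, seed=42)
held_out = splits["test"].train_test_split(test_size=0.5, seed=42)

train_set = splits["train"]
validation_set = held_out["train"]
test_set = held_out["test"]
```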

1.4 Working with Data