Revision
- Introduction:
- Word2Vec is a technique that converts words into vectors, capturing the semantic meaning of words based on their context.
- Process:
- The process begins with random word vectors and iterates through each word in the corpus, aiming to predict surrounding words using these vectors.
- Mathematical Model:
- The model used for prediction is represented as:
$$
P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}
$$
Here, $P(o|c)$ is the probability of an outside word $o$ given a center word $c$, and $u$ and $v$ are the ‘output’ (outside) and ‘input’ (center) vectors of the words. Over many iterations, the model adjusts these vectors to improve its predictions.
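As a concrete illustration, here is a minimal NumPy sketch of this probability; the vocabulary size, vector dimension, and random vectors are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                   # assumed toy sizes

U = rng.normal(size=(vocab_size, dim))   # ‘output’ (outside) vectors, one row per word
V = rng.normal(size=(vocab_size, dim))   # ‘input’ (center) vectors, one row per word

def prob_outside_given_center(o: int, c: int) -> float:
    """P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]          # dot product of v_c with every u_w
    scores -= scores.max()     # subtract the max to stabilize the exponentials
    exp_scores = np.exp(scores)
    return float(exp_scores[o] / exp_scores.sum())

print(prob_outside_given_center(o=2, c=4))  # probability of word 2 appearing near word 4
```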
- Learning Outcome:
- The algorithm refines the vectors so that they predict surrounding words better; as a result, similar words receive similar vectors, and meaningful directions emerge in the vector space.
Word2Vec Parameters and Computations
- Parameters:
- The Word2Vec model uses two sets of word vectors, represented as matrices U and V in the slide.
- U represents the ‘outside’ or ‘output’ word vectors, and V represents the ‘center’ or ‘input’ word vectors.
- Computations:
- The dot product of the center word vector with each outside word vector is calculated as the matrix-vector product $U v_4$; each entry of the result measures the similarity between the center word and one vocabulary word.
- The softmax function is applied to these dot products to calculate the probability of each word being a context word for the center word.
- This is represented as $\text{softmax}(U v_4)$; a worked example and code sketch follow below.


- Model Characteristics:
- The slide mentions that the model makes the same predictions at each position, indicating that it’s a “Bag of Words” model.
- The goal of the model is to give a reasonably high probability estimate to all words that occur often in the context.
Let’s assume we have a vocabulary of 5 words: {Word1, Word2, Word3, Word4, Word5}. The dimension of the word vectors is 3.
The ‘center’ word vectors (V) could look like this:
$$
V = \begin{bmatrix}
v_{1,1} & v_{1,2} & v_{1,3} \\
v_{2,1} & v_{2,2} & v_{2,3} \\
v_{3,1} & v_{3,2} & v_{3,3} \\
v_{4,1} & v_{4,2} & v_{4,3} \\
v_{5,1} & v_{5,2} & v_{5,3}
\end{bmatrix}
\quad \text{(rows: Word1 through Word5)}
$$
And the ‘outside’ word vectors (U) could look like this:
$$
U = \begin{bmatrix}
u_{1,1} & u_{1,2} & u_{1,3} \\
u_{2,1} & u_{2,2} & u_{2,3} \\
u_{3,1} & u_{3,2} & u_{3,3} \\
u_{4,1} & u_{4,2} & u_{4,3} \\
u_{5,1} & u_{5,2} & u_{5,3}
\end{bmatrix}
\quad \text{(rows: Word1 through Word5)}
$$
In these matrices, each row corresponds to a word in the vocabulary, and each column corresponds to a dimension of the word vector. The values $v_{i,j}$ and $u_{i,j}$ represent the coordinates of the word vectors in the vector space.
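To make this concrete, here is a small NumPy sketch that fills toy $5 \times 3$ matrices with random values and computes $\text{softmax}(U v_4)$ as described above; the specific numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim = 5, 3                   # 5-word vocabulary, 3-dimensional vectors

V = rng.normal(size=(vocab_size, dim))   # ‘center’ vectors, one row per word
U = rng.normal(size=(vocab_size, dim))   # ‘outside’ vectors, one row per word

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

v4 = V[3]                  # center vector for Word4 (row 3 when 0-indexed)
scores = U @ v4            # dot product of v4 with every outside vector
probs = softmax(scores)    # probability of each word being a context word

print(probs)               # one entry per vocabulary word; sums to 1
```

Because this is a bag-of-words model, the same probability vector is used at every position in the context window.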
Optimization: Gradient Descent
- Introduction:
- Gradient Descent is an optimization algorithm used to minimize a cost function $J(\theta)$ by iteratively adjusting the parameters $\theta$.
- Process:
- The algorithm starts with an initial random value of $\theta$.
- It calculates the gradient of $J(\theta)$ at this point, which gives the direction of the steepest increase.
- It then takes a small step in the opposite direction (the negative gradient) in the hope of decreasing $J(\theta)$.

- This process is repeated until it reaches a point where $J(\theta)$ can no longer be decreased significantly, which is considered the minimum of the function.
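As a concrete illustration, here is a minimal sketch of the standard update rule $\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate (step size); the quadratic cost function and the hyperparameter values below are assumptions chosen for the example.

```python
import numpy as np

# Assumed toy cost J(theta) = ||theta||^2, whose gradient is 2 * theta.
# Any differentiable cost with a computable gradient works the same way.
def J(theta: np.ndarray) -> float:
    return float(theta @ theta)

def grad_J(theta: np.ndarray) -> np.ndarray:
    return 2 * theta

rng = np.random.default_rng(0)
theta = rng.normal(size=3)   # start from a random initial value
alpha = 0.1                  # learning rate, chosen for the example

for step in range(1000):
    g = grad_J(theta)                       # direction of steepest increase
    new_theta = theta - alpha * g           # small step in the opposite direction
    if abs(J(theta) - J(new_theta)) < 1e-12:
        break                               # no significant decrease: stop
    theta = new_theta

print(step, J(theta))  # J(theta) ends up near its minimum of 0
```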