Revision
- Introduction:
- Word2Vec is a technique that converts words into vectors, capturing the semantic meaning of words based on their context.
- Process:
- The process begins with random word vectors and iterates through each word in the corpus, aiming to predict surrounding words using these vectors.
- Mathematical Model:
- The model used for prediction is represented as:
$$
P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}
$$
Here, $P(o|c)$ is the probability of an outside word $o$ given a center word $c$, and $u$ and $v$ are the ‘output’ (outside) and ‘input’ (center) vectors of the words. Over many iterations, the model adjusts these vectors to improve its predictions.
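As a concrete illustration, here is a minimal NumPy sketch of this probability; the vocabulary size, vector dimension, and random vectors are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3                   # assumed toy sizes

U = rng.normal(size=(vocab_size, dim))   # ‘output’ (outside) vectors, one row per word
V = rng.normal(size=(vocab_size, dim))   # ‘input’ (center) vectors, one row per word

def prob_outside_given_center(o: int, c: int) -> float:
    """P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]          # dot product of v_c with every u_w
    scores -= scores.max()     # subtract the max to stabilize the exponentials
    exp_scores = np.exp(scores)
    return float(exp_scores[o] / exp_scores.sum())

print(prob_outside_given_center(o=2, c=4))  # probability of word 2 appearing near word 4
```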
- Learning Outcome:
- The algorithm refines the vectors so that they predict surrounding words better; as a result, similar words receive similar vectors, and meaningful directions emerge in the vector space.
Word2Vec Parameters and Computations
- Parameters:
- The Word2Vec model uses two sets of word vectors, represented as matrices U and V in the slide.
- U represents the ‘outside’ or ‘output’ word vectors, and V represents the ‘center’ or ‘input’ word vectors.
- Computations:
- The dot product of the center word vector with each outside word vector is calculated as the matrix-vector product $U v_4$; each entry of the result measures the similarity between the center word and one vocabulary word.
- The softmax function is applied to these dot products to calculate the probability of each word being a context word for the center word.
- This is represented as $\text{softmax}(U v_4)$; a worked example and code sketch follow below.


- Model Characteristics:
- The slide mentions that the model makes the same predictions at each position, indicating that it’s a “Bag of Words” model.
- The goal of the model is to give a reasonably high probability estimate to all words that occur often in the context.
Let’s assume we have a vocabulary of 5 words: {Word1, Word2, Word3, Word4, Word5}. The dimension of the word vectors is 3.
The ‘center’ word vectors (V) could look like this:
$$
V = \begin{bmatrix}
v_{1,1} & v_{1,2} & v_{1,3} \\
v_{2,1} & v_{2,2} & v_{2,3} \\
v_{3,1} & v_{3,2} & v_{3,3} \\
v_{4,1} & v_{4,2} & v_{4,3} \\
v_{5,1} & v_{5,2} & v_{5,3}
\end{bmatrix}
\quad \text{(rows: Word1 through Word5)}
$$
And the ‘outside’ word vectors (U) could look like this:
$$
U = \begin{bmatrix}
u_{1,1} & u_{1,2} & u_{1,3} \\
u_{2,1} & u_{2,2} & u_{2,3} \\
u_{3,1} & u_{3,2} & u_{3,3} \\
u_{4,1} & u_{4,2} & u_{4,3} \\
u_{5,1} & u_{5,2} & u_{5,3}
\end{bmatrix}
\quad \text{(rows: Word1 through Word5)}
$$
In these matrices, each row corresponds to a word in the vocabulary, and each column corresponds to a dimension of the word vector. The values $v_{i,j}$ and $u_{i,j}$ represent the coordinates of the word vectors in the vector space.
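To make this concrete, here is a small NumPy sketch that fills toy $5 \times 3$ matrices with random values and computes $\text{softmax}(U v_4)$ as described above; the specific numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, dim = 5, 3                   # 5-word vocabulary, 3-dimensional vectors

V = rng.normal(size=(vocab_size, dim))   # ‘center’ vectors, one row per word
U = rng.normal(size=(vocab_size, dim))   # ‘outside’ vectors, one row per word

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

v4 = V[3]                  # center vector for Word4 (row 3 when 0-indexed)
scores = U @ v4            # dot product of v4 with every outside vector
probs = softmax(scores)    # probability of each word being a context word

print(probs)               # one entry per vocabulary word; sums to 1
```

Because this is a bag-of-words model, the same probability vector is used at every position in the context window.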
Optimization: Gradient Descent
- Introduction:
- Gradient Descent is an optimization algorithm used to minimize a cost function $J(\theta)$ by iteratively adjusting the parameters $\theta$.
- Process:
- The algorithm starts with an initial random value of $\theta$.
- It calculates the gradient of $J(\theta)$ at this point, which gives the direction of the steepest increase.
- It then takes a small step in the opposite direction (the negative gradient) in the hope of decreasing $J(\theta)$.

- This process is repeated until it reaches a point where $J(\theta)$ can no longer be decreased significantly, which is considered the minimum of the function.
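As a concrete illustration, here is a minimal sketch of the standard update rule $\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate (step size); the quadratic cost function and the hyperparameter values below are assumptions chosen for the example.

```python
import numpy as np

# Assumed toy cost J(theta) = ||theta||^2, whose gradient is 2 * theta.
# Any differentiable cost with a computable gradient works the same way.
def J(theta: np.ndarray) -> float:
    return float(theta @ theta)

def grad_J(theta: np.ndarray) -> np.ndarray:
    return 2 * theta

rng = np.random.default_rng(0)
theta = rng.normal(size=3)   # start from a random initial value
alpha = 0.1                  # learning rate, chosen for the example

for step in range(1000):
    g = grad_J(theta)                       # direction of steepest increase
    new_theta = theta - alpha * g           # small step in the opposite direction
    if abs(J(theta) - J(new_theta)) < 1e-12:
        break                               # no significant decrease: stop
    theta = new_theta

print(step, J(theta))  # J(theta) ends up near its minimum of 0
```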