Word Embeddings
RNNs are often used to process language, for example to map from one sequence of
words to another sequence of words. RNNs, like other NNs, take vectors as input,
so to get them to process words we need some way of turning words into vectors.
One way to do this is with a one-hot encoding (like in Section 3). But there are many other ways to produce lower-dimensional, more informative representations of words, called "word embeddings".
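As a quick illustration, here is a minimal one-hot encoding over a toy vocabulary (the vocabulary and helper name below are just for illustration, not part of the exercise code):

```python
import numpy as np

# Toy vocabulary and word-to-index mapping, purely illustrative.
vocab = ["woman", "man", "doctor", "nurse"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 in the word's position and 0s elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("doctor"))  # [0. 0. 1. 0.]
```

Note that one-hot vectors are as long as the vocabulary and carry no information about which words are related, which is what motivates learned embeddings.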
One popular technique for producing word embeddings is called word2vec. You can
read more about it on Wikipedia, but briefly, it assigns each word a vector
embedding such that pairs of words that appear together in a training sentence
have embeddings that are close in vector space.
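As a rough sketch of how such embeddings are trained (these are not the embeddings used in this exercise), the gensim library's Word2Vec can be fit on a list of tokenized sentences; the toy corpus and hyperparameters below are purely illustrative:

```python
from gensim.models import Word2Vec

# Tiny toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["the", "doctor", "examined", "the", "patient"],
    ["the", "nurse", "helped", "the", "patient"],
]

# Train a small skip-gram model; these hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["doctor"]  # a 50-dimensional embedding for "doctor"
```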
Here, we investigate a phenomenon that will affect any RNN that uses these
embeddings. We've trained embeddings for all the words that appeared in a large
dataset of news articles. Instead of loading embeddings for every word in this
set (together they take up about 3 GB!), we have pre-selected several words
whose embeddings have interesting properties.
`words = ["woman", "man", "boy", "girl", "doctor", "nurse",
"programmer", "homemaker", "queen", "king", "receptionist",
"librarian", "socialite", "hairdresser", "nanny", "bookkeeper",
"stylist", "maestro", "protege", "philosopher", "captain",
"architect", "surgeon", "brilliant", "mother", "father"]`
4A) The function below computes the cosine distance between two
embedding vectors:
$$d(w_1, w_2) = 1 - \frac{w_1 \cdot w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}$$
[Note that the distance increases as the angle between two vectors increases.]
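In case the helper isn't at hand, a sketch consistent with this formula might look like the following (the function name `distance` is an assumption, not necessarily the name used in the accompanying code):

```python
import numpy as np

def distance(w1, w2):
    """Cosine distance between embedding vectors w1 and w2."""
    return 1.0 - np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
```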
Find the distances between pairs of words from this list. Compare the distances
to each other. Which relative distances are as expected? Which are unexpected?
4B) In which cases might the unexpected distances have unintended
consequences? Why might this be happening?
Food for thought
Discussion Guide
4A) "mother" is closer to "woman" than it is to "man"; similarly, "father" is closer to "man" than to "woman". Same trend with "king"/"queen". However, non-gendered professions such as "nanny", "programmer" (!), "architect", and "receptionist" show a gender bias.
4B) Machine translation might show this bias (e.g., the doctor example in Spanish). Resume screening is another case: if someone had been part of the "Society of Women Engineers," that might move the embedding of their resume away from "programmer". This likely happens because the embeddings reflect co-occurrence patterns in the news corpus: professions mentioned alongside one gender more often than the other end up closer to that gender in the embedding space.