CBOW vs. skip-gram: why invert context and target words?
On this page, it is said that:

[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]

However, looking at the training dataset it produces, the contents of the X and Y pairs seem to be interchangeable, since both of these (X, Y) pairs appear:

(quick, brown), (brown, quick)

So, why distinguish so much between context and target if they end up being the same thing?

Also, while doing Udacity's Deep Learning course exercise on word2vec, I wonder why they draw such a sharp distinction between the two approaches in this problem:

An alternative to [...]
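To make the question concrete, here is a minimal sketch of how I understand the pair generation in each model (my own illustration, not the tutorial's or the Udacity notebook's code), assuming a window size of 1 and the usual example sentence:

```python
# Sketch of training-pair generation, assuming window size 1.
# Skip-gram: one (target, context_word) pair per context word.
# CBOW: one (context_words, target) example per position.
words = "the quick brown fox jumped over the lazy dog".split()
window = 1

skip_gram_pairs = []   # (input = target word, label = one context word)
cbow_pairs = []        # (input = all context words, label = target word)

for i, target in enumerate(words):
    context = [words[j]
               for j in range(max(0, i - window), min(len(words), i + window + 1))
               if j != i]
    for c in context:
        skip_gram_pairs.append((target, c))
    cbow_pairs.append((tuple(context), target))

print(skip_gram_pairs[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
print(cbow_pairs[:3])
# [(('quick',), 'the'), (('the', 'brown'), 'quick'), (('quick', 'fox'), 'brown')]
```

As the skip-gram output shows, both ('quick', 'brown') and ('brown', 'quick') appear as training pairs, which is exactly why the X/Y roles look interchangeable to me at the dataset level.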