Word2Vec is it for word only in a sentence or for features as well?


Question


I would like to ask more about Word2Vec:

I am currently trying to build a program that computes the embedding vectors for a sentence. At the same time, I am also building a feature extractor using scikit-learn to extract lemma 0, lemma 1 and lemma 2 from the sentence.

From my understanding:

1) Feature extraction: lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded for each character (this can be achieved by using gensim Word2Vec; I have tried it)

More explanation:

Sentence = "I have a pen". Word = token of the sentence, for example, "have"

1) Feature extraction

"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:

[[0,0,1],
[1,0,0],
[0,1,0]]
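
A minimal sketch of this kind of categorical one-hot encoding with scikit-learn's OneHotEncoder (the lemma values below are just the toy tokens from the example; OneHotEncoder orders its columns alphabetically, so the column order may differ from the matrix above):

from sklearn.preprocessing import OneHotEncoder

# Toy lemma values taken from "I have a pen"; each row is one token's lemma.
lemmas = [["i"], ["have"], ["a"]]

encoder = OneHotEncoder()                # default output is a sparse matrix
one_hot = encoder.fit_transform(lemmas)
print(encoder.categories_)               # the learned lemma vocabulary: ['a', 'have', 'i']
print(one_hot.toarray())                 # 3 rows x 3 columns of 0/1 values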

2) Word embedding (Word2Vec)

"I have a pen" ---> "I", "have", "a", "pen"(tokenized) then word2vec from gensim will produced matrices for example if using window_size = 2 produced:

[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]

The floating-point and integer numbers are for explanation purposes only; the real data will vary depending on the sentence. These are just dummy values to explain the idea.

Questions:

1) Is my understanding of Word2Vec correct? If yes, what is the difference between feature extraction and Word2Vec?
2) I am curious whether I can use Word2Vec to get the feature-extraction embedding too, since from my understanding Word2Vec only finds an embedding for each word, not for the features.

Hopefully someone can help me with this.


Answer 1:


It's not completely clear what you're asking, as you seem to have many concepts mixed up together. (Word2Vec gives vectors per word, not per character; word embeddings are a kind of feature extraction on words, rather than an alternative to 'feature extraction'; etc. So I doubt your understanding is yet correct.)

"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.

One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...

['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']

...then you have 7 unique case-flattened words...

['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:

 [1, 1, 1, 1, 1, 0, 0]  # A pen will need ink
 [1, 1, 0, 0, 0, 1, 1]  # I have a pen
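
One way to produce this kind of binary bag-of-words encoding is scikit-learn's CountVectorizer with binary=True (a sketch; the non-default token_pattern is needed because the default pattern drops single-letter words like "a" and "i", and the columns come out in alphabetical vocabulary order rather than the order listed above):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["A pen will need ink", "I have a pen"]

# binary=True records 1/0 presence instead of counts; the token_pattern keeps
# single-letter words such as "a" and "i", which the default pattern discards.
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
one_hot = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['a' 'have' 'i' 'ink' 'need' 'pen' 'will']
print(one_hot.toarray())
# [[1 0 0 1 1 1 1]   <- "A pen will need ink"
#  [1 1 1 0 0 1 0]]  <- "I have a pen"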

Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
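For example, with SciPy, using the two 7-dimensional vectors above:

from scipy.spatial.distance import cosine, euclidean

ink_sentence = [1, 1, 1, 1, 1, 0, 0]  # A pen will need ink
pen_sentence = [1, 1, 0, 0, 0, 1, 1]  # I have a pen

print(euclidean(ink_sentence, pen_sentence))  # ~2.24 (square root of the 5 differing positions)
print(cosine(ink_sentence, pen_sentence))     # ~0.55 (the sentences share only 'a' and 'pen')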

Other very-similar possible alternative feature-encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted-counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled to be anywhere from 0.0 to values higher than 1.0).
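A sketch of the weighted-count variant with scikit-learn's TfidfVectorizer (its default settings also L2-normalize each sentence vector, so this is just one of several possible TF/IDF weightings):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["A pen will need ink", "I have a pen"]

# Words appearing in every sentence (like "a" and "pen") are downweighted
# relative to words that are unique to one sentence.
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
weighted = tfidf.fit_transform(sentences)
print(weighted.toarray().round(3))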

Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.

Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:

[0, 1, 0, 0, 0, 0, 0]  # 'pen'

If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:

[0.236, -0.711]  # 'pen'

All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):

[-0.101, 0.271]   # 'a'
[0.236, -0.711]   # 'pen'
[0.302, 0.293]    # 'will'
[0.672, -0.026]   # 'need'
[-0.198, -0.203]  # 'ink'
[0.734, -0.345]   # 'i'
[0.288, -0.549]   # 'have'
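
A minimal gensim sketch that would produce this kind of 2-dimensional dense embedding (the tiny 2-sentence corpus is far too small to learn meaningful vectors and only shows the mechanics; older gensim versions call the vector_size parameter size instead):

from gensim.models import Word2Vec

corpus = [
    ["a", "pen", "will", "need", "ink"],
    ["i", "have", "a", "pen"],
]

# vector_size=2 only so the vectors are easy to print; real models typically
# use 100+ dimensions and far more training text.
model = Word2Vec(corpus, vector_size=2, window=2, min_count=1, seed=1)

for word in ["a", "pen", "will", "need", "ink", "i", "have"]:
    print(word, model.wv[word])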

If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:

[1, 1, 0, 0, 0, 1, 1]  # I have a pen

...you'd get a single 2-dimensional dense vector like:

[ 0.28925, -0.3335 ]  # I have a pen
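
Averaging is straightforward with NumPy, reusing the toy model from the sketch above (a hypothetical example; the actual numbers will differ from the made-up values here):

import numpy as np

tokens = ["i", "have", "a", "pen"]
# One 2-dimensional vector for the whole sentence: the element-wise mean
# of the per-word vectors.
sentence_vector = np.mean([model.wv[t] for t in tokens], axis=0)
print(sentence_vector)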

And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.

So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".

Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.



Source: https://stackoverflow.com/questions/52379317/word2vec-is-it-for-word-only-in-a-sentence-or-for-features-as-well
