Using word2vec to classify words in categories

Submitted by 别来无恙 on 2019-12-18 11:31:31

Question


BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john', 'jay', 'dan', 'nathan', 'bob']  -> 'Names'
['yellow', 'red', 'green'] -> 'Colors'
['tokyo', 'bejing', 'washington', 'mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across Word2vec. This library has a "similarity" and a "most_similar" function which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Calculate its similarity with each word in each vector and take an average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector and take an average, then do the same for the other two vectors. The vector that gives me the highest similarity average would be the correct vector for the input to belong to.
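For illustration, here is a minimal sketch of this averaging idea using gensim's KeyedVectors (the model file name and the average_similarity helper are illustrative assumptions, not a definitive implementation):

# Sketch of the brute-force averaging approach, assuming gensim and a
# locally available pre-trained model (the file name is illustrative).
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
  'GoogleNews-vectors-negative300.bin', binary=True)

def average_similarity(query, words):
  # Mean similarity between the query and each in-vocabulary word
  sims = [model.similarity(query, w) for w in words if w in model]
  return sum(sims) / len(sims) if sims else 0.0

data = {'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
        'Colors': ['yellow', 'red', 'green'],
        'Places': ['tokyo', 'bejing', 'washington', 'mumbai']}
print(max(data, key=lambda c: average_similarity('pink', data[c])))  # expect 'Colors'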

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure that this is the best approach, so I am looking for help and suggestions on better approaches to solve my problem. I am open to all suggestions, and please point out any mistakes I may have made, as I am new to the machine learning and NLP world.


Answer 1:


If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here's my solution below:

import numpy as np

# Category -> words
data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query
def process(query):
  # Note: raises KeyError if the query word is missing from the GloVe vocabulary
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    # Dot product with the query, averaged over the category's word count
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))

In order to run it, you'll have to download and unpack the pre-trained GloVe data from the GloVe project page at https://nlp.stanford.edu/projects/glove/ (careful, 800Mb!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score. Remember that the model size only depends on the data you have and the words you might want to be able to query.
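If you want a single predicted label rather than the raw scores, a small hypothetical wrapper around process() could look like this (classify is not part of the answer above, just an illustration):

# Hypothetical helper: pick the category with the highest average score
def classify(query):
  scores = process(query)
  return max(scores, key=scores.get)

print(classify('pink'))    # -> 'Colors', per the scores above
print(classify('moscow'))  # -> 'Places'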




Answer 2:


Also, for what it's worth, PyTorch has a good and faster implementation of GloVe these days.
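For instance, a minimal sketch using torchtext's GloVe loader (an assumption on my part; the answer doesn't point to a specific API):

# Sketch assuming torchtext; downloads and caches the vectors on first use
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)
pink_vector = glove['pink']  # a torch.Tensor of shape (100,)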



Source: https://stackoverflow.com/questions/47666699/using-word2vec-to-classify-words-in-categories
