Creating vector space

问题

I've got a question: I have a lot of documents and each line built by some pattern. Of course, I have this array of patterns.

I want to create some vector space, then to vector this patterns by some rule (I have no ideas about what is this rule yet..) - i.e. to make this patterns like "centroids" of my vector space. Then to vector each line of the current document (again by this rule) and to count the closet centroid to this line (i.e. minimum of the distance between two vectors).

I don't know how can I do this? I know about sklearn libraries and CountVectorizer/TfidfVectorizer/HashingVectorizer - but this depends on the vocabulary size. But, again, I have a lot of documents, that's why it'll be too much words in vocabulary (if do this way, but in next new document it can be new word which this vocabulary wouldn't have. That's way this is wrong way to solve my problem) Also Keras library with it's Text Preprocessing won't solve my problem two. E.x. "one hot" encodes a text into a list of word indexes of size . But each document may have different size and of course the order. That's way comparing two vectors may give big distance, but in fact this vectors (words, that encoded by this vectors) are very similar.

来源：https://stackoverflow.com/questions/53607110/creating-vector-space

标签

python

python-3.x

machine-learning

nlp

jupyter-notebook

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!