问题
I've got a question: I have a lot of documents and each line built by some pattern. Of course, I have this array of patterns.
I want to create some vector space, then to vector this patterns by some rule (I have no ideas about what is this rule yet..) - i.e. to make this patterns like "centroids" of my vector space. Then to vector each line of the current document (again by this rule) and to count the closet centroid to this line (i.e. minimum of the distance between two vectors).
I don't know how can I do this? I know about sklearn libraries and CountVectorizer/TfidfVectorizer/HashingVectorizer - but this depends on the vocabulary size. But, again, I have a lot of documents, that's why it'll be too much words in vocabulary (if do this way, but in next new document it can be new word which this vocabulary wouldn't have. That's way this is wrong way to solve my problem) Also Keras library with it's Text Preprocessing won't solve my problem two. E.x. "one hot" encodes a text into a list of word indexes of size . But each document may have different size and of course the order. That's way comparing two vectors may give big distance, but in fact this vectors (words, that encoded by this vectors) are very similar.
来源:https://stackoverflow.com/questions/53607110/creating-vector-space