Get the most repeated similar fields in MySQL database

微笑、不失礼 submitted on 2019-12-06 05:43:15

What you are describing is a text clustering process: you are trying to find similar pieces of text and then arbitrarily choose one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.
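As a rough illustration of the idea above, here is a minimal sketch in plain Python. The sample activity strings, the stop-word set of user names, and the function names are all made up for the example. Instead of a full singular value decomposition it uses power iteration, which only recovers the largest singular value and its vectors; a numerical library would be the better choice for real data.

```python
import math
import re

def tokenize(text, stop_words=()):
    """Lowercase word tokens, dropping stop words such as user names."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in stop_words]

def term_document_matrix(docs, stop_words=()):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    tokenized = [tokenize(d, stop_words) for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, tokens in enumerate(tokenized):
        for w in tokens:
            matrix[index[w]][j] += 1.0
    return vocab, matrix

def top_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # A v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix has no non-zero entries")
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

# Hypothetical activity log; the user names are treated as stop words.
docs = [
    "alice uploaded a new photo",
    "bob uploaded a new photo",
    "carol uploaded a new photo",
    "dave changed his password",
]
names = {"alice", "bob", "carol", "dave"}
vocab, matrix = term_document_matrix(docs, names)
sigma, u, v = top_singular_triplet(matrix)
for doc, weight in zip(docs, v):
    print(f"{weight:.3f}  {doc}")
```

Documents with a large weight in `v` lie along the dominant direction, so in this toy data the three "uploaded a new photo" rows stand out and the password row does not.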

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?
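One way to read "proportion of words that overlap" is the Jaccard index: shared words divided by all distinct words in the two actions. That reading, the tokenization rule, and the function names below are assumptions made for this sketch.

```python
import re

def word_set(text):
    """Distinct lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_similarity(a, b):
    """Jaccard index: shared words divided by all distinct words."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

def overlap_distance(a, b):
    """Distance in [0, 1]; 0 means the same set of words."""
    return 1.0 - overlap_similarity(a, b)

print(overlap_similarity("uploaded a new photo", "uploaded a new video"))
```

With the terms already stored in a table, the same ratio could be computed from two counts per pair of actions: the number of shared terms and the number of distinct terms overall.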

MvG

First off, you'll have to decide whether you want to compare a given input to all existing texts, or do a pairwise comparison of all texts. Your question asks for the latter, but the application you outline sounds more like the former.

If you compare only a single input with your database, then I'd expect Levenshtein distance computation to be fast enough up to medium database sizes. There will probably be few ways to make things faster unless you store some form of intermediate data structure about the current content of your text base. Recomputing anything for every new input will probably be just as costly.
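The single-input case can be sketched as follows. The rows are assumed to have been fetched from the database already as (id, text) pairs; `closest_rows` and the sample data are hypothetical names for illustration.

```python
def levenshtein(a, b):
    """Edit distance with unit costs, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_rows(query, rows, k=3):
    """Return the k (distance, row_id, text) entries nearest to the query."""
    scored = [(levenshtein(query, text), row_id, text) for row_id, text in rows]
    scored.sort(key=lambda item: (item[0], item[1]))
    return scored[:k]

# Hypothetical table content.
rows = [
    (1, "uploaded a new photo"),
    (2, "uploaded a new video"),
    (3, "changed password"),
]
for distance, row_id, text in closest_rows("uploaded a new photo", rows, k=2):
    print(distance, row_id, text)
```

Each comparison costs time proportional to the product of the two text lengths, so the total work grows linearly with the number of rows for one query.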

If you want to do a comparison for every pair, then a Levenshtein computation for each of them will take too much time. You'll have to devise some other concept of similarity. The first thing that comes to my mind, which would be somewhat resilient to different forms of a word, would be a suffix tree. You could insert all paragraphs into that tree. Where suffix trees normally store a single pointer, you might want to store a pair of indices, one identifying the database row and the other denoting a position in the text of that row. After building the tree, you could traverse it to identify common substrings, and increment some similarity counter for the corresponding pair.

You'll have to experiment a bit to tune this measure. You might want to impose a minimum length for a common string before you increment a counter. As long texts have a larger chance of common words even if they are semantically unrelated, you might have to compensate for length in some way. I doubt there is a canonical way to do this.
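A full suffix tree is more than fits in a short example, so the sketch below swaps in a much simpler structure: a dictionary of all fixed-length substrings, each mapped to (row, position) pairs. It is not a suffix tree, but it shows the pair counters, the minimum length, and one possible length compensation. All names, the sample rows, and the choice of dividing by the window count of the shorter text are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_index(rows, min_length):
    """Map every substring of length min_length to its (row_id, position) pairs."""
    index = defaultdict(list)
    for row_id, text in rows.items():
        text = text.lower()
        for pos in range(len(text) - min_length + 1):
            index[text[pos:pos + min_length]].append((row_id, pos))
    return index

def shared_substring_counts(rows, min_length=8):
    """Count, per pair of rows, the distinct substrings of min_length they share."""
    counts = defaultdict(int)
    for occurrences in build_index(rows, min_length).values():
        row_ids = sorted({row_id for row_id, _ in occurrences})
        for pair in combinations(row_ids, 2):
            counts[pair] += 1
    return dict(counts)

def length_compensated(counts, rows, min_length=8):
    """Divide each count by the number of windows in the shorter text."""
    def windows(row_id):
        return len(rows[row_id]) - min_length + 1
    return {(a, b): n / min(windows(a), windows(b)) for (a, b), n in counts.items()}

# Hypothetical rows keyed by database id.
rows = {
    1: "John uploaded a new photo",
    2: "Mary uploaded a new photo",
    3: "Joe changed his password",
}
counts = shared_substring_counts(rows)
print(counts)
print(length_compensated(counts, rows))
```

A substring that occurs in many rows contributes to every pair of those rows, so very common phrases can make the counting step expensive; that is one of the things a real implementation would need to tune.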

The term-document matrix approach suggested by Gordon sounds interesting as well, and you should be able to implement that in PHP, too. That approach will be more sensitive to changes of word form, even if the root is the same. On the other hand, it might be easier to keep a suitable matrix for that stored in your database, and to keep that structure in sync when you update your main text table. Both of these approaches differ fundamentally from Levenshtein distance: they care less about the overall order. I believe that this is a good thing in your case, because they'll consider the text “John read a book after he went swimming in the lake” more similar to “After swimming in the lake, Joe read a book” than Levenshtein distance would.
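To make the word-order point concrete, the sketch below scores the two example sentences twice: once with cosine similarity over word counts (a stand-in for the term-document view) and once with Levenshtein distance scaled to the range 0 to 1. The scaling by the longer text's length and the function names are assumptions made for the comparison.

```python
import math
import re
from collections import Counter

def levenshtein(a, b):
    """Edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """1 minus the edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    count_a = Counter(re.findall(r"[a-z0-9']+", a.lower()))
    count_b = Counter(re.findall(r"[a-z0-9']+", b.lower()))
    dot = sum(count_a[w] * count_b[w] for w in count_a)
    norm = (math.sqrt(sum(n * n for n in count_a.values()))
            * math.sqrt(sum(n * n for n in count_b.values())))
    return dot / norm if norm else 0.0

first = "John read a book after he went swimming in the lake"
second = "After swimming in the lake, Joe read a book"
print("word-count cosine:", round(cosine_similarity(first, second), 3))
print("scaled Levenshtein:", round(levenshtein_similarity(first.lower(), second.lower()), 3))
```

The word-count score ignores where in the sentence each word appears, while the edit distance has to pay for moving whole clauses, which is why it rates this pair lower.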

Your example indicates that you not only want to rank similarities, but also decide on cluster boundaries, i.e. say “these form a group” and “those belong to distinct groups”. There won't be a clean cut-off for this, so you'll have to experiment with heuristics for that as well, unless always choosing the most similar text, or the k most similar texts, is enough for your application. In any case, I'd concentrate on the similarity computation first, and add things like user name replacement later on.
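One simple heuristic for drawing group boundaries, given purely as an illustration, is to link every pair whose similarity reaches a threshold and treat each connected set of texts as a group. The threshold value, the default word-overlap measure, and the function names below are assumptions; any similarity function with the same signature could be passed in.

```python
import re
from itertools import combinations

def word_overlap(a, b):
    """Shared words divided by all distinct words."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def group_by_threshold(texts, threshold, similarity=word_overlap):
    """Link pairs scoring at least threshold; return groups of text indices."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if similarity(texts[i], texts[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical activity texts with user names already removed.
texts = [
    "uploaded a new photo",
    "uploaded a new video",
    "changed his password",
    "changed her password",
]
print(group_by_threshold(texts, threshold=0.5))
print(group_by_threshold(texts, threshold=0.9))
```

Lowering the threshold merges groups and raising it splits them, so the value has to be chosen by trying it on real data. Chains of pairwise-similar texts can also pull quite different texts into one group.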
