Get the most repeated similar fields in MySQL database

Let's assume we have a database like:

Actions_tbl:

--------------------------------------------------------
id | Action_name                              | user_id|
--------------------------------------------------------
1  |  John reads one book                     | 1     
2  |  reading the book by john                | 1
3  |  Joe is jumping over fire                | 2
4  |  reading another book                    | 2
5  |  John reads the book in library          | 1
6  |  Joe read a    book                      | 2
7  |  read a book                             | 3
8  |  jumping with no reason is Ronald's habit| 3

Users_tbl:

-----------------------
user_id |    user_name |
-----------------------
1       |     John
2       |     Joe
3       |     Ronald
4       |     Araz
-----------------------

Wondering if I can choose the most repeated similar action regardless of it's user and replace my own user_name with its current user!

Read one book, reading the book, reading another book, read the book in library, read a book and read a book are the ones who have most common WORDS so the staffs related to reading the book is repeated 6 times, my system should show one of those six sentences randomly and replace Araz with user_name

Like: Araz reads the book

My Idea was to

select replace(a.action_name , b.user_name) from actions_tbl a, user_tble b where a.user_id = b.user_id group_by

and then check the similarities one by one in php using

levenshtein()

But this one doesn't have performance at all!

Assume that I want to do the same thing for a big db and for few different tables. This will destroy my server!!!

Any better IDEA?

in http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function but firstly, do u think it has enough performance? and then, how to use it in my case? Maybe a self-join van fix this issue but I'm not that good with sql!

* similar action, are the actions that have more than X% common words

** More information and notes:**

I'm limited to PHP and MySQL.
This is just an example, in my real project the actions are long paragraphs. That's why the performance is a matter. The real scenario is: user inputted the description of its project for several projects, those data may be too similar(users would have the same area of work), I want to fill automatically(base on previous fillings) the description of next project, to save time.
I would appreciate if you can have any pragmatical Solution. I checked the NLP related solutions, although they r interesting, but I don't think many of them can be accurate and can be implemented using PHP.
The output should make sense and be a proper paragraph like all other projects. That's why I was thinking of choosing from previous ones.

Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations

What you are talking about is a text clustering process. You are trying to find similar pieces of text, and arbitrarily choosing one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?

MvG

First off, you'll have to decide whether you want to compare a given input to all existing texts, or do a pairwise comparison of all texts. Your question asks for the latter, but the application you outline sounds more like the former.

If you compare only a single input with your database, I then I'd have hoped levenshtein distance computation to be fast enough up to medium database sizes. And there probably will be few ways to make things any faster unless you store some form of intermediate data structure about the current content of your text base. Recomputing anything for every new input will probably be just as costly.

If you want to do a comparison for every pair, then a levenshtein computation for each of them will take too much time. You'll have to devise some other concept of similarity. The first thing that comes to my mind, which would be somewhat resilient to different forms of a word, would be a suffix tree. You could insert all paragraphs into that tree. Where suffix trees normally store a single pointer, you might want to store a pair of indices, one identifying the database row and the other denoting a position in the text of that row. After building the tree, you could traverse it to identify common substrings, and increment some similarity counter for the corresponding pair. You'll have to experiment a bit to tune this measure. You might want to impose a minimum length for a common string before you increment a counter. As long texts have a larger chance of common words even if they are semantically unrelated, you might have to compensate for length in some way. I doubt there is a canonical way to do this.

The term-document matrix approach suggested by Gordon sounds interesting as well, and you should be able to implement that in PHP, too. That approach will be mor sensitive to changes of word form, even if the root is the same. On the other hand, it might be easier to keep a suitable matrix for that stored in your database, and to keep that structure in sync when you update your main text table. Both of these approaches have a fundamental difference to levenshtein distance: they care less about the overall order. I belive that this is a good thing in your case, because they'll consider the texts “John read a book after he went swimming in the lake” more similar to “After swimming in the lake, Joe read a book” than levenshtein distance would.

Your example indicates that you not only want to rank similarities, but also decide on cluser boundaries, I.e. say “these form a group” and “those belong to distinct groups”. There won't be a clean cut-off for this, so you'll have to experiment with heuristics for that as well. Unless always chosing the most similar text, or the k most similar texts, is enough for your application. In any case, I'd concentrate on the similarity computation first, and add things like user name replacement later on.

来源：https://stackoverflow.com/questions/11538409/get-the-most-repeated-similar-fields-in-mysql-database

标签

php

mysql

sql

levenshtein-distance