Question
I am using the output of myisam_ftdump to generate a search suggestions table. The process went smoothly, but many words appear in the index multiple times. I could simply run SELECT DISTINCT term FROM suggestions ORDER BY weight, but doesn't that penalize words for showing up more than once?
If it does, is there a concise formula for merging the rows?
If it does not, which rows should I keep (e.g., highest weighted, lowest weighted)?
Example Data
+-----+------------+----------+
| id  | word       | weight   |
+-----+------------+----------+
| 670 | young      | 0.416022 |
| 669 | york       | 0.54944  |
| 668 | years      | 0.281683 |
| 667 | years      | 0.416022 |
| 666 | wrote      | 0.416022 |
| 665 | written    | 0.35841  |
| 664 | writing    | 0.29518  |
| 663 | wright     | 0.281683 |
| 662 | witness    | 0.281683 |
| 661 | wiesenthal | 0.452452 |
| 660 | white      | 0.35841  |
| 659 | white      | 0.281683 |
| 658 | wgbh       | 0.369332 |
| 657 | weighs     | 0.35841  |
+-----+------------+----------+
See especially 'white' and 'years'.
Answer 1:
It looks like you ran myisam_ftdump -d. I think you want to use myisam_ftdump -c instead.
That will give you one row per word, along with a count of how many times that word appears in the index, and its global weight.
Here's the doc on -c vs. -d:
-c, --count Calculate per-word stats (counts and global weights).
-d, --dump Dump index (incl. data offsets and word weights).
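To make the "one row per word" shape concrete, here is a minimal Python sketch that collapses duplicate rows like those in the example data. It is only an illustration: the function name collapse_duplicates is hypothetical, and keeping the highest per-row weight is an assumed merge rule, not the global weight that myisam_ftdump -c calculates.

```python
from collections import defaultdict

# A few (word, weight) rows copied from the example data in the question.
ROWS = [
    ("young", 0.416022),
    ("york", 0.54944),
    ("years", 0.281683),
    ("years", 0.416022),
    ("white", 0.35841),
    ("white", 0.281683),
]


def collapse_duplicates(rows):
    """Return {word: (count, max_weight)} with one entry per distinct word.

    Assumption: the highest per-row weight is kept. This is NOT the global
    weight reported by myisam_ftdump -c; it only shows the merged shape.
    """
    grouped = defaultdict(list)
    for word, weight in rows:
        grouped[word].append(weight)
    return {word: (len(ws), max(ws)) for word, ws in grouped.items()}


if __name__ == "__main__":
    # Print one line per word: occurrence count, kept weight, word.
    for word, (count, weight) in sorted(collapse_duplicates(ROWS).items()):
        print(f"{count:3d}  {weight:.6f}  {word}")
```

If the real per-word statistics are needed, the output of myisam_ftdump -c is the better source, since it is computed from the index itself rather than from an assumed merge rule.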
Source: https://stackoverflow.com/questions/5225696/how-should-i-handle-duplicate-entries-weights-in-myisam-search-index