So I have a database of words between 3 and 20 characters long. I want to code something in PHP that finds all of the smaller words that are contained within a larger word.
Here is a simple solution that should be pretty efficient, but it will only work up to a certain word length (it will probably break down at around 15-20 characters, depending on whether the letters making up the word are high-frequency letters with lower values or low-frequency letters with higher values):

1. Assign a prime number to each letter in order of frequency, so that `e` is 2, `t` = 3, `a` = 5, etc., using frequency values from here or some similar source.
2. Calculate the value of each word by multiplying together the primes for its letters, and store it in a `bigint` data type column. For instance, `tea` would have a value of `3*2*5=30`. If a word has repeated letters, repeat the factor, so that `teat` should have a value of `3*2*5*3=90`.
3. To check whether a word, such as `rain`, is contained inside of another word, such as `inward`, it's sufficient to check if the value for `rain` divides the value for `inward`. In this case, `inward = 14213045`, `rain = 7315`, and `14213045` is divisible by `7315`, so the word `rain` is inside the word `inward`.
4. A `bigint` column has a maximum value of `9223372036854775807`, which should be fine up to about 15-20 characters (depending on the frequencies of letters in the word). For instance, I picked up the first 20-letter word from here, which is `antiinstitutionalism`, and has a value of `6901041299724096525`, which would just barely fit inside the `bigint` column. However, the 14-letter word `xylopyrography` has a value of `635285791503081662905`, which is too big. You might have to handle the really large ones as special cases using an alternate method, but hopefully there are few enough of them that it would still be relatively efficient.

The query would work something like the demo I've prepared here: http://www.sqlfiddle.com/#!2/9bd27/8
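To make the steps concrete, here is a rough sketch of the idea in Python (chosen only because its integers don't overflow, which makes the `bigint` check easy to show; the same logic should port to PHP with some care around 64-bit integers). The frequency order in `LETTERS_BY_FREQUENCY` is an assumption on my part, not the table used for the numbers above, so values for words like `rain` will come out different from the ones quoted. The function names are made up for illustration too.

```python
# Assumed letter-frequency order (hypothetical; swap in whatever table you use).
# Only e=2, t=3, a=5 are guaranteed to match the values quoted in the answer.
LETTERS_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

# Largest value a signed 64-bit bigint column can hold.
BIGINT_MAX = 9223372036854775807


def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Most frequent letter gets the smallest prime.
LETTER_PRIME = dict(zip(LETTERS_BY_FREQUENCY, first_primes(26)))


def word_value(word):
    """Multiply the primes of all letters; repeated letters repeat the factor."""
    value = 1
    for letter in word.lower():
        value *= LETTER_PRIME[letter]
    return value


def is_contained(small, large):
    """True if every letter of `small` (with multiplicity) occurs in `large`."""
    return word_value(large) % word_value(small) == 0


def fits_in_bigint(word):
    """True if the word's value can be stored in a signed 64-bit column."""
    return word_value(word) <= BIGINT_MAX


if __name__ == "__main__":
    print(word_value("tea"))                  # 30
    print(word_value("teat"))                 # 90
    print(is_contained("rain", "inward"))     # True
    print(fits_in_bigint("xylopyrography"))   # False with this table as well
```

Note that "contained" here means the letters of the smaller word can all be taken from the larger word, in any order; it is not a substring test. Words for which `fits_in_bigint` is false are the special cases that would need the alternate method.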