I'm trying to use MongoDB to implement a natural language dictionary. I have a collection of lexemes, each of which has a number of wordforms as subdocuments.
One possibility would be to store all the variants that you think might be useful as an array element, though I'm not sure whether that would be practical:
{
"number" : "pl",
"surface_form" : "skrejjen",
"surface_forms: [ "skrej", "skre" ],
"phonetic" : "'skrɛjjɛn",
"pattern" : "CCCVCCVC"
}
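If you did go that route, a multikey index on the array would let you match any stored variant with a single equality query. A minimal sketch in the mongo shell, where the collection name "lexemes" and the field path "forms.surface_forms" are my assumptions, not something from your schema:

// Multikey index: indexes every element of the variants array.
// "lexemes" and "forms.surface_forms" are assumed names.
db.lexemes.createIndex({ "forms.surface_forms": 1 })

// An equality match against an array field matches any of its elements,
// so this finds the lexeme whether the user searched "skrej" or "skre".
db.lexemes.find({ "forms.surface_forms": "skrej" })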
I would probably also suggest not storing 1000 word forms with each word, but turning this around to have smaller documents. The smaller your documents are, the less MongoDB has to read into memory for each search (as long as the search conditions don't require a full scan, of course):
{
"word": {
"pos" : "N",
"lemma" : "skrun",
"gloss" : "screw",
},
"form" : {
"number" : "sg",
"surface_form" : "skrun",
"phonetic" : "ˈskruːn",
"gender" : "m"
},
"source" : "Mayer2013"
}
{
"word": {
"pos" : "N",
"lemma" : "skrun",
"gloss" : "screw",
},
"form" : {
"number" : "pl",
"surface_form" : "skrejjen",
"phonetic" : "'skrɛjjɛn",
"pattern" : "CCCVCCVC"
},
"source" : "Mayer2013"
}
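With one form per document, a single index then turns the common lookup into an index seek instead of a collection scan. A sketch in the mongo shell, assuming these documents live in a collection called "forms" (the collection name is my assumption; the field path comes from the documents above):

// One ascending index on the embedded surface form.
db.forms.createIndex({ "form.surface_form": 1 })

// Finds the plural entry above via the index, without a full collection scan.
db.forms.find({ "form.surface_form": "skrejjen" })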
I also doubt that MySQL would perform better here with searches for random word forms, as it would have to do a full table scan just as MongoDB would. The only thing that could help there is a query cache, but that is something you could quite easily build into your search UI/API in your application.
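As a sketch of what such an application-level cache could look like, here is a minimal memoised lookup in Node.js using the official mongodb driver; the connection URI, database name, and collection name are all placeholders I've assumed:

// Illustrative in-process query cache; all names below are assumptions.
const { MongoClient } = require("mongodb");

const client = new MongoClient("mongodb://localhost:27017"); // assumed URI
const cache = new Map(); // surface form -> cached array of matching documents

async function findForms(surfaceForm) {
    if (cache.has(surfaceForm)) {
        return cache.get(surfaceForm); // cache hit: no database round trip
    }
    await client.connect(); // no-op if the client is already connected
    const results = await client
        .db("dictionary")       // assumed database name
        .collection("forms")    // assumed collection name
        .find({ "form.surface_form": surfaceForm })
        .toArray();
    cache.set(surfaceForm, results); // repeated identical searches hit the Map
    return results;
}

Repeated searches for the same form then skip MongoDB entirely; in a real deployment you would want to bound the cache size or add expiry.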