As suggested by Derick, I refactored the data in my database such that I have "wordforms" as a collection rather than as subdocuments under "lexemes".
The results were in fact better!
Here are some speed comparisons. The last example in each table uses hint to intentionally bypass the index on surface_form; in the old schema, that was actually faster than using the index.
Old schema (see original question)

Query                                                             Avg. time
db.lexemes.find({"wordforms.surface_form":"skrun"})               0s
db.lexemes.find({"wordforms.surface_form":/^skr/})                1.0s
db.lexemes.find({"wordforms.surface_form":/skru/})                > 3 mins (!)
db.lexemes.find({"wordforms.surface_form":/skru/}).hint('_id_')   2.8s
New schema (see Derick's answer)

Query                                                             Avg. time
db.wordforms.find({"surface_form":"skrun"})                       0s
db.wordforms.find({"surface_form":/^skr/})                        0.001s
db.wordforms.find({"surface_form":/skru/})                        1.4s
db.wordforms.find({"surface_form":/skru/}).hint('_id_')           3.0s
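One way to see why the anchored regex is so much faster than the unanchored one: the index on surface_form is ordered, so /^skr/ can be turned into a bounded range scan, while /skru/ forces an inspection of every key (and hint('_id_') forces a scan of every document instead). A toy sketch of that difference, with hypothetical dictionary data rather than MongoDB itself:

```javascript
// Toy model of an ordered index on surface_form: a sorted array of keys.
const index = ["ampskru", "skrun", "skrupla", "skruplat", "zebra"].sort();

// Anchored prefix query (/^skr/): binary-search to the first candidate key,
// then walk forward until keys stop matching -- a bounded range scan.
function prefixScan(keys, prefix) {
  let lo = 0, hi = keys.length;
  while (lo < hi) {                      // lower bound of the prefix
    const mid = (lo + hi) >> 1;
    if (keys[mid] < prefix) lo = mid + 1; else hi = mid;
  }
  const out = [];
  for (let i = lo; i < keys.length && keys[i].startsWith(prefix); i++) {
    out.push(keys[i]);
  }
  return out;
}

// Unanchored query (/skru/): no starting point can be derived from the
// pattern, so every key has to be inspected.
function fullScan(keys, re) {
  return keys.filter(k => re.test(k));
}

console.log(prefixScan(index, "skr"));  // only the contiguous "skr..." range is touched
console.log(fullScan(index, /skru/));   // all keys are inspected
```

This is only a model of the access pattern, not of MongoDB's actual B-tree, but it captures why /^skr/ dropped to 0.001s while /skru/ stayed at full-scan speed.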
For me this is pretty good evidence that the refactored schema makes searching faster, and is worth the redundant data (or the extra join required).
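The "extra join" is just an application-side lookup: fetch the matching wordform, then fetch its parent lexeme by the stored reference. A minimal sketch, assuming each wordform carries a lexeme_id field pointing back to its lexeme (the field name and sample data are illustrative, not from the original schema):

```javascript
// Toy stand-ins for the two collections after refactoring.
const lexemes = new Map([
  [1, { _id: 1, lemma: "skrun" }],
]);
const wordforms = [
  { _id: 10, lexeme_id: 1, surface_form: "skrun" },
  { _id: 11, lexeme_id: 1, surface_form: "skrejjen" },
];

// Equivalent of db.wordforms.find({surface_form: sf}) followed by a
// second query on lexemes by _id -- the "join" done in the application.
function findLexemeBySurfaceForm(sf) {
  const wf = wordforms.find(w => w.surface_form === sf); // indexed lookup in MongoDB
  return wf ? lexemes.get(wf.lexeme_id) : null;          // fetch parent by reference
}
```

The cost is one extra round trip per hit, which the timings above suggest is cheap next to the regex scan itself.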