Search Number Plates using Solr

妖精的绣舞 posted on 2019-12-03 09:09:59

A chain of PatternReplaceCharFilterFactory char filters that convert numbers to letters (one per conversion you need to cover), plus a phonetic filter to match similar-sounding words, could work as a starting point.

You should do this both at index and query time. This should work, BUT you would probably want 'john' to match 'john' with a higher score than 'jo11n', right?
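
As a rough sketch of that starting point, a field type could look like the one below. The name plate_phonetic and the three conversions are only illustrative assumptions, and DoubleMetaphoneFilterFactory is just one of the phonetic filters Solr ships with; any other could be swapped in.

```xml
<!-- Sketch only: the field type name and the conversions are illustrative assumptions. -->
<fieldType name="plate_phonetic" class="solr.TextField" positionIncrementGap="100">
    <!-- A single analyzer element without a type attribute is used at both index and query time. -->
    <analyzer>
        <!-- One char filter per number-to-letter conversion; the two-digit patterns come first. -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="11" replacement="h" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="12" replacement="r" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="4" replacement="a" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- Phonetic filter so that similar-sounding tokens can match each other. -->
        <filter class="solr.DoubleMetaphoneFilterFactory" inject="false" />
    </analyzer>
</fieldType>
```

The char filters run in the order they are declared, so a pattern such as "11" has to be listed before any single-digit pattern that would otherwise consume part of it.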

So you should use copyField to search several fields with different boosts: one with the original text, one with the number-to-letter conversion applied, one with the phonetic filter applied, and so on. You can get as fancy as you need.
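
A minimal sketch of such a setup might look like this. The field names, the plate_exact and plate_alias types (assumed to be defined elsewhere in the schema), and the boost values are all hypothetical; the boosts are applied here through the qf parameter of the edismax query parser, which is one possible way to weight the fields.

```xml
<!-- Schema fragment: all field and type names here are hypothetical. -->
<field name="namePlate" type="plate_exact" indexed="true" stored="true" />
<field name="namePlate_alias" type="plate_alias" indexed="true" stored="false" />
<field name="namePlate_phonetic" type="plate_phonetic" indexed="true" stored="false" />

<!-- Copy the raw value into the differently analyzed fields. -->
<copyField source="namePlate" dest="namePlate_alias" />
<copyField source="namePlate" dest="namePlate_phonetic" />

<!-- solrconfig.xml fragment: search all three fields, weighting the original text highest. -->
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">namePlate^3 namePlate_alias^2 namePlate_phonetic</str>
    </lst>
</requestHandler>
```

With weights like these, a plate that matches on the original text should normally score above one that only matches after conversion or phonetic encoding, though the exact boost values would need tuning against real data.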

You might also write your own Analyzer, but I would leave that for later, in case the built-in ones turn out not to be good enough.

I like Persimmonium's answer; I am writing to detail it a bit further. An analyzer might look like this:

<fieldType name="character_alias" class="solr.TextField">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="synonym_characters.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto" />
    </analyzer>
</fieldType>

I have chosen the MappingCharFilterFactory instead of the suggested PatternReplaceCharFilterFactory because it lets you provide a single list of the characters to be replaced, which is handier.

A synonym_characters.txt might look like this:

"11" => "H"
"12" => "R"
"4" => "A"

For the phonetic part I have chosen the BeiderMorseFilter. Although it is made for surnames rather than given names, it delivers rather good results when run against a small batch of samples from the site you linked:

+--+---------+----------+
|id|namePlate|score     |
+--+---------+----------+
|2 |john     |1.2513144 |
+--+---------+----------+
|3 |jo11n    |1.2513144 |
+--+---------+----------+
|4 |jon 52   |0.54745007|
+--+---------+----------+
|6 |107 jon  |0.54745007|
+--+---------+----------+
|8 |jon 52   |0.54745007|
+--+---------+----------+
|5 |40 jon   |0.4692429 |
+--+---------+----------+

Using the analyzer above, we can map:

"H" => "11"
"4" => "A"
"8" => "A"

But this way "4" also ends up matching "8", since both are mapped to "A". I don't know how to avoid this problem.
