Searching names with Apache Solr

抹茶落季 2020-12-12 21:20

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by

5 answers

一向 (original poster)

2020-12-12 21:43
We created a simple 'name' field type that lets you combine the 'key' (e.g., SOUNDEX) and 'pairwise' approaches from the answers above.

Here's the overview:
1. At index time, each field of the custom type is indexed into a set of sub-fields, whose values are used for high-recall matching of different kinds of name variation.
Here's the core of its implementation:
```
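// Called at index time: expands one name value into several Lucene fields.
List<IndexableField> createFields(SchemaField field, String name) {
    // One FieldSpec per derived sub-field (phonetic keys, lowercased forms, etc.).
    Collection<FieldSpec> nameFields = deriveFieldsForName(name);
    List<IndexableField> docFields = new ArrayList<>();
    for (FieldSpec fs : nameFields) {
        docFields.add(new Field(fs.getName(), fs.getStringValue(),
                fs.getLuceneField()));
    }
    // Also add the full name as doc values under the original field name.
    docFields.add(createDocValues(field.getName(), new Name(name)));
    return docFields;
}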
```
The heart of this is deriveFieldsForName(name) in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
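
To make the 'key' idea concrete, here is a minimal sketch of what a deriveFieldsForName-style helper might produce for a single name token. It is not our implementation: the class name NameKeys, the "_lower" and "_soundex" sub-field suffixes, and the Map return type are assumptions made for illustration, and the Soundex below is a simplified ASCII-only version rather than a Solr phonetic filter.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps sub-field name -> indexed key for one name token. */
final class NameKeys {
    // Soundex digit for each letter A..Z; '0' marks letters that carry no code.
    private static final String CODES = "01230120022455012623010202";

    /** Simplified American Soundex for ASCII letters; other characters are dropped. */
    static String soundex(String token) {
        String s = token.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        StringBuilder key = new StringBuilder().append(s.charAt(0));
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && key.length() < 4; i++) {
            char c = s.charAt(i);
            if (c == 'H' || c == 'W') continue;      // H and W do not separate equal codes
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != prev) key.append(code);
            prev = code;                              // a vowel resets prev, so a repeat counts again
        }
        while (key.length() < 4) key.append('0');     // pad short keys to four characters
        return key.toString();
    }

    /** Assumed sub-field naming: baseField + "_lower" and baseField + "_soundex". */
    static Map<String, String> deriveFieldsForName(String baseField, String token) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(baseField + "_lower", token.toLowerCase(Locale.ROOT));
        fields.put(baseField + "_soundex", soundex(token));
        return fields;
    }
}
```

With this sketch, "Robert" and "Rupert" share the same Soundex key, so a document indexed under one can be recalled by a query for the other.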
2. At query time, a custom Lucene query is produced first. It is tuned for recall and uses the same sub-fields that were populated at index time.
Here's the core of its implementation:
```
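// Called at query time: turns the raw query string into a recall-oriented Lucene query.
public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
    Name name = parseNameString(externalVal, parser.getParams());
    QuerySpec querySpec = buildQuery(name);
    // Translate the QuerySpec into a Lucene query for this field.
    return querySpec.accept(new SolrQueryVisitor(field.getName()));
}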
```
The heart of this is the buildQuery(name) method, which should produce a query that is aware of deriveFieldsForName(name) above, so that for a given query name it finds good candidate names.
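
One simple way to get that recall, sketched below, is to OR together one clause per derived sub-field, so a hit on any key is enough to make a document a candidate. This is only an illustration: it builds a query string in standard Lucene query syntax rather than the QuerySpec object used above, and the class name NameQuerySketch and the sub-field names in the usage note are assumptions.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical sketch: one quoted term clause per derived sub-field, joined with OR. */
final class NameQuerySketch {
    static String buildQuery(Map<String, String> derivedFields) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, String> e : derivedFields.entrySet()) {
            // Escape backslashes and double quotes so the value stays inside its quoted term.
            String value = e.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
            clauses.add(e.getKey() + ":\"" + value + "\"");
        }
        return clauses.toString();
    }
}
```

Given the entries name_lower=john and name_soundex=J500 (in that order), this yields name_lower:"john" OR name_soundex:"J500". A real implementation would likely weight the clauses differently, which this sketch does not attempt.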
3. Then, Solr's Rerank feature is used to apply a high-precision re-scoring algorithm that reorders the candidate results.
Here's what this looks like in your query:
```
&rq={!myRerank reRankQuery=$rrq}&rrq={!func}myMatch(fieldName, "John Doe")
```
The content of myMatch could be a pairwise Levenshtein or Jaro-Winkler implementation.
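
As a rough idea of the pairwise part, here is a minimal Levenshtein-based similarity in plain Java. It is a sketch, not our myMatch: the class name NameSimilarity and the normalization to a 0..1 score are assumptions, and wiring it into Solr as a function query is left out.

```java
/** Hypothetical sketch of a pairwise scorer that a myMatch-style function could call. */
final class NameSimilarity {
    /** Classic dynamic-programming edit distance, keeping only two rows in memory. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;   // distance from the empty prefix of a
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;         // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 when every character differs. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }
}
```

For example, similarity("John Doe", "Jon Doe") is 0.875, since one edit separates the two and the longer string has eight characters.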

N.B. Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/) to handle more kinds of variations than the ones mentioned above (e.g., missing spaces, cross-language).