Searching names with Apache Solr

抹茶落季 2020-12-12 21:20

I've just ventured into the seemingly simple but extremely complex world of searching. For an application, I am required to build a search mechanism for searching users by

5 answers
  •  一向 (original poster)
    2020-12-12 21:43

    We created a simple 'name' field type that allows mixing both 'key' (e.g., SOUNDEX) and 'pairwise' portions of the answers above.

    Here's the overview:

    1. At index time, fields of the custom type are indexed into a set of (sub)fields, whose values are used for high-recall matching of different kinds of variation.

    Here's the core of its implementation...

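    List<IndexableField> createFields(SchemaField field, String name) {
            // One FieldSpec per derived sub-field (phonetic key, lowercased form, ...).
            Collection<FieldSpec> nameFields = deriveFieldsForName(name);
            List<IndexableField> docFields = new ArrayList<>();
            for (FieldSpec fs : nameFields) {
                docFields.add(new Field(fs.getName(), fs.getStringValue(),
                             fs.getLuceneField()));
            }
            // Add the full name as doc values alongside the derived sub-fields.
            docFields.add(createDocValues(field.getName(), new Name(name)));
            return docFields;
    }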
    

    The heart of this is deriveFieldsForName(name) in which you can include 'keys' from PhoneticFilters, LowerCaseFolding, etc.
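To make the idea concrete, here is a minimal sketch of what a deriveFieldsForName could look like. It is not our actual implementation: the extra baseField parameter, the sub-field suffixes "_lc" and "_sdx", the plain FieldSpec holder, and the hand-written American Soundex are all assumptions for illustration. In a real Solr plugin you would more likely take the keys from an existing phonetic encoder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: derives (sub-field name, value) pairs for one name.
class NameKeys {

    // Simple holder; the real FieldSpec also carries a Lucene field type.
    static final class FieldSpec {
        final String name;
        final String value;
        FieldSpec(String name, String value) { this.name = name; this.value = value; }
    }

    // Soundex digit for each letter A..Z ('0' means "not coded").
    private static final String CODES = "01230120022455012623010202";

    // American Soundex of a single token, e.g. "Robert" and "Rupert" share a key.
    static String soundex(String token) {
        String s = token.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char prevCode = '0';
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') continue;      // ignore anything but A-Z
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                     // first letter is kept as-is
                prevCode = code;
            } else if (c == 'H' || c == 'W') {
                continue;                          // H/W do not separate equal codes
            } else {
                if (code != '0' && code != prevCode) out.append(code);
                prevCode = code;                   // vowels reset the previous code
                if (out.length() == 4) break;
            }
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');  // pad to letter + 3 digits
        return out.toString();
    }

    // One lowercased value and one Soundex key per whitespace-separated token.
    static List<FieldSpec> deriveFieldsForName(String baseField, String name) {
        List<FieldSpec> specs = new ArrayList<>();
        for (String token : name.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            specs.add(new FieldSpec(baseField + "_lc", token.toLowerCase(Locale.ROOT)));
            String key = soundex(token);
            if (!key.isEmpty()) specs.add(new FieldSpec(baseField + "_sdx", key));
        }
        return specs;
    }
}
```

With this sketch, "Jon Doe" and "John Doe" produce the same "_sdx" keys, which is what makes the index-time fields useful for recall.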

    2. At query time, a custom Lucene query is produced first; it is tuned for recall and uses the same fields as at index time.

    Here's the core of its implementation...

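    public Query getFieldQuery(QParser parser, SchemaField field, String externalVal) {
            // Parse the raw query string into a structured name.
            Name name = parseNameString(externalVal, parser.getParams());
            QuerySpec querySpec = buildQuery(name);
            // Translate the abstract query spec into a Lucene query on this field's sub-fields.
            return querySpec.accept(new SolrQueryVisitor(field.getName()));
    }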
    

    The heart of this is the buildQuery(name) method which should produce a query that is aware of deriveFieldsForName(name) above so for a given query name it will find good candidate names.
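As a rough illustration of such a recall-oriented query, the sketch below builds a query string in Lucene query syntax rather than the QuerySpec object used above. The class name, the sub-field suffixes "_lc" and "_key", the boost of 2, and passing the key function as a parameter are all assumptions made for this example; the only real requirement is that the query applies the same derivation that was used at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a recall-oriented query over the derived sub-fields.
class RecallQuery {

    // keyFn must be the same key derivation that was applied at index time.
    static String buildQuery(String baseField, String name, UnaryOperator<String> keyFn) {
        List<String> clauses = new ArrayList<>();
        for (String token : name.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String lc = token.toLowerCase(Locale.ROOT);
            // Exact lowercased match, boosted so it outranks key-only matches.
            clauses.add(baseField + "_lc:\"" + quote(lc) + "\"^2");
            // Key match (for example a phonetic code) to catch spelling variants.
            clauses.add(baseField + "_key:\"" + quote(keyFn.apply(token)) + "\"");
        }
        // OR keeps recall high: any single matching clause yields a candidate.
        return String.join(" OR ", clauses);
    }

    // Escapes backslashes and double quotes for use inside a quoted phrase.
    static String quote(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Joining everything with OR deliberately over-generates candidates; precision is recovered by the rerank step that follows.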

    3. Then Solr's Rerank feature is used to apply a high-precision rescoring algorithm that reorders the results.

    Here's what this looks like in your query...

    &rq={!myRerank reRankQuery=$rrq} &rrq={!func}myMatch(fieldName, "John Doe")
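If you assemble the request by hand, the parameter values need URL encoding because of the braces, quotes, and spaces. The following is only a sketch of that step: myRerank and myMatch are our custom plugin names from above, while the helper class, the q parameter value, and the assumption that the name contains no double quotes are illustrative.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the URL-encoded query string for a rerank request.
class RerankParams {

    // Assumes queryName contains no double quotes (no escaping is done here).
    static String build(String userQuery, String field, String queryName) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        // rq refers to the rerank query through the $rrq parameter dereference.
        params.put("rq", "{!myRerank reRankQuery=$rrq}");
        params.put("rrq", "{!func}myMatch(" + field + ", \"" + queryName + "\")");

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```

A client library such as SolrJ would normally do this encoding for you; the sketch just shows which three parameters travel together.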
    

    The content of myMatch could be a pairwise Levenshtein or Jaro-Winkler implementation.
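For the Levenshtein option, the pairwise score could be sketched as below. This is a plain edit distance normalized to a 0..1 similarity, not our proprietary matcher; the class name, the case folding, and the normalization by the longer string's length are assumptions chosen for illustration.

```java
import java.util.Locale;

// Hypothetical sketch of a pairwise name score based on Levenshtein distance.
class NameMatch {

    // Classic dynamic-programming edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;          // roll the rows
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 for identical strings (ignoring case).
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        if (max == 0) return 1.0;
        return 1.0 - (double) levenshtein(x, y) / max;
    }
}
```

For example, similarity("John Doe", "Jon Doe") is 0.875: one deletion over a longest length of eight characters.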

    N.B. Our own full implementation uses proprietary code for deriveFieldsForName, buildQuery, and myMatch (see http://www.basistech.com/text-analytics/rosette/name-indexer/) to handle more kinds of variation than the ones mentioned above (e.g., missing spaces, cross-language matching).
