Hybrid search and indexing: words and token metadata in Solr

Question

I am building a set of plugins for Solr to enable a "hybrid" search that matches either words or token (not document!) metadata, namely specific ID numbers. The same word may have different ID numbers in different contexts; the IDs are generated at indexing time by an external application. For example, "run" may have 12345 in one case and 54321 in another, depending on the context. The ID numbers should carry more weight in the search. (They will be provided in the query at search time by the same external application.)

I read about custom fields for documents and I was wondering if we could store a blob there with these IDs, but I am not sure how to include it in the search.

Or should I just pretend these IDs are "synonyms" (maybe surrounding them with some kind of unique marking, like [:12345:]) and use the synonym factory tokenizers?

I am new to Solr, but I have read the relevant documentation, so I think I understand how it all works conceptually. Performance does not matter at this stage; this is a PoC. It looks somewhat similar to "Search different tokens on different fields in Solr", but not exactly. Oh, and I want to tokenise the text myself, too, but that's not an issue.

EDIT: [removed the bit about payloads, it is irrelevant here. Sorry about the confusion]

Answer 1:

Unless I've misunderstood, as you've already generated the magic tokens, the only requirement is to see if the magic token value is present in a field, and if it is, score the field higher.

Index the magic token values to one field, and the textual values to another. Use boosting to prioritise matches in the magic token field over matches in the textual values field. Judging from your description, the magic token field can probably be an integer field, for example one based on tint.
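As a rough illustration of the two-field layout, here is a minimal Python sketch that turns the external application's output into a document with a text field and a token field. The field names text and token are taken from the queries in this answer; the id field, the build_doc helper, the (word, ID) input format, and the idea that token is multi-valued are all assumptions made for the example, not something the question specifies.

```python
import json

def build_doc(doc_id, annotated_tokens):
    """Build one Solr-style document from (word, token_id or None) pairs.

    Hypothetical helper: the pair format stands in for whatever the
    external application actually produces.
    """
    # All words go to the textual field, in their original order.
    words = [word for word, _ in annotated_tokens]
    # Only words that received an ID contribute to the token field
    # (assumed to be a multi-valued integer field in the schema).
    ids = [token_id for _, token_id in annotated_tokens if token_id is not None]
    return {"id": doc_id, "text": " ".join(words), "token": ids}

doc = build_doc("doc1", [("run", 12345), ("fast", 32145), ("today", None)])
# A JSON array of such documents is the shape Solr's JSON update handler accepts.
print(json.dumps([doc]))
```

Keeping the IDs in their own field means the text field can use whatever tokenisation you like, while the token field only ever holds the numbers.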

When searching, you can generate the search string as:

q=(token:12345^5 OR text:run) AND (token:32145^5 OR text:fast)

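The ^5 boost weights a match in the token field five times as heavily as a match in the text field when the score is computed. If you don't care whether 12345 also matches in the text field, you can use the dismax or edismax query parser (qf is a parameter of those parsers) instead: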

q=12345 run 32145 fast&qf=text token^5

You might have to tweak mm (minimum should match) to control how many of the query terms must match, and therefore how many hits you get, depending on what your application needs.
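Since the external application supplies the IDs at search time, both query forms above can be generated from the same list of (word, ID) pairs. The following is a minimal sketch under that assumption; the helper names, the pair format, and the choice of edismax as defType are illustrative, and it assumes the words are plain alphanumeric terms that need no query-syntax escaping.

```python
from urllib.parse import urlencode

def build_boolean_query(pairs, boost=5):
    """Explicit form: each word may match by its boosted ID or by its text."""
    clauses = [f"(token:{token_id}^{boost} OR text:{word})" for word, token_id in pairs]
    return " AND ".join(clauses)

def build_edismax_params(pairs, boost=5, mm=None):
    """qf form: all terms go into q, and the boost is applied per field."""
    terms = []
    for word, token_id in pairs:
        terms.extend([str(token_id), word])
    params = {
        "q": " ".join(terms),
        "defType": "edismax",          # assumption: qf needs dismax or edismax
        "qf": f"text token^{boost}",
    }
    if mm is not None:
        params["mm"] = mm              # only sent when the caller wants to tune it
    return params

pairs = [("run", 12345), ("fast", 32145)]
print(build_boolean_query(pairs))
print(urlencode(build_edismax_params(pairs, mm="2")))
```

The first function reproduces the explicit boolean query shown earlier; the second produces URL parameters for the qf variant, with mm left out unless you decide to set it.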

Source: https://stackoverflow.com/questions/24581768/hybrid-search-and-indexing-words-and-token-metadata-in-solr

Tags

solr

metadata

token