Indexing Json Object Arrays in Lucene.NET

Question

I am working on putting arbitrary JSON objects into a Lucene.NET index. Take an object that might look like this:

{
  name: "Tony",
  age: 40,
  address: {
     street: "Weakroad",
     number: 10,
     floor: 2,
     door: "Left"
  },
  skills: [ 
    { name: ".NET", level: 5, experience: 12 },
    { name: "JavaScript", level: 3, experience: 6 },
    { name: "HTML5", level: 4, experience: 6 },
    { name: "Lucene.NET", level: 1, experience: 12 },
    { name: "C#", level: 10, experience: 12 }
  ],
  aliases: [ "Bucks", "SirTalk", "BeemerBoy" ]
}

That would produce the following fields:

"name": "Tony"
"age": "40"
"address.street": "Weakroad"
"address.number": "10"
"address.floor": "2"
"address.door": "Left"
"skills": ???
"aliases": "Bucks SirTalk BeemerBoy" //should turn into 3 tokens.
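The mapping above can be sketched in a few lines. This is only an illustration in plain Python, not Lucene.NET code; the `flatten` helper and the choice to leave arrays of objects as `None` are assumptions made for the example.

```python
def flatten(obj, prefix=""):
    """Flatten nested objects into dotted field names with string values."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted path.
            fields.update(flatten(value, path))
        elif isinstance(value, list):
            if all(not isinstance(v, (dict, list)) for v in value):
                # Array of plain values: join so an analyzer can tokenize it.
                fields[path] = " ".join(str(v) for v in value)
            else:
                # Array of objects: the open question, left unresolved here.
                fields[path] = None
        else:
            fields[path] = str(value)
    return fields


person = {
    "name": "Tony",
    "age": 40,
    "address": {"street": "Weakroad", "number": 10, "floor": 2, "door": "Left"},
    "skills": [{"name": ".NET", "level": 5, "experience": 12}],
    "aliases": ["Bucks", "SirTalk", "BeemerBoy"],
}
fields = flatten(person)
```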

As you may have noticed, skills has a ???, because right now I am not sure how to deal with it, or whether there even is any "meaningful-generic" way to do it.

Here are some options I have been able to think about:

1) Concatenation: But then, AFAIK, I will lose the ability to do more advanced queries against Lucene, like finding persons with .NET skills above level 4.

For clarification, concatenation could be something like:

"skills": ".NET, JavaScript, HTML5, Lucene.NET, C#"

This discards numbers, as they wouldn't make much sense in this case. If additional properties on a child object were strings, they would have been gathered as well. An alternative would be to concatenate each field independently:

"skills.name": ".NET, JavaScript, HTML5, Lucene.NET, C#"
"skills.level": "5, 3, 4, 1, 10"
"skills.experience": "12, 6, 6, 12, 12"

Again, numbers don't make all that much sense here, but I added them just to provide an example.
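A rough sketch of the per-field concatenation, in plain Python rather than Lucene.NET. The `concat_fields` helper is a made-up name for illustration; it shows that each property ends up in its own string, so the pairing between a name and its level is lost.

```python
skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]


def concat_fields(items, prefix):
    """Collect each property across all array entries into one joined string."""
    columns = {}
    for item in items:
        for key, value in item.items():
            columns.setdefault(f"{prefix}.{key}", []).append(str(value))
    return {name: ", ".join(values) for name, values in columns.items()}


fields = concat_fields(skills, "skills")
```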

2) Linked Documents: Creating a new document per array entry with a back reference to this document. This might work, but without newer features such as Nested Documents and BlockJoinQuery, which haven't been ported to the .NET version yet, it sounds messy and like it would tank performance. It would also kill the usefulness of document scoring, though I think that might be less of an issue.

Basically, a child document would contain a stored field acting as a foreign key. Whenever a search found that document, we would pick up the referenced document instead.

So if we illustrate documents they would be:

//Primary Document - ContentType: Person
"$id": 1
"$doctype": Primary
"name": "Tony"
...etc
"skills": [ 2, 3 ] //Just a stored field for retrieving data

//Child Document - ContentType: Skill
"$id": 2
"$ref": 1
"$doctype": Secondary
"name": ".NET"
"level": 5
"experience": 12

//Child Document - ContentType: Skill
"$id": 3
"$ref": 1
"$doctype": Secondary
"name": "JavaScript"
"level": 3
"experience": 6

etc.

I have added some meta fields ($id, $ref, $doctype) in the example above.
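To make the linked-document idea concrete, here is a minimal in-memory sketch in Python. It is not Lucene.NET code: `to_linked_documents` and `search_skill` are hypothetical helpers, and the "search" is a plain loop standing in for an index query followed by a lookup of the referenced primary document.

```python
from itertools import count


def to_linked_documents(person, ids):
    """Split a person into one primary document plus one child per skill."""
    primary_id = next(ids)
    primary = {"$id": primary_id, "$doctype": "Primary"}
    primary.update({k: v for k, v in person.items() if k != "skills"})
    children = []
    for skill in person.get("skills", []):
        # "$ref" is the back reference (foreign key) to the primary document.
        child = {"$id": next(ids), "$ref": primary_id, "$doctype": "Secondary"}
        child.update(skill)
        children.append(child)
    primary["skills"] = [c["$id"] for c in children]
    return [primary] + children


def search_skill(docs, name, min_level):
    """Match child documents, then return the primary documents they point to."""
    by_id = {d["$id"]: d for d in docs}
    hits = []
    for d in docs:
        if d["$doctype"] == "Secondary" and d["name"] == name and d["level"] >= min_level:
            parent = by_id[d["$ref"]]
            if parent not in hits:
                hits.append(parent)
    return hits


tony = {
    "name": "Tony",
    "skills": [
        {"name": ".NET", "level": 5, "experience": 12},
        {"name": "JavaScript", "level": 3, "experience": 6},
    ],
}
docs = to_linked_documents(tony, count(1))
```

Because the name and level live in the same child document, a combined condition on both is evaluated per skill, at the cost of an extra lookup per hit.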

3) A third option I have found since is to index the properties as multiple fields with the same name, so the above example would then result in:

// index: 0
"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
// index: 1
"skills.name": "JavaScript"
"skills.level": 3
"skills.experience": 6
// index: 2
"skills.name": "HTML5"
"skills.level": 4
"skills.experience": 6
// index: 3
"skills.name": "Lucene.NET"
"skills.level": 1
"skills.experience": 12
// index: 4    
"skills.name": "C#"
"skills.level": 10
"skills.experience": 12

This is supported by Lucene.NET, yet it still does not let me query like [skills.name: ".NET" AND skills.level: [3 TO 5]], because nothing ties a name value to the level value from the same array entry.
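The effect of repeating field names can be sketched without Lucene at all. In the Python illustration below, a document is modelled as a list of (field, value) pairs; `to_multivalued_fields` and `values_of` are names invented for this example. The last lines show how a name condition and a level condition can each be satisfied by different array entries.

```python
skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
    {"name": "HTML5", "level": 4, "experience": 6},
    {"name": "Lucene.NET", "level": 1, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]


def to_multivalued_fields(items, prefix):
    """Emit one (field, value) pair per property, repeating field names."""
    fields = []
    for item in items:
        for key, value in item.items():
            fields.append((f"{prefix}.{key}", value))
    return fields


def values_of(fields, name):
    """All values stored under one repeated field name."""
    return [value for field, value in fields if field == name]


fields = to_multivalued_fields(skills, "skills")
names = values_of(fields, "skills.name")
levels = values_of(fields, "skills.level")

# Each condition is checked against the whole document, not per array entry,
# so "Lucene.NET" plus "level 3 to 5" matches although Lucene.NET is level 1.
false_hit = "Lucene.NET" in names and any(3 <= level <= 5 for level in levels)
```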

But since this does allow me to search in the fields separately, I might be able to solve the other issue in another way by:

a) adding an extra combined field.
b) doing post-validation of the results in a collector.
c) a combination of the above.

It all depends on the data. Relying only on post-validation like the above would obviously perform badly, as I am likely to get a lot of false hits. It would still filter out people without .NET skills, however, which is a good thing.

But at least I am a step closer so far, I think.

Taking the scenario above, we can now have (shortened greatly):

[{
  name: "Tony",
  skills: [ 
    { name: ".NET",       level: 1 },
    { name: "JavaScript", level: 3 },
    { name: "HTML5",      level: 5 }
  ]
 },
 {
  name: "Peter",
  skills: [ 
    { name: ".NET",       level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Lucene.NET", level: 1 }
  ]
 },
 {
  name: "Marilyn",
  skills: [ 
    { name: "JavaScript", level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Node",       level: 1 }
  ]
 }]

What we get is 3 documents with duplicate fields for skills.name and skills.level, and that's fine. I can actually search for { skills.name: 'JavaScript', skills.level: [1 TO 5] }, which correctly returns Marilyn and Tony.

But if I search for { skills.name: 'JavaScript', skills.level: [4 TO 5] }, I obviously still get both of them with this way of structuring the document, where I should only have gotten Marilyn as a result.

Hence the need for post-filtering that will reject Tony as an actual match.
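The post-filtering step could look roughly like the Python sketch below. It is an assumption-laden stand-in: `candidate_match` imitates what the index with duplicate fields would return, and `exact_match` re-checks the original JSON per array entry; neither is a Lucene.NET collector API.

```python
people = [
    {"name": "Tony", "skills": [
        {"name": ".NET", "level": 1},
        {"name": "JavaScript", "level": 3},
        {"name": "HTML5", "level": 5}]},
    {"name": "Peter", "skills": [
        {"name": ".NET", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Lucene.NET", "level": 1}]},
    {"name": "Marilyn", "skills": [
        {"name": "JavaScript", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Node", "level": 1}]},
]


def candidate_match(person, skill, lo, hi):
    """Imitates the index: name and level are matched independently."""
    names = [s["name"] for s in person["skills"]]
    levels = [s["level"] for s in person["skills"]]
    return skill in names and any(lo <= level <= hi for level in levels)


def exact_match(person, skill, lo, hi):
    """Post-validation: both conditions must hold on the same array entry."""
    return any(s["name"] == skill and lo <= s["level"] <= hi
               for s in person["skills"])


candidates = [p for p in people if candidate_match(p, "JavaScript", 4, 5)]
confirmed = [p["name"] for p in candidates if exact_match(p, "JavaScript", 4, 5)]
candidate_names = [p["name"] for p in candidates]
```

Here the index-style match yields Tony and Marilyn, and the second pass keeps only Marilyn. Peter never reaches the post-filter, which is the part of the work the index still does.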

Answer 1:

For now I ended up accepting the limitations of Solution 3. The rationale is that if data needs to be queried in that way, it should be structured differently in the index (in line with Solution 2).

But I have chosen to move that decision outside of a possible framework handling this. As a result I have created https://github.com/dotJEM/json-index

Answer 2:

Adding on to Option 3, you could try indexing "skills" separately, i.e. something like this:

"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
"skills": "name .NET level 5 experience 12"

This way you can do a query like this:

skills: ("name .NET" AND "level 5" AND "experience 12")
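As a rough illustration of the combined field, the Python sketch below builds one combined string per skill and checks phrases token by token. The helpers `combined_field` and `contains_phrase` are hypothetical, and the whitespace tokenization is an assumption; a real analyzer may split values such as ".NET" differently. Note that this sketch only covers exact values, not ranges.

```python
def combined_field(skill):
    """Join each property name with its value, e.g. 'name .NET level 5'."""
    return " ".join(f"{key} {value}" for key, value in skill.items())


def contains_phrase(text, phrase):
    """True if the phrase's tokens appear consecutively in the text."""
    tokens, wanted = text.split(), phrase.split()
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))


skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "C#", "level": 10, "experience": 12},
]
combined = [combined_field(s) for s in skills]


def matches(phrases):
    """All phrases must be found within the same combined value."""
    return any(all(contains_phrase(value, p) for p in phrases)
               for value in combined)
```

For example, `matches(["name .NET", "level 5", "experience 12"])` holds for the first skill, while `matches(["name .NET", "level 10"])` does not, because those two phrases come from different entries.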

Source: https://stackoverflow.com/questions/22465256/indexing-json-object-arrays-in-lucene-net

Tags

arrays

lucene.net