Neo4j regex string matching not returning expected results

Submitted by 烈酒焚心 on 2019-12-07 19:51:33

Question


I'm trying to use regex matching in Cypher on Neo4j 2.1.5 and am running into problems.

I need to implement a full text search on specific fields that a user has access to. The access requirement is key and is what prevents me from just dumping everything into a Lucene instance and querying that way. The access system is dynamic and so I need to query for the set of nodes that a particular user has access to and then within those nodes perform the search. I would really like to match the set of nodes against a Lucene query, but I can't figure out how to do that so I'm just using basic regex matching for now. My problem is that Neo4j doesn't always return the expected results.

For example, I have about 200 nodes with one of them being the following:

( i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})

This query produces one result:

MATCH (p)-->(:group)-->(i:node)
  WHERE (i.name =~ "(?i).*mosaic.*")
  RETURN i

> Returned 1 row in 569 ms

But this query produces zero results even though the description property matches the expression:

MATCH (p)-->(:group)-->(i:node)
  WHERE (i.description=~ "(?i).*mosaic.*")
  RETURN i

> Returned 0 rows in 601 ms

And this query also produces zero results even though it includes the name property which returned results previously:

MATCH (p)-->(:group)-->(i:node)
  WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
  WHERE (searchText =~ "(?i).*mosaic.*")
  RETURN i

> Returned 0 rows in 487 ms

MATCH (p)-->(:group)-->(i:node)
  WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
  RETURN searchText

>
...
SotoLinear Glass Mosaic Tiles Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
...

Even more odd, if I search for a different term, it returns all of the expected results without a problem.

MATCH (p)-->(:group)-->(i:node)
  WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
  WHERE (searchText =~ "(?i).*plumbing.*")
  RETURN i

> Returned 8 rows in 522 ms

I then tried to cache the search text on the nodes and I added an index to see if that would change anything, but it still didn't produce any results.

CREATE INDEX ON :node(searchText)

MATCH (p)-->(:group)-->(i:node)
  WHERE (i.searchText =~ "(?i).*mosaic.*")
  RETURN i

> Returned 0 rows in 3182 ms

I then tried to simplify the data to reproduce the problem, but in this simple case it works as expected:

MERGE (i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})

WITH i, (
  i.name + " " + COALESCE(i.description, "")
) AS searchText

WHERE searchText =~ "(?i).*mosaic.*"
RETURN i

> Returned 1 rows in 630 ms

I tried using the CYPHER 2.1.EXPERIMENTAL tag as well but that didn't change any of the results. Am I making incorrect assumptions on how the regex support works? Is there something else I should try or some other way to debug the problem?

Additional information

Here is a sample call that I make to the Cypher Transactional Rest API when creating my nodes. This is the actual plain text that is sent (other than some formatting for easier reading) when adding nodes to the database. Any string encoding is just standard URL encoding that is performed by Go when creating a new HTTP request.

{"statements":[
    {
    "parameters":
        {
        "p01":"lsF30nP7TsyFh",
        "p02":
            {
            "description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
            "id":"lsF3BxzFdn0kj",
            "name":"Linear Glass Mosaic Tiles",
            "object":"material"
            }
        },
    "resultDataContents":["row"],
    "statement":
        "MATCH (p:project { id: { p01 } })
        WITH p

        CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material  { p02 })"
    }
]}

If it is an encoding issue, why does a search on name work, while a search on description and a search on name + description do not? Is there any way to examine the database to see if and how the data was encoded? When I perform searches, the text returned appears correct.


Answer 1:


Just a few notes:

  • You should probably replace CREATE UNIQUE with MERGE (which works a bit differently).
  • For your full-text search I would go with the Lucene legacy index for performance, if your group restriction is not limiting enough to keep the response below a few ms.

I just tried your exact JSON statement, and it works perfectly.

Inserted with:

curl -H accept:application/json -H content-type:application/json -d @insert.json \
     -XPOST http://localhost:7474/db/data/transaction/commit

json:

{"statements":[
    {
    "parameters":
        {
        "p01":"lsF30nP7TsyFh",
        "p02":
            {
            "description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
            "id":"lsF3BxzFdn0kj",
            "name":"Linear Glass Mosaic Tiles",
            "object":"material"
            }
        },
    "resultDataContents":["row"],
    "statement":
        "MERGE (p:project { id: { p01 } })
        WITH p

        CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material  { p02 }) RETURN m"
    }
]}

queried:

MATCH (p)-->(:group)-->(i:material)
 WHERE (i.description=~ "(?i).*mosaic.*")
 RETURN i

returns:

name:   Linear Glass Mosaic Tiles
id: lsF3BxzFdn0kj
description:    Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
object: material

To check your data, you can look at the JSON or CSV dumps that the browser offers (the little download icons on the result and table-result views).

Or you can use neo4j-shell with my shell-import-tools to output CSV or GraphML and check those files.

Or use a bit of Java (or Groovy) code to check your data.
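As a rough illustration of that last option, here is a minimal Java sketch that makes invisible characters in a string visible. It does not talk to Neo4j at all: the class name HiddenCharDump, the helper reveal, and the sample value are hypothetical, and the idea is that you paste in a property value taken from one of the dumps. That hidden line breaks are the cause here is only an assumption.

```java
class HiddenCharDump {
    // Replace line breaks, tabs and any other non-printable or non-ASCII
    // character with a visible marker so they can be spotted in the output.
    static String reveal(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                out.append("\\n");
            } else if (c == '\r') {
                out.append("\\r");
            } else if (c == '\t') {
                out.append("\\t");
            } else if (c < 0x20 || c > 0x7e) {
                // Anything else outside printable ASCII is shown as a code point.
                out.append(String.format("<U+%04X>", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample; replace it with a value exported from your database.
        String description = "linear glass mosaic tiles.\r\nSOLD OUT: Back in stock";
        System.out.println(reveal(description));
    }
}
```

If the output shows \n or \r inside the description, the stored text spans several lines even though it looks like a single paragraph in the browser.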

There is also the consistency checker that comes with the neo4j-enterprise download. Here is a blog post on how to run it.

java -cp 'lib/*:system/lib/*' org.neo4j.consistency.ConsistencyCheckTool /tmp/foo

I added a Groovy test script here: https://gist.github.com/jexp/5a183c3501869ee63d30

One more idea: regexp flags

Sometimes line breaks in the text are the problem. Cypher uses Java regular expressions, which have two more flags that are relevant here:

  • multiline (?m), which makes ^ and $ match at the start and end of every line, and
  • dotall (?s), which allows the dot to also match line terminators such as newlines

So could you try (?ism).*mosaic.*
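The effect of these flags can be checked outside Neo4j with plain Java, since Cypher's =~ uses Java regular expressions and has to match the whole string. The sketch below is only an illustration: the class name RegexFlagDemo, the helper fullMatch, and the shortened sample strings are made up, and it assumes that the stored description contains a line break.

```java
import java.util.regex.Pattern;

class RegexFlagDemo {
    // Pattern.matches requires the regex to match the entire input,
    // which is also how the =~ operator behaves.
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Hypothetical, shortened versions of the description property.
        String oneLine = "Linear glass mosaic tiles. SOLD OUT.";
        String twoLines = "Linear glass mosaic tiles.\nSOLD OUT.";

        // Without a line break, the original pattern matches.
        System.out.println(fullMatch("(?i).*mosaic.*", oneLine));   // true

        // With a line break, the dot stops at the newline, so the
        // pattern can no longer cover the whole string.
        System.out.println(fullMatch("(?i).*mosaic.*", twoLines));  // false

        // (?m) only changes ^ and $, so on its own it does not help here.
        System.out.println(fullMatch("(?im).*mosaic.*", twoLines)); // false

        // (?s) lets the dot match the newline as well.
        System.out.println(fullMatch("(?is).*mosaic.*", twoLines)); // true
    }
}
```

If the real data behaves like twoLines, it is the (?s) flag that makes the difference; (?m) is harmless for this pattern but not required.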



Source: https://stackoverflow.com/questions/26571379/neo4j-regex-string-matching-not-returning-expected-results
