How to return max counts per another node's properties

Question

I need to calculate how many times a composer's pieces of music were performed per decade, then return only the one piece with the most performances per decade.

This Cypher query does everything except filter out all but the highest count per decade.

MATCH (c:Composer)-[:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*' 
WITH w.title AS Title, prog.title AS Program, LEFT(prog.date, 3)+"0" AS Decade
RETURN Decade, Title, COUNT(Program) AS Total
ORDER BY Decade, Total DESC, Title

I've been banging my head for hours with variations on this but can't find the solution.

Answer 1:

This seems to return what you're looking for but it can probably be improved.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
RETURN Decade, HEAD(COLLECT(Total)) AS Total, HEAD(COLLECT(Title)) AS Title
ORDER BY Decade

It only returns one result from each decade but doesn't take ties into account, so it feels a little incomplete to me. I'll think about how to do that and edit if I come up with something good.

I used this string with http://graphgen.neoxygen.io to generate sample data locally.

(c:Composer {firstname: firstName, lastname: lastName} *10)<-[:CREATED_BY *n..1]-(w:Work {title: progLanguage} *75)<-[:PERFORMED *n..1]-(prog:Program {title: catchPhrase, date: date} *400)

VICTORY EDIT

This is the raw version of the above query that will show multiple Works when there are ties.

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title IN COLLECT(Title) WHERE Total = PerformedTotal] AS Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade

I feel like it should be possible to refactor it but I can't seem to simplify it.
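One possible simplification, offered as an untested sketch rather than a drop-in replacement, is to aggregate twice. The first WITH counts performances per decade and title exactly as above. The second takes MAX(Total) per decade and keeps every title/count pair in a list of maps, so the RETURN can filter that list down to the titles that reach the maximum. The names MaxTotal and Rows are my own, and the sketch assumes your Cypher version supports list comprehensions with both a WHERE and a | projection. Unlike the HEAD(COLLECT(...)) versions, it does not rely on the rows arriving in sorted order.

```cypher
MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
// Step 1: performances per (decade, title), same as the queries above
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
// Step 2: per decade, find the highest count and keep every title/count pair
WITH Decade, MAX(Total) AS MaxTotal, COLLECT({title: Title, total: Total}) AS Rows
// Step 3: keep only the titles whose count equals that decade's maximum
RETURN Decade, MaxTotal AS Total, [row IN Rows WHERE row.total = MaxTotal | row.title] AS Titles
ORDER BY Decade
```

Titles comes back as a list, so a decade with a tie shows every tied work and a decade without one shows a single-element list.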

I have a ton of notes about the process of writing this answer. Even if it's not exactly what you're looking for, here's the TL;DR, because it was still interesting.

Get rid of that fuzzy search if you can: find a way to index that property, or use an external index like Elasticsearch. You take a massive performance hit when you use that regex.
There's a bug in Neo4j 2.2.M02 that makes the query crash if <-[*..2]- is changed to practically anything else. If you set the Cypher Query Planner to Cypher 2.1, performance is best if that very first line is MATCH (c:Composer)-[r:CREATED_BY]-(w)<-[r2:REL_TYPE]-(prog). Only use a label on that first node to help the WHERE do its job. Always always always use node and rel identifiers.
Cypher has some surprising behavior. That whole [title in COLLECT(Title) WHERE Total = PerformedTotal] is using variables from later in that same line. If I pull them out, it crashes.

More surprising behavior was that it hasn't been possible to refactor the way I'd expect. I'd expect to do this but can't:

MATCH (c:Composer)-[r:CREATED_BY]-(w:Work)<-[*..2]-(prog:Program)
WHERE c.lastname =~ '(?i).*stravinsky.*'
WITH LEFT(prog.date, 3)+"0" AS Decade, w.title AS Title, COUNT(prog.title) AS Total
ORDER BY Decade, Total DESC, Title
WITH Decade, [title IN COLLECT(Title) WHERE Total = HEAD(COLLECT(Total))] AS Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade

ANOTHER EDIT: HOW TO POSSIBLY SPEED IT UP

If your query may take a few potential paths but you want to avoid [*..2], you may be able to speed things up a bit by spelling out the specific paths it should try when looking for a match. Whether this is faster depends on how many dead-end branches the variable-length match would otherwise explore. If you can give it just two or three paths so that it can completely ignore half a dozen other relationships, the savings will probably offset the extra filtering and unwinding that happen later on. Of course, if the paths are complicated enough, this might be more trouble than it's worth.

Paste this into neo4j-shell, prepend PROFILE, add a semicolon to the end, and compare the number of database accesses to determine which version is best for your dataset.

MATCH (c:Composer)-[r:CREATED_BY]-(w)
WHERE c.lastname =~ '(?i).*Denesik.*'
OPTIONAL MATCH (w)-[r2:CONNECTED_TO]-(this_node)<-[r3:ONE_MORE]-(prog1)
OPTIONAL MATCH (w)<-[r4:PERFORMED]-(prog2)
OPTIONAL MATCH (w)-[r5:THIS_REL]->(this_node)-[r6:AGAIN_WITH_THE_RELS]->(prog3)
WITH FILTER(program in [prog1, prog2, prog3] WHERE program IS NOT NULL) AS progarray, w.title AS Title
UNWIND(progarray) as prog
WITH LEFT(prog.date, 3)+"0" AS Decade, COUNT(prog.title) AS Total, Title
ORDER BY Decade, Total DESC, Title
WITH Decade, Title, Total, HEAD(COLLECT(Total)) AS PerformedTotal
WITH Decade, [title IN COLLECT(Title) WHERE Total = PerformedTotal] AS Title, Total, PerformedTotal
ORDER BY PerformedTotal DESC
RETURN Decade, HEAD(COLLECT(PerformedTotal)) AS Totals, HEAD(COLLECT(Title)) AS Titles
ORDER BY Decade;

The trickiest part of this is that if we reuse the prog variable, it's going to drag the results from each OPTIONAL MATCH into the next one, essentially trying to filter, and we won't get completely separate paths. (Why we're able to reuse w is sort of beyond me right now...) That's OK, though. We take the results, put them into an array, filter the empty results, then unwind it back to a single variable containing all the valid results. After that, we continue as normal.

In my tests, it seems like this can be significantly faster with the right dataset. YMMV.
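A variation on the collect-and-unwind step, sketched here under assumptions and not profiled: collect each path's programs into its own list right after its OPTIONAL MATCH, then concatenate the lists and unwind. COLLECT skips nulls, so no separate null filter is needed (FILTER itself was removed in Neo4j 4.0 in favour of list comprehensions). Collecting before the next OPTIONAL MATCH also keeps one path's matches from being multiplied by another's. The relationship types are the same placeholders as in the query above, and direct and indirect are names I made up for the two lists.

```cypher
MATCH (c:Composer)-[r:CREATED_BY]-(w)
WHERE c.lastname =~ '(?i).*Denesik.*'
OPTIONAL MATCH (w)<-[r4:PERFORMED]-(prog2)
// COLLECT skips nulls, so a work with no direct programs yields an empty list
WITH w, COLLECT(DISTINCT prog2) AS direct
OPTIONAL MATCH (w)-[r2:CONNECTED_TO]-(this_node)<-[r3:ONE_MORE]-(prog1)
WITH w, direct, COLLECT(DISTINCT prog1) AS indirect
// Concatenate the per-path lists and flatten back to one row per program
UNWIND direct + indirect AS prog
WITH LEFT(prog.date, 3)+"0" AS Decade, COUNT(prog.title) AS Total, w.title AS Title
RETURN Decade, Title, Total
ORDER BY Decade, Total DESC, Title
```

A program reachable through both paths is still counted twice here, so this only makes sense if the paths are meant to be disjoint. The tie-handling steps from the earlier queries can replace the final RETURN unchanged.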

Source: https://stackoverflow.com/questions/27850600/how-to-return-max-counts-per-another-nodes-properties

Tags

neo4j

cypher