How to translate SQL queries to cypher in the optimal way?

Question

I am new to Neo4j and am using version 3.0. I have a huge transactional dataset that I converted to a graph model. I need to translate the SQL query below into Cypher.

create table calc_base as
select a.ticket_id ticket_id, b.product_id, b.product_desc,
       a.promotion_flag promo_flag,
       sum(quantity) sum_units,
       sum(sales) sum_sales
from fact a
     inner join dimproduct b on a.product_id = b.product_id    
where store_id in (select store_id from dimstore)
and b.product_id in (select product_id from fact group by 1 order by count(distinct ticket_id) desc limit 5000)
group by 1,2,3,4;

Here is my ER diagram and corresponding graph model. My relationships for this query are:

MATCH (a:PRODUCT) 
MATCH (b:FACT {PRODUCT_ID: a.PRODUCT_ID}) 
CREATE (b)-[:HAS_PRODUCT]->(a);

MATCH (a:STORE) 
MATCH (b:FACT {STORE_ID: a.STORE_ID}) 
CREATE (b)-[:HAS_STORE]->(a);

My Cypher translation for this query is:

PROFILE
MATCH (b:PRODUCT) 
MATCH (a:FACT)
MATCH (c:STORE)
CREATE (d:CALC_BASE {TICKET_ID: a.TICKET_ID, PRODUCT_ID: a.PRODUCT_ID, PRODUCT_DESC: b.PRODUCT_DESC,
        PROMO_FLAG: a.PROMOTION_FLAG, KPI_UNITS: SUM(a.QUANTITY_ABS),  KPI_SALES: SUM(a.SALES_ABS) }) 
    Q = (MATCH (e:FACT)
    WITH count(PRODUCT_ID) AS PRO_ID_NUM , COUNT(DISTINCT TICKET_ID) AS TICKET_ID_NUM
    ORDER BY  TICKET_ID_NUM DESC)
WHERE b.PRODUCT_ID = Q
ORDER BY TICKET_ID, PRODUCT_ID, PRODUCT_DESC, PROMO_FLAG

My main problem is defining GROUP BY and subqueries in Cypher. How can I write this query in Cypher in an optimal way?

Answer 1:

For one, there is no GROUP BY in Cypher, as the grouping columns are implicitly the non-aggregation columns in each row.

I'm assuming you have constraints and indexes set up? You'll need these set up correctly for performant queries.

A major red flag I'm seeing is that there are no relationships at all in these queries and likely in your entire data model. Graph databases are made to model relationships between things, and these tend to replace the concept of foreign keys in relational dbs. I'll speak more on better ways to model your data at the end.

That said, I'll take a stab at translating this with your current data model.

My approach is to go from the inside out. First let's get collections for allowed store_id and b.product_id values.

// first collect allowed STORE_IDs
MATCH (s:STORE)
WITH COLLECT(s.STORE_ID) as STORE_IDs
MATCH (e:FACT)
// now get PRODUCT_IDs with the most associated TICKET_IDs 
// (expressions in WITH must be aliased, and e is out of scope after the aggregation)
WITH STORE_IDs, e.PRODUCT_ID as PRODUCT_ID, COUNT(DISTINCT e.TICKET_ID) as TICKET_ID_CNT
ORDER BY TICKET_ID_CNT DESC
LIMIT 5000
WITH STORE_IDs, COLLECT(PRODUCT_ID) as PRODUCT_IDs
// we now have 1 row with both collections, and will do membership checking with them later
// next get only PRODUCT nodes with PRODUCT_ID in the collection of allowed PRODUCT_IDs
MATCH (b:PRODUCT)
WHERE b.PRODUCT_ID in PRODUCT_IDs
WITH b, STORE_IDs
// now get FACT nodes with STORE_ID in the collection of allowed STORE_IDs
// and associated with PRODUCT nodes by PRODUCT_ID
MATCH (a:FACT)
WHERE a.STORE_ID in STORE_IDs
AND a.PRODUCT_ID = b.PRODUCT_ID
WITH a, b
// grouping is implicit, the non-aggregation columns are the grouping key
WITH a.TICKET_ID as TICKET_ID, b.PRODUCT_ID as PRODUCT_ID, b.PRODUCT_DESC as PRODUCT_DESC, a.PROMOTION_FLAG as PROMOTION_FLAG, SUM(a.QUANTITY) as SUM_UNITS, SUM(a.SALES) as SUM_SALES
CREATE (:CALC_BASE {TICKET_ID:TICKET_ID, PRODUCT_ID:PRODUCT_ID, PRODUCT_DESC:PRODUCT_DESC, PROMO_FLAG:PROMOTION_FLAG, SUM_UNITS:SUM_UNITS, SUM_SALES:SUM_SALES})

That should get you what you want.

And now back to the major problem with all this...you're using a graph db for non-graph data and queries. You're using foreign keys and attempting to join nodes rather than modeling these as relationships. You're also using abbreviated names, which makes it hard to figure out the meaning of your data and how it's supposed to relate to each other.

My advice to you is to rethink your data model, especially on how your data connects together. Look for where you're using foreign key joining, and instead think about how to replace that with relationships between your nodes, complete with the nature of those relationships.

Data modeled in a more graph-oriented way with relationships lends itself to more graph-oriented and performant queries, as well as a data model that is easier to understand and communicate to others.

EDIT

Now that you have relationships between different types of nodes, we can simplify the query a bit.

The approach is similar: we still work from the inside out rather than using an inner subquery (though as of Neo4j 3.1, pattern comprehension can be used like an inner query in various cases).

// first get products with the most tickets (top 5k)
MATCH (f:FACT)
WITH f.PRODUCT_ID as productID, COUNT(DISTINCT f.TICKET_ID) as ticketIDCnt
ORDER BY ticketIDCnt DESC
LIMIT 5000
MATCH (p:PRODUCT)
WHERE p.PRODUCT_ID = productID
WITH p
// with those products, get related facts (graph equivalent of a join)
MATCH (p)<-[:HAS_PRODUCT]-(f:FACT)
// ensure the fact has a related store.
// if ALL facts have a related store, you don't need this WHERE clause
WHERE (f)-[:HAS_STORE]->(:STORE)
WITH f.TICKET_ID as TICKET_ID, p.PRODUCT_ID as PRODUCT_ID, p.PRODUCT_DESC as PRODUCT_DESC, f.PROMOTION_FLAG as PROMOTION_FLAG, SUM(f.QUANTITY) as SUM_UNITS, SUM(f.SALES) as SUM_SALES
CREATE (:CALC_BASE {TICKET_ID:TICKET_ID, PRODUCT_ID:PRODUCT_ID, PRODUCT_DESC:PRODUCT_DESC, PROMO_FLAG:PROMOTION_FLAG, SUM_UNITS:SUM_UNITS, SUM_SALES:SUM_SALES})
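
As a side note on the pattern comprehension mentioned above, here is a minimal sketch of what it can look like. It requires Neo4j 3.1 or later, so it will not run on 3.0. It counts the :FACT nodes attached to each product through HAS_PRODUCT; the alias FACT_COUNT and the LIMIT of 10 are only illustrative, and note that it counts facts rather than distinct tickets.

```cypher
// Requires Neo4j 3.1+ (pattern comprehension is not available in 3.0).
// The comprehension acts like a small inner query evaluated per product:
// it collects one TICKET_ID per matching fact, and size() counts the entries.
MATCH (p:PRODUCT)
RETURN p.PRODUCT_ID AS PRODUCT_ID,
       size([(p)<-[:HAS_PRODUCT]-(f:FACT) | f.TICKET_ID]) AS FACT_COUNT
ORDER BY FACT_COUNT DESC
LIMIT 10
```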

Again, you'll want to make sure there are indexes and unique constraints where appropriate in your data model to speed up your matches.
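
For the indexes and constraints mentioned above, a sketch in Neo4j 3.x syntax might look like this. It assumes PRODUCT_ID is unique per :PRODUCT node and STORE_ID is unique per :STORE node, which I can't verify from the question; if that doesn't hold, use plain indexes instead. Later major versions of Neo4j changed this syntax, so check the manual for your version.

```cypher
// Assumption: one :PRODUCT node per PRODUCT_ID and one :STORE node per STORE_ID.
// A unique constraint also creates an index on the property.
CREATE CONSTRAINT ON (p:PRODUCT) ASSERT p.PRODUCT_ID IS UNIQUE;
CREATE CONSTRAINT ON (s:STORE) ASSERT s.STORE_ID IS UNIQUE;

// Many :FACT nodes share the same PRODUCT_ID / STORE_ID, so use plain indexes here.
CREATE INDEX ON :FACT(PRODUCT_ID);
CREATE INDEX ON :FACT(STORE_ID);
```

Running these before the MATCH ... CREATE statements that build HAS_PRODUCT and HAS_STORE should also help those statements, since they look up :FACT nodes by the same properties.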

There are still several areas where you might want to think about modifying your data model (where it makes sense, of course). There is a concept of ticket IDs, but no :Ticket nodes. You have created :CALC_BASE nodes, but have not related them to :PRODUCT nodes or tickets. In general, it's useful to look for where you're still using the concept of foreign keys, and to see whether it would be better to model these as relationships to other nodes.
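
As a rough illustration of the ticket idea, the following sketch turns the TICKET_ID foreign key into nodes and relationships. The :TICKET label and the :ON_TICKET relationship type are names I made up for the example, not something from your model, and on a huge dataset you would likely need to run this in batches rather than in one transaction.

```cypher
// Hypothetical names: :TICKET and :ON_TICKET are illustrative only.
// MERGE creates each ticket node once, then links every fact to its ticket.
MATCH (f:FACT)
MERGE (t:TICKET {TICKET_ID: f.TICKET_ID})
MERGE (f)-[:ON_TICKET]->(t)
```

An index or unique constraint on :TICKET(TICKET_ID) would be worth adding first, since MERGE looks up the node by that property for every fact.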

And again on GROUP BY, this is handled for you in Cypher. Your rows are made up of non-aggregation columns, and aggregation columns. The non-aggregation columns are automatically used by Cypher as the grouping key (the equivalent of grouping by those columns). Since SUM_UNITS and SUM_SALES are the result of SUM() operations, which are aggregation functions, all the other columns are automatically used as the grouping key.
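
To make the implicit grouping concrete, here is a small sketch using the :FACT nodes and property names from the question. It is roughly what `select ticket_id, promotion_flag, sum(quantity), sum(sales) from fact group by 1, 2` would do in SQL.

```cypher
// TICKET_ID and PROMO_FLAG are not aggregated, so together they form the grouping key.
// Each SUM() is computed once per distinct (TICKET_ID, PROMO_FLAG) pair.
MATCH (f:FACT)
RETURN f.TICKET_ID AS TICKET_ID,
       f.PROMOTION_FLAG AS PROMO_FLAG,
       SUM(f.QUANTITY) AS SUM_UNITS,
       SUM(f.SALES) AS SUM_SALES
```

Dropping PROMO_FLAG from the RETURN would change the grouping to one row per ticket, with no other change to the query.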

Source: https://stackoverflow.com/questions/41005125/how-to-translate-sql-queries-to-cypher-in-the-optimal-way

Tags

postgresql

neo4j

cypher