Cassandra group by and filter results

Question

I'm trying to mimic a filtered GROUP BY. Given a table test:

CREATE TABLE myspace.test (
item_id text,
sub_id text,
quantity bigint,
status text,
PRIMARY KEY (item_id, sub_id)
);

In SQL, we could do:

select * from (select item_id, sum(quantity) as quan 
               from test where status <> 'somevalue'
               group by item_id) sub 
where sub.quan >= 10;

i.e. group by item_id and then filter out the groups whose total is less than 10.

Cassandra is not designed for this kind of query, though I could mimic GROUP BY with a user-defined aggregate function:

CREATE FUNCTION group_sum_state
   (state map<text, bigint>, item_id text, val bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$Long current = (Long)state.get(item_id); 
if(current == null) current = 0L; 
state.put(item_id, current + val); return state;$$;

CREATE AGGREGATE group_sum(text, bigint)
SFUNC group_sum_state
STYPE map<text, bigint>
INITCOND {};

And use it as a group by (this will probably perform very badly, but still):

cqlsh:myspace> select group_sum(item_id, quantity) from test;

mysales_data.group_sum(item_id, quantity)
-------------------------------------------
     {'123': 33, '456': 14, '789': 15}

But it seems to be impossible to filter by map values, either with a final function for the aggregate or with a separate function. I could define a function like this:

CREATE FUNCTION myspace.filter_group_sum
                (group map<text, bigint>, vallimit bigint)
CALLED ON NULL INPUT
RETURNS map<text, bigint>
LANGUAGE java
AS $$
java.util.Iterator<java.util.Map.Entry<String, Long>> entries = 
               group.entrySet().iterator(); 
while(entries.hasNext()) { 
    Long val = entries.next().getValue(); 
    if (val < vallimit) 
        entries.remove(); 
}
return group;$$;

But there is no way to call it and pass a constant:

select filter_group_sum(group_sum(item_id, quantity), 15) from test;
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] 
message="line 1:54 no viable alternative at input '15' 
(...(group_sum(item_id, quantity), [15]...)">

It complains about the constant 15.

Sorry for the long post; I needed to provide all the details to explain what I need. So my questions are:

Is there a way to pass a constant to a user-defined function in Cassandra? If not, what alternatives do I have for implementing a filtered group by?
More general question: what is the proper data design for Cassandra to cover such a use case for a real-time query-serving application? Say I have a web app that takes the limit from the UI and needs to return all the items whose total quantity is larger than the given limit. The tables are going to be quite large, around 10 billion records.

Answer 1:

Vanilla Cassandra is a poor choice for ad hoc queries. DataStax Enterprise has added some of this functionality via integrations with Spark and Solr. The Spark integration is also open source, but you wouldn't want to do this for low-latency queries. If you need real-time queries, you're going to have to aggregate outside of Cassandra (in Spark or Storm, for example), then write back the aggregates to be consumed by your app. You can also look at Stratio's Lucene integration, which might help you for some of your queries.
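
To make the "aggregate outside of Cassandra" step concrete, here is a minimal Java sketch of the group-then-filter logic. It assumes the rows have already been read into memory; the names Row, sumByItem and atLeast are hypothetical and are not part of Cassandra or any driver. In a real pipeline this logic would more likely run in Spark or Storm, with the totals written back to a table the app can read directly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FilteredGroupSum {
    // Hypothetical in-memory stand-in for one row of myspace.test.
    static final class Row {
        final String itemId;
        final String subId;
        final long quantity;
        final String status;

        Row(String itemId, String subId, long quantity, String status) {
            this.itemId = itemId;
            this.subId = subId;
            this.quantity = quantity;
            this.status = status;
        }
    }

    // Sum quantity per item_id, skipping rows whose status equals excludedStatus.
    static Map<String, Long> sumByItem(List<Row> rows, String excludedStatus) {
        Map<String, Long> totals = new HashMap<>();
        for (Row row : rows) {
            if (!excludedStatus.equals(row.status)) {
                totals.merge(row.itemId, row.quantity, Long::sum);
            }
        }
        return totals;
    }

    // Keep only the items whose total is at least the given limit.
    static Map<String, Long> atLeast(Map<String, Long> totals, long limit) {
        Map<String, Long> kept = new TreeMap<>(); // sorted keys for stable output
        for (Map.Entry<String, Long> entry : totals.entrySet()) {
            if (entry.getValue() >= limit) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("123", "a", 30, "ok"),
            new Row("123", "b", 3, "ok"),
            new Row("456", "a", 14, "ok"),
            new Row("789", "a", 15, "ok"),
            new Row("789", "b", 100, "somevalue")); // excluded by status

        // The limit would come from the UI; 15 is just a sample value.
        System.out.println(atLeast(sumByItem(rows, "somevalue"), 15));
        // prints {123=33, 789=15}
    }
}
```

The point of splitting the work this way is that the expensive part (sumByItem) can be precomputed and stored, while the cheap part (atLeast) is applied per request with whatever limit the user supplies.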

Answer 2:

I ran across your question when looking for information on passing a constant to a user-defined function.

The closest I can get to passing a constant is to pass a static column as the parameter that should carry the constant. If you update the static column before calling the UDF, you can then pass that column to it. This only works if a single client runs such a query at a time, since the static column is visible to all clients. See this answer for an example:

Passing a constant to a UDF

Source: https://stackoverflow.com/questions/31683872/cassandra-group-by-and-filter-results

Tags

group-by

cassandra

filtering

user-defined-functions

data-modeling