Querying with “contains” on a list of user defined type (UDT)

Question

For a data model like:

create type city (
   name text,
   code int
);

create table user (
    id uuid,
    name text,
    cities list<FROZEN<city>>,
    primary key ( id )
);

create index user_city_index on user(cities);

Querying as

select id, cities from user where cities contains {name:'My City', code: 10};

works fine. But is it possible to query

select id, cities from user where cities contains {name:'My City'};

and ignore the code attribute, i.e. code=<any>?

Can this be achieved using Spark?

Answer 1:

"But is it possible to query: select id, cities from user where cities contains {name:'My City'};"

No, it is not. The documentation on using a UDT states (for an example UDT column called name):

Filter data on a column of a user-defined type. Create an index and then run a conditional query. In Cassandra 2.1.x, you need to list all components of the name column in the WHERE clause.

So querying your cities UDT collection will require all components of the city type.

I'm sure there's a way to query this in Spark, but I'll give you a Cassandra-based answer. Basically, create an additional list column, defined and indexed just to hold the list of city names, and run your CONTAINS on that. Even better would be to denormalize the city type into a query table (usersbycity) with a PRIMARY KEY definition like PRIMARY KEY(cityname, citycode, userid), and use that in addition to your user table to support queries by city name and code (or just city name).
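As a rough sketch of those two options, assuming Cassandra 2.1 or later: the column name city_names, the index name user_city_names_index, and the username column are placeholders of mine, not part of the original model.

```cql
-- Option 1: a parallel, indexed list holding only the city names.
-- city_names and user_city_names_index are hypothetical names.
ALTER TABLE user ADD city_names list<text>;
CREATE INDEX user_city_names_index ON user (city_names);

-- The application has to keep city_names in step with cities on every write.
SELECT id, cities FROM user WHERE city_names CONTAINS 'My City';

-- Option 2: a query table partitioned by city name.
-- username is an assumed extra column, so one read returns what is needed.
CREATE TABLE usersbycity (
    cityname text,
    citycode int,
    userid uuid,
    username text,
    PRIMARY KEY (cityname, citycode, userid)
);

-- By city name only (code = <any>):
SELECT userid, username FROM usersbycity WHERE cityname = 'My City';

-- By city name and code:
SELECT userid, username FROM usersbycity
 WHERE cityname = 'My City' AND citycode = 10;
```

With option 2, the application writes one usersbycity row per (user, city) pair whenever a user's cities change, in addition to the write to the user table.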

Remember, Cassandra works best when the tables are specifically designed to suit your query patterns. Secondary indexes are meant for convenience, not performance. Trying to augment one table to support multiple queries is an RDBMS approach to data modeling (which typically doesn't work well in Cassandra). Instead of one table that serves one query well, you usually end up with one table that serves multiple queries poorly.

Edit for your questions:

1) "Is it acceptable to have long clustering keys?"

I cannot find a definitive statement on this at the moment, but I think the bigger issue is how clustering keys are stored and used "under the hood." Essentially, each clustering key value is stored along with every column value (for quicker retrieval). If you have a lot of them, that is going to eat disk space (not too big a concern these days; if it is, you can counter it with the COMPACT STORAGE directive).

If you have many of them, it may eventually impact performance. I can double-check on this one and get back to you. I wouldn't go with, say, 100 clustering keys, but I don't think 10 is a big deal. I've created models using 7 or 8, and they perform just fine.

2) "If there are other denormalized tables (like usersbyhobby, usersbybookread etc.) related to user, how can I combine filtering from these tables to filters from usersbycity into one query, since there is no JOINs in c*?"

You cannot combine them at query time. If you find that you have a query that needs data from usersbyhobby, usersbybookread, and usersbycity all at once, what you can do is create a denormalized table containing all of that data. Depending on your query needs, you may need to order the PRIMARY KEY in different ways, in which case you would create as many tables as you have specific queries to serve.
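For illustration only, a combined query table for "users in a given city with a given hobby" might look like the sketch below. The table name usersbycityandhobby and the hobby and username columns are hypothetical, since the question does not show those schemas.

```cql
-- Hypothetical query table serving exactly one query:
-- "which users live in city X and have hobby Y?"
CREATE TABLE usersbycityandhobby (
    cityname text,
    hobby text,
    userid uuid,
    username text,
    PRIMARY KEY ((cityname, hobby), userid)
);

-- Both partition key columns must be supplied.
SELECT userid, username FROM usersbycityandhobby
 WHERE cityname = 'My City' AND hobby = 'chess';
```

A query that filters by hobby alone would need its own table with hobby as the partition key, which is the "as many tables as queries" trade-off described above.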

The other alternative would be to make individual queries and combine the results client-side. Client-side JOINs are considered a Cassandra anti-pattern, so I would use that with caution. It all depends on the needs of your application, and on whether you want to spend the majority of your time on data modeling and management or on client-side processing. Honestly, I prefer to keep the client side as simple as I can.
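A sketch of what the client-side approach involves, using assumed schemas (the question does not show usersbyhobby, so its layout here is a guess):

```cql
-- Assumed schemas (hypothetical, not from the question):
--   usersbycity   PRIMARY KEY (cityname, citycode, userid)
--   usersbyhobby  PRIMARY KEY (hobby, userid)

-- Query 1: users in a city.
SELECT userid FROM usersbycity WHERE cityname = 'My City';

-- Query 2: users with a hobby.
SELECT userid FROM usersbyhobby WHERE hobby = 'chess';

-- The application then intersects the two userid result sets itself.
```

Each query is a cheap single-partition read, but the intersection cost and the extra round trip both land on the client.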

Source: https://stackoverflow.com/questions/29012661/querying-with-contains-on-a-list-of-user-defined-type-udt

Tags

apache-spark

cassandra

nosql