问题
We have cassandra column family. each row have multiple columns. columns have name, but value is empty. if we have 5-10 row keys, how we can find column names that appear in all of these keys. e.g.
row1: php, programming, accounting
row2: php, bookkeeping, accounting
row3: php, accounting
must return:
result: php, accounting
note we can not easily load whole row into the memory, because it may contain 1M+ columns solution not need to be fast.
回答1:
In order to do intersection of several rows, we will need to intersect two of them first, then to intersect the result with third and so on.
Looks like in cassandra we can query the data by column names and this is relatively fast operation.
So we first get Column Slice of 10k rows. Making list of column names (in PHP Cassa - put them in array). Then select those from second row.
Code may be looking like this:
$x = $cf->get($first_key, <some column slice>);
$column_names = array();
foreach(array_keys($x) as $k)
$column_names[] = $k;
$result = $cf->get($second_key, $column_slice = null, $column_names);
// write result somewhere, and proceed with next slice
回答2:
You columns names are sorted and you can create an iterator for each row (this iterator load portion of date at once, for example 10k of columns). Now put each iterator into a priority queue (by the next column name). If you take for queue the k times the iterator with the same column names, this is common names between all rows, in the other case we move to the next element and return iterators to queue.
回答3:
You could use a Hadoop map/reduce job as follows:
Map output key = column name
Map output value = row key
Reducer counts row keys for each column and outputs column name & count to a CF with the following schema:
key : [column name] { Count : [count] }
You can then query counts from this CF in reverse order. The first record will be the max, so you can keep iterating until a value is < max. This will be your intersection.
来源:https://stackoverflow.com/questions/11749846/intersect-cassandra-rows