Question
I have a table that receives heavy insert and delete traffic, and I need to scan it frequently (by row-key only, no column values).
I noticed that scan latency increases as the amount of data in the table grows. After closer inspection of ScanMetrics, I found that for most of the higher-latency scans, ScanMetrics.countOfRowsFiltered is MUCH higher than the number of rows I actually request (which I specify both via .setLimit() on the Scan and via a PageFilter in the FilterList set on the scan).
What exactly does countOfRowsFiltered represent? In my test environments I can never reproduce a situation where the number of rows scanned exceeds the limit I set, so countOfRowsFiltered is always zero there. In the production environment, however, it is frequently quite high (and, by my calculations, this may be the reason for the gradual increase in overall scan latency).
I can't find any documentation of this metric. Does anyone have experience with it, and how can I minimize it?
I set up my scan as follows:
Scan scan = new Scan().withStartRow(rowKeyStart).withStopRow(rowKeyStop);
scan.setCaching(scanCache);
FilterList filterList = new FilterList(
        FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(),
        new KeyOnlyFilter(),
        new PrefixFilter(myPrefix),
        new PageFilter(limit));
scan.setFilter(filterList);
scan.setCacheBlocks(false);
scan.setLimit(limit);
scan.setReadType(ReadType.PREAD);
scan.setScanMetricsEnabled(true);

ResultScanner scanner = myTable.getScanner(scan);
int processed = 0;
for (Result row : scanner.next(limit))
{
    // do something with this row
    if (++processed >= limit)
        break;
}
ScanMetrics sm = scanner.getScanMetrics();
long scanned = sm.countOfRowsScanned.get();
long filtered = sm.countOfRowsFiltered.get(); // WHAT IS THIS???
scanner.close();
Answer 1:
I believe I have found the answer:
I was issuing Deletes that specified only the rowKey (even though each row has only one column). In that case a delete marker is placed on the row and the row is excluded from all scans and gets, BUT it remains physically present in the underlying store files even after major compactions. The Scan therefore spends extra time iterating over those deleted rows and filtering them out to produce the final result that excludes them.
It looks like the row only gets removed from the underlying store if the Delete is fully qualified with the RowKey, ColumnFamily, ColumnName, AND TimeStamp of ALL of its columns.
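For illustration, this is roughly what the two kinds of Delete look like in the client API. This is only a sketch: the family name "cf", the qualifier "q", and the variable cellTimestamp are placeholders for whatever the table actually uses; myTable and rowKey are the same objects as in the question's code.

// Row-only delete: only a delete marker is written; the deleted cells
// linger in the store files and still have to be filtered out by scans.
Delete rowOnlyDelete = new Delete(rowKey);
myTable.delete(rowOnlyDelete);

// Fully qualified delete: rowKey + column family + qualifier + timestamp
// of the specific cell version being deleted.
Delete qualifiedDelete = new Delete(rowKey);
qualifiedDelete.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), cellTimestamp);
myTable.delete(qualifiedDelete);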
FURTHERMORE: it seems it is not sufficient to just run a major compaction. The table first needs to be flushed and THEN major-compacted; only then are the deleted rows fully gone and the Scan stops spending extra time filtering them out.
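A minimal sketch of that flush-then-major-compact sequence with the Admin API, assuming an open Connection named connection; the table name below is a placeholder. The same can be done from the hbase shell with flush 'my_table' followed by major_compact 'my_table'.

// Flush the memstore to HFiles first, then major-compact so the delete
// markers and the deleted cells are dropped from the store files.
try (Admin admin = connection.getAdmin()) {
    TableName tn = TableName.valueOf("my_table"); // placeholder table name
    admin.flush(tn);
    admin.majorCompact(tn); // asynchronous: the compaction is queued on the region servers
}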
This is harder than I thought...
Source: https://stackoverflow.com/questions/52318310/what-exactly-is-countofrowsfiltered-in-scanmetrics-with-hbase-scan