I want to know how many items are in my DynamoDB table. From the API guide, one way to do it is using a scan as follows:
Here's how I get the exact item count on my billion-record DynamoDB table:
hive>
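-- Let the job use 100% of the table's provisioned write and read throughput.
set dynamodb.throughput.write.percent = 1;
set dynamodb.throughput.read.percent = 1;
-- Run the query on the MapReduce engine.
set hive.execution.engine = mr;
-- Disable speculative execution so duplicate task attempts do not re-scan the table.
set mapreduce.reduce.speculative=false;
set mapreduce.map.speculative=false;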
CREATE EXTERNAL TABLE dynamodb_table (
  `ID` STRING,
  `DateTime` STRING,
  `ReportedbyName` STRING,
  `ReportedbySurName` STRING,
  `Company` STRING,
  `Position` STRING,
  `Country` STRING,
  `MailDomain` STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "BillionData",
  "dynamodb.column.mapping" = "ID:ID,DateTime:DateTime,ReportedbyName:ReportedbyName,ReportedbySurName:ReportedbySurName,Company:Company,Position:Position,Country:Country,MailDomain:MailDomain"
);
SELECT count(*) FROM dynamodb_table;
* You need an EMR cluster, which comes with Hive and the DynamoDB storage handler installed.
* With this query, the DynamoDB handler in Hive issues parallel scans: multiple MapReduce mappers (workers) each scan a different segment of the table to produce the count. This is much more efficient and faster than a single sequential scan.
* You must be willing to raise the table's read capacity very high for the duration of the job.
* On a decent-sized (20-node) cluster with 10,000 RCU, it took approximately 15 minutes to count a billion records.
* New writes to the DynamoDB table during this period will make the count inconsistent.
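
To make the parallel-scan idea concrete without a cluster, here is a minimal Python sketch of the same technique. It is not what the Hive handler runs internally; it only illustrates splitting a table into segments and summing per-segment counts. The function names `count_segment` and `parallel_count` are hypothetical, and `client` is assumed to be a boto3 low-level DynamoDB client (or any object with a compatible `scan` method). `Select`, `Segment`, `TotalSegments`, `ExclusiveStartKey`, `Count`, and `LastEvaluatedKey` are parameters and response fields of the DynamoDB Scan API.

```python
from concurrent.futures import ThreadPoolExecutor


def count_segment(client, table_name, segment, total_segments):
    """Count the items in one scan segment, following pagination to the end."""
    kwargs = {
        "TableName": table_name,
        "Select": "COUNT",            # return only counts, not item data
        "Segment": segment,           # which slice of the table this worker scans
        "TotalSegments": total_segments,
    }
    count = 0
    while True:
        page = client.scan(**kwargs)
        count += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No more pages in this segment.
            return count
        kwargs["ExclusiveStartKey"] = last_key


def parallel_count(client, table_name, total_segments=8):
    """Scan all segments concurrently and sum the per-segment counts."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(count_segment, client, table_name, seg, total_segments)
            for seg in range(total_segments)
        ]
        return sum(f.result() for f in futures)
```

With boto3 installed and credentials configured, a call would look like `parallel_count(boto3.client("dynamodb"), "BillionData", total_segments=8)`. A count-only scan still reads every item, so it consumes read capacity like a full scan, and the caveat about concurrent writes applies here as well.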