Question
I have a CSV file with more than 60 columns and 2,000,000 lines. I'm trying to count the number of null values per variable (per column), then sum that new row to get the total number of null values in the entire CSV. For example, if we get this file as input:
We expect this other file as output:
I know how to count the number of null values per line, but I haven't figured out how to count the number of null values per column.
Answer 1:
There has to be a better way to do this, but I wrote a really nasty JavaScript that does the job.
It has some problems with different column types, because it doesn't set the column type. (It should set all columns to integer, but I don't know if that is possible from JavaScript.)
You have to run the "Identify last row in a stream" step first and save its result to the column last (or change the script).
var nulls;
var seen;

if (!seen) {
    // Initialize the array with one counter per input column
    seen = 1;
    nulls = [];
    for (var i = 0; i < getInputRowMeta().size(); i++) {
        nulls[i] = 0;
    }
}

for (var i = 0; i < getInputRowMeta().size(); i++) {
    if (row[i] == null) {
        nulls[i] += 1;
    }
    // Hack to find empty strings (value meta type 2 is String)
    else if (getInputRowMeta().getValueMeta(i).getType() == 2 && row[i].length() == 0) {
        nulls[i] += 1;
    }
}

// Don't store any values
trans_Status = SKIP_TRANSFORMATION;

// Only store the nulls at the last row
if (last == true) {
    putRow(nulls);
}
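For reference, the same counting logic can be tried outside Pentaho. The following is a minimal sketch in plain JavaScript (Node.js), assuming rows arrive as arrays of values and that an empty string should count as null, as in the script above. The function name countNullsPerColumn and the sample data are hypothetical and not part of the Kettle API. It also adds up the per-column counts to get the total asked for in the question.

```javascript
// Hypothetical helper: counts null/empty values per column in an array of rows.
function countNullsPerColumn(rows, columnCount) {
  const nulls = new Array(columnCount).fill(0);
  for (const row of rows) {
    for (let i = 0; i < columnCount; i++) {
      const value = row[i];
      // Treat null, undefined and empty strings as null (an assumption)
      if (value === null || value === undefined || value === "") {
        nulls[i] += 1;
      }
    }
  }
  // Sum of the per-column counts = total number of nulls in the file
  const total = nulls.reduce((sum, n) => sum + n, 0);
  return { perColumn: nulls, total: total };
}

// Made-up sample data with three columns
const sampleRows = [
  ["a", null, 3],
  ["", null, 4],
  ["c", "x", 5],
];
const sampleResult = countNullsPerColumn(sampleRows, 3);
console.log(sampleResult.perColumn, sampleResult.total);
```

Unlike the Kettle script, this version receives all rows at once, so it does not need a "last row" flag to know when to emit the counts.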
Answer 2:
Drag and drop the steps below onto the canvas.
Step 1: Add constants: create one field called constant with value = 1.
Step 2: Filter rows: here you have to filter the null values of all columns.
Step 3: Group by: group by the constant field; in the Aggregates section, specify the remaining columns, such as ct_inc, with type Number of Values (N).
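One possible reading of these steps can be sketched in plain JavaScript. This is only an illustration under two assumptions that the answer does not state: that the Number of Values (N) aggregate counts non-null values only, and that the null count per column is then obtained as the total row count minus N. The function name nullsViaNonNullCount and the column other are hypothetical; ct_inc is the example column from the answer.

```javascript
// Hypothetical sketch: every row shares one constant group key, so the
// "group by" collapses the whole input into a single row of counts.
function nullsViaNonNullCount(rows, columnNames) {
  const nonNull = {};
  for (const name of columnNames) {
    nonNull[name] = 0;
  }
  for (const row of rows) {
    for (const name of columnNames) {
      // Assumption: the aggregate counts only non-null values
      if (row[name] !== null && row[name] !== undefined) {
        nonNull[name] += 1;
      }
    }
  }
  // Nulls per column = total rows minus non-null values in that column
  const nulls = {};
  for (const name of columnNames) {
    nulls[name] = rows.length - nonNull[name];
  }
  return nulls;
}

// Made-up sample data
const groupedRows = [
  { ct_inc: 1, other: null },
  { ct_inc: null, other: null },
  { ct_inc: 3, other: "x" },
];
console.log(nullsViaNonNullCount(groupedRows, ["ct_inc", "other"]));
```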
Source: https://stackoverflow.com/questions/35368635/count-the-number-of-null-value-per-column-with-pentaho