Question
I have searched a lot and tried on my own too, but couldn't find a solution for this particular problem.
For every pair of rows that share the same 'key', I have to find the mismatches in each column and report them in an organized way, as below.
The output should be in the following format:
COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...
COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...
COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...
Input Data (it's a data frame):
key V1 V2 V3 V4 V5
a1 1 2 3 4 5
a1 1 3 9 4 5
a5 2 1 4 7 5
a5 2 1 4 7 6
a6 7 6 8 9 6
a6 7 6 3 9 6
a9 7 6 8 9 4
a9 7 6 8 9 3
Output:
V2 is not matching for records below:
key V1 V2 V3 V4 V5
a1 1 2 3 4 5
a1 1 3 9 4 5
V3 is not matching for records below:
key V1 V2 V3 V4 V5
a1 1 2 3 4 5
a1 1 3 9 4 5
a6 7 6 8 9 6
a6 7 6 3 9 6
V5 is not matching for records below:
key V1 V2 V3 V4 V5
a5 2 1 4 7 5
a5 2 1 4 7 6
a9 7 6 8 9 4
a9 7 6 8 9 3
I'm a beginner in R, so please be nice :)
Answer 1:
First, split your data frame by key:
dfs <- split(df, df$key) # presuming your data frame is named `df`
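To try the split yourself you need `df` in your session. The following is a minimal sketch that rebuilds the question's input by hand; the `data.frame()` call is my reconstruction (the question only shows the printed table), and storing `key` as character rather than factor is an assumption.

```r
# Hypothetical reconstruction of the question's input data
df <- data.frame(
  key = c("a1", "a1", "a5", "a5", "a6", "a6", "a9", "a9"),
  V1  = c(1, 1, 2, 2, 7, 7, 7, 7),
  V2  = c(2, 3, 1, 1, 6, 6, 6, 6),
  V3  = c(3, 9, 4, 4, 8, 3, 8, 8),
  V4  = c(4, 4, 7, 7, 9, 9, 9, 9),
  V5  = c(5, 5, 5, 6, 6, 6, 4, 3),
  stringsAsFactors = FALSE  # assumption: keep `key` as character
)

# One data frame per key, each holding the two rows to compare
dfs <- split(df, df$key)
names(dfs)         # "a1" "a5" "a6" "a9"
sapply(dfs, nrow)  # 2 for every key
```

`split()` returns a named list, so `dfs$a1` or `dfs[["a1"]]` gives you the two a1 rows.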
Now write a function that takes a data frame and compares its first and second rows (for simplicity, we're not going to check whether the data frame actually has 2 rows; that's just taken for granted):
chk <- function(x) sapply(x, function(u) u[1]==u[2])
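If you would rather not take the two-row assumption for granted, a guarded variant could look like the sketch below. The name `chk_safe` and the tiny `pair` data frame are made up for illustration; they are not part of the original answer.

```r
# Hypothetical defensive variant of chk(): stop if a key does not have exactly 2 rows
chk_safe <- function(x) {
  if (nrow(x) != 2) {
    stop("expected exactly 2 rows per key, got ", nrow(x))
  }
  # Column by column: does row 1 equal row 2?
  sapply(x, function(u) u[1] == u[2])
}

# Toy pair of rows for one key
pair <- data.frame(key = c("a1", "a1"), V1 = c(1, 1), V2 = c(2, 3))
chk_safe(pair)
#   key    V1    V2
#  TRUE  TRUE FALSE
```

Note that `==` yields `NA` when either value is missing, so data with `NA`s would need extra handling.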
Now apply that function to the split data:
matches <- sapply(dfs,chk)
## so `matches` is a matrix showing, for each variable (row) and each key (column),
## whether the two rows match or not
apply(matches, 1, function(x) colnames(matches)[which(!x)])
## and this one takes each row in `matches` and extracts the column name (i.e. key)
## for every FALSE-valued cell, i.e. every mismatch. the result is a list - note
## that some of the elements will be empty
The last line outputs, for each variable, the keys of the non-matching pairs.
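One caveat with the `apply()` call: it only returns a list when the rows yield different numbers of keys. If every variable happens to mismatch for the same number of keys (two or more), `apply()` simplifies the result to a matrix, and the later `lapply()` step would no longer see one element per variable. The sketch below shows this on a made-up 2x2 `matches` matrix and gives an `lapply()`-based alternative that always returns a named list; the toy matrix and the name `vars` are assumptions for illustration.

```r
# Toy matrix: both variables mismatch for both keys
matches <- matrix(FALSE, nrow = 2, ncol = 2,
                  dimnames = list(c("V1", "V2"), c("a1", "a5")))

# apply() simplifies equal-length results into a matrix
via_apply <- apply(matches, 1, function(x) colnames(matches)[which(!x)])
is.list(via_apply)    # FALSE
is.matrix(via_apply)  # TRUE

# lapply() over the variable names always gives a named list
vars <- rownames(matches)
mm_keys <- lapply(setNames(vars, vars),
                  function(v) colnames(matches)[!matches[v, ]])
is.list(mm_keys)      # TRUE
mm_keys$V1            # "a1" "a5"
```

With the question's data the row results have different lengths, so the original `apply()` call does return a list there.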
And now the final step:
mm_keys <- apply(matches, 1, function(x) colnames(matches)[which(!x)])
# mm_keys stands for mismatching keys
lapply(mm_keys, function(x) subset(df, key %in% x))
# this one, called `mm_lines` below, takes each element from mm_keys
# .. and extracts (via `subset`) the corresponding lines from the original data frame
OK, by now you already have all the information you wanted, just not formatted in a nice way. You can do that easily too.
mm_lines <- lapply(mm_keys, function(x) subset(df, key %in% x))
mm_lines <- mm_lines[sapply(mm_lines, nrow)>0]
# leave out variables where there is no mismatch
# for understanding this, try what `sapply(mm_lines, nrow)` does
# and add labels the way you want:
names(mm_lines) <- paste(names(mm_lines), "IS NOT MATCHING FOR RECORDS BELOW:")
Now the output:
print(mm_lines)
#$`V2 IS NOT MATCHING FOR RECORDS BELOW:`
# key V1 V2 V3 V4 V5
#1 a1 1 2 3 4 5
#2 a1 1 3 9 4 5
#
#$`V3 IS NOT MATCHING FOR RECORDS BELOW:`
# key V1 V2 V3 V4 V5
#1 a1 1 2 3 4 5
#2 a1 1 3 9 4 5
#5 a6 7 6 8 9 6
#6 a6 7 6 3 9 6
#
#$`V5 IS NOT MATCHING FOR RECORDS BELOW:`
# key V1 V2 V3 V4 V5
#3 a5 2 1 4 7 5
#4 a5 2 1 4 7 6
#7 a9 7 6 8 9 4
#8 a9 7 6 8 9 3
[edit]
Since you asked for it, here is something that does it in one line and looks a bit more like magic:
boo <- (function(x) x[sapply(x, nrow)>0])(lapply(lapply(df, function(x) tapply(x, df$key, function(x) x[1]!=x[2])), function(x) subset(df, key %in% names(which(x)))))
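If the one-liner is hard to read, it can be unrolled into named steps. The sketch below is one way to read it, run on a hand-built copy of the question's data; the `data.frame()` call and the intermediate names `mismatch_by_col` and `records` are my own and only there for illustration.

```r
# Hypothetical reconstruction of the question's input data
df <- data.frame(
  key = c("a1", "a1", "a5", "a5", "a6", "a6", "a9", "a9"),
  V1  = c(1, 1, 2, 2, 7, 7, 7, 7),
  V2  = c(2, 3, 1, 1, 6, 6, 6, 6),
  V3  = c(3, 9, 4, 4, 8, 3, 8, 8),
  V4  = c(4, 4, 7, 7, 9, 9, 9, 9),
  V5  = c(5, 5, 5, 6, 6, 6, 4, 3),
  stringsAsFactors = FALSE
)

# Step 1: for each column, one TRUE/FALSE per key (TRUE = the two rows differ)
mismatch_by_col <- lapply(df, function(col) {
  tapply(col, df$key, function(v) v[1] != v[2])
})

# Step 2: for each column, pull the full records of the mismatching keys
records <- lapply(mismatch_by_col, function(m) {
  df[df$key %in% names(which(m)), ]
})

# Step 3: drop columns that have no mismatch at all
boo <- records[sapply(records, nrow) > 0]
names(boo)  # "V2" "V3" "V5"
```

Here `boo` keeps the bare column names, which is why the loop that writes the file adds the label text itself.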
And for writing it to a text file ("out.txt") the way you wanted:
sink("out.txt")
for(iii in seq_along(boo)){
cat(names(boo)[iii], "IS NOT MATCHING FOR THE RECORDS BELOW:\n")
print(boo[[iii]])
cat("\n")
}
sink(NULL)
Answer 2:
You could try using by:
res <- c(with(stack(by(df[,-1], df[,1],
FUN=function(x)names(x)[ x[1,]!=x[2,]])),
by(ind, values, FUN=function(x) df[df[,1] %in% x,])))
names(res) <- paste(names(res), "is not matching for records below")
res
#$`V2 is not matching for records below`
# key V1 V2 V3 V4 V5
#1 a1 1 2 3 4 5
#2 a1 1 3 9 4 5
#$`V3 is not matching for records below`
# key V1 V2 V3 V4 V5
#1 a1 1 2 3 4 5
#2 a1 1 3 9 4 5
#5 a6 7 6 8 9 6
#6 a6 7 6 3 9 6
#$`V5 is not matching for records below`
# key V1 V2 V3 V4 V5
#3 a5 2 1 4 7 5
#4 a5 2 1 4 7 6
#7 a9 7 6 8 9 4
#8 a9 7 6 8 9 3
Source: https://stackoverflow.com/questions/25679815/compare-every-2-rows-and-show-mismatches-in-r