Question
I have a large data.frame with 12 columns and many rows, but let's simplify:
Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1 * (dup of Id 1)
3 23 6 2 62 1
4 23 55 62 12 1 * (dup of Id 1)
5 21 62 55 23 0 * (dup of Id 1)
...
Now the ordering of the A's (A1, A2) and B's (B1, B2) does not matter. If two rows have the same A values, e.g. (55, 23), and the same B values, e.g. (62, 12), they are duplicates, regardless of the ordering within the A and B variables.
Furthermore, if A_id_x = B_id_y and B_id_x = A_id_y and Result_id_x = 1 - Result_id_y, we also have a duplicate.
How does one go about cleaning this frame of duplicates?
Answer 1:
For the first one I would create a new variable doing something like this:
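tc = 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 23 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0'
df = read.table(textConnection(tc), header=TRUE)
# Order-independent key: min and max of the A pair and of the B pair,
# joined with '_' so that keys stay unambiguous
df$tmp = paste(apply(df[,2:3],1,min), apply(df[,2:3],1,max),
               apply(df[,4:5],1,min), apply(df[,4:5],1,max), sep='_')
# Keep the first row for each key
subset(df, !duplicated(tmp))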
For the second part your notation is quite confusing, but maybe you can follow a similar procedure.
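One detail worth checking when a key is built by pasting numbers together: without a separator, distinct pairs can collapse to the same string. The small sketch below illustrates this with made-up values (2, 355) and (23, 55), which are not from the question's data.

```r
# Hypothetical (min, max) pairs: (2, 355) and (23, 55)
lo <- c(2, 23)
hi <- c(355, 55)

key_nosep <- paste(lo, hi, sep = "")   # both become "2355"
key_sep   <- paste(lo, hi, sep = "_")  # "2_355" and "23_55"

duplicated(key_nosep)  # the second pair is wrongly flagged as a duplicate
duplicated(key_sep)    # no duplicates
```

Any separator that cannot occur inside the values themselves would do.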
Answer 2:
How about this:
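tc = 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 213 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0'
x <- read.table(textConnection(tc), header=TRUE)
# One copy of the data per (A, B) combination, stacked in long form
a1b1 <- transform(x, combi="a1b1", a=A1, b=B1)
a1b2 <- transform(x, combi="a1b2", a=A1, b=B2)
a2b1 <- transform(x, combi="a2b1", a=A2, b=B1)
a2b2 <- transform(x, combi="a2b2", a=A2, b=B2)
x_long <- rbind(a1b1,a1b2,a2b1,a2b2)
# Flag every (a, b) pair that already appeared earlier in x_long
idx <- duplicated(x_long[,c("a", "b")])
dup_ids <- unique(x_long[idx, "Id"])
unique_ids <- setdiff(x_long$Id, dup_ids)
# Select by Id value rather than by row position.
# Note: the first copy of a permuted duplicate can be flagged too (here Id 1),
# so this returns only rows that share no (a, b) pair with another row.
x[x$Id %in% unique_ids,]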
Regarding the Result part, it is not clear to me what you mean.
Answer 3:
Check out the allelematch package. While this package is primarily intended for finding matching rows in a data.frame consisting of allelic genotype data, it will work on data from any source.
It may be of particular interest to you, as you are working with a case where you need to move beyond the perfect matching functionality provided by duplicated(). allelematch handles missing data and mismatching data (i.e. where not all elements of two row vectors match or are present). It returns candidate matches by identifying rows of the data frame that are most similar.
This may be more functionality than you need: it sounds as if your columns have been permuted in some consistent way (it is not exactly clear what this is from your post). However, if identifying the consistent permutation is itself a challenge, then this empirical approach might help.
Answer 4:
I ended up using Excel VBA programming to solve the problem.
This was the procedure:
1. Internally sort the A values and the B values within each row.
2. For rows where Result = 0, swap the positions of A and B and change Result to 1.
3. Remove duplicates.
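The procedure above was done in Excel VBA, but the same steps can be sketched in R. This is a minimal illustration, not the poster's code. It assumes Result only takes the values 0 and 1, and it assumes that the 21 in row 5 of the sample data was meant to be 12, since otherwise that row does not match Id 1 under the stated rule. The names `canon` and `deduped` are made up for this example.

```r
# Sample rows from the question.
# ASSUMPTION: Id 5 uses A1 = 12 instead of the 21 shown in the question.
df <- read.table(text = "Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 23 6 2 62 1
4 23 55 62 12 1
5 12 62 55 23 0", header = TRUE)

# Step 1: sort within each pair so the column order no longer matters.
a_lo <- pmin(df$A1, df$A2); a_hi <- pmax(df$A1, df$A2)
b_lo <- pmin(df$B1, df$B2); b_hi <- pmax(df$B1, df$B2)

# Step 2: for rows with Result == 0, swap the A and B pairs and flip Result.
flip <- df$Result == 0
canon <- data.frame(
  A_lo   = ifelse(flip, b_lo, a_lo),
  A_hi   = ifelse(flip, b_hi, a_hi),
  B_lo   = ifelse(flip, a_lo, b_lo),
  B_hi   = ifelse(flip, a_hi, b_hi),
  Result = ifelse(flip, 1 - df$Result, df$Result)
)

# Step 3: keep the first original row for each canonical form.
deduped <- df[!duplicated(canon), ]
deduped
```

With the sample above this keeps Ids 1 and 3. The original rows are returned unchanged; `canon` is used only as the comparison key.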
Source: https://stackoverflow.com/questions/8989073/removing-duplicate-rows-from-data-frame-with-some-details-about-column-ordering