Removing duplicate rows from data.frame (with some details about column ordering)

Question

I have a large data.frame with 12 columns and a lot of rows, but let's simplify:

  Id A1  A2  B1  B2  Result
  1  55  23  62  12  1
  2  23  55  12  62  1                 * (dup of Id 1)
  3  23  6   2   62  1
  4  23  55  62  12  1                 * (dup of Id 1)
  5  21  62  55  23  0                 * (dup of Id 1)
  6 . . . 
  . .
  .   . 
  .     .

Now the ordering of the A's (A1, A2) and B's (B1, B2) does not matter. If two rows have the same A values, e.g. (55, 23), and the same B values, e.g. (62, 12), they are duplicates, no matter how the values are ordered within the A and B columns.

Furthermore, if A_id_x = B_id_y and B_id_x = A_id_y and Result_id_x = 1 - Result_id_y, we also have a duplicate. In other words, row y is row x with the A and B pairs swapped and Result flipped.

How does one go about cleaning this frame of duplicates?

Answer 1:

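For the first part, I would create a new variable, something like this:

    tc <- 'Id A1  A2  B1  B2  Result
      1  55  23  62  12  1
      2  23  55  12  62  1
      3  23  6   2   62  1
      4  23  55  62  12  1
      5  21  62  55  23  0'

    df <- read.table(textConnection(tc), header=TRUE)

    # Order-independent key built from the A columns only (min first, then max).
    # The separator keeps keys unambiguous, e.g. "23_55" vs "2_355".
    df$tmp <- paste(apply(df[,2:3],1,min), apply(df[,2:3],1,max), sep='_')

    # Keep the first row for each key
    subset(df, !duplicated(tmp))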

For the second part your notation is quite confusing, but maybe you can follow a similar procedure.
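As a minimal sketch of how the same key idea could cover both the A pair and the B pair (the first rule in the question), the snippet below builds one key per row from both sorted pairs. The helper `pair_key` is a hypothetical name introduced here for illustration, and the swap-and-flip rule for Result is not handled.

```r
tc <- 'Id A1  A2  B1  B2  Result
  1  55  23  62  12  1
  2  23  55  12  62  1
  3  23  6   2   62  1
  4  23  55  62  12  1
  5  21  62  55  23  0'

df <- read.table(text = tc, header = TRUE)

# Hypothetical helper: order-independent key for one pair of columns, e.g. "23_55"
pair_key <- function(x, y) paste(pmin(x, y), pmax(x, y), sep = "_")

# Combine the A key and the B key so both pairs must match
df$tmp <- paste(pair_key(df$A1, df$A2), pair_key(df$B1, df$B2), sep = "|")

# Keep the first row for each combined key
res <- subset(df, !duplicated(tmp))
res$Id
```

`pmin` and `pmax` are vectorised, so this avoids the row-wise `apply` calls and should scale better on a large data.frame.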

Answer 2:

How about this:

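    tc <- 'Id A1  A2  B1  B2  Result
      1  55  23  62  12  1
      2  23  55  12  62  1
      3  213  6   2   62  1
      4  23  55  62  12  1
      5  21  62  55  23  0'

    x <- read.table(textConnection(tc), header=TRUE)

    # One copy of the data per (A, B) column pairing
    a1b1 <- transform(x, combi="a1b1", a=A1, b=B1)
    a1b2 <- transform(x, combi="a1b2", a=A1, b=B2)
    a2b1 <- transform(x, combi="a2b1", a=A2, b=B1)
    a2b2 <- transform(x, combi="a2b2", a=A2, b=B2)

    # Stack the copies and flag every (a, b) pair that has already appeared
    x_long <- rbind(a1b1, a1b2, a2b1, a2b2)
    idx <- duplicated(x_long[, c("a", "b")])
    dup_ids <- unique(x_long[idx, "Id"])
    unique_ids <- setdiff(x_long$Id, dup_ids)

    # Note: the first row of a duplicate group can be flagged too
    # (here Ids 1, 2 and 4 are all dropped).
    # Select by Id value rather than by row position
    x[x$Id %in% unique_ids, ]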

Regarding the Result part, it is not clear to me what you mean.

Answer 3:

Check out the allelematch package. While this package is primarily intended for finding matching rows in a data.frame consisting of allelic genotype data, it will work on data from any source.

It may be of particular interest to you as you are working with a case where you need to move beyond the perfect matching functionality provided by duplicated(). allelematch handles missing data, and mismatching data (i.e. where not all elements of two row vectors match or are present). It returns candidate matches by identifying rows of the data frame that are most similar.

This may be more functionality than you need - it sounds as if your columns have been permuted in some consistent way (it is not exactly clear from your post what this permutation is). However, if identifying the consistent permutation is itself a challenge, then this empirical approach might help.

Answer 4:

I ended up using Excel VBA programming to solve the problem.

This was the procedure:

1. Sort the A values and the B values within each row.
2. For rows where Result = 0, swap the positions of A and B and change Result to 1.
3. Remove duplicates.
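
The answer used Excel VBA, but the same three steps could be sketched in R roughly as follows. This is only an illustration, not the VBA code that was actually used. It assumes Result is always 0 or 1, and it adds a hypothetical Id 6 (not in the original post) that mirrors Id 1 with A and B swapped and Result flipped, so the second rule has something to match.

```r
# Sample rows from the question, plus a hypothetical Id 6 for illustration
tc <- 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 23 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0
6 12 62 55 23 0'
df <- read.table(text = tc, header = TRUE)

# Step 1: sort within each pair so column order no longer matters
a_lo <- pmin(df$A1, df$A2); a_hi <- pmax(df$A1, df$A2)
b_lo <- pmin(df$B1, df$B2); b_hi <- pmax(df$B1, df$B2)

# Step 2: for rows with Result == 0, swap the A and B pairs
# (equivalent to setting Result to 1, so Result drops out of the key)
flip <- df$Result == 0
canon <- data.frame(
  a_lo = ifelse(flip, b_lo, a_lo),
  a_hi = ifelse(flip, b_hi, a_hi),
  b_lo = ifelse(flip, a_lo, b_lo),
  b_hi = ifelse(flip, a_hi, b_hi)
)

# Step 3: keep the first row of each canonical form
deduped <- df[!duplicated(canon), ]
deduped$Id
```

With these rows, Ids 1, 3 and 5 remain. Id 5 as posted has 21 where Id 1 has 12, so it is not an exact mirror of Id 1 and survives, whereas the hypothetical Id 6 is removed.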

Source: https://stackoverflow.com/questions/8989073/removing-duplicate-rows-from-data-frame-with-some-details-about-column-ordering

Tags

duplicates