Match combinations of row values between 2 different data frames

Question

I have a data.frame with the 16 possible combinations of 4 cell markers:

combinations_df

     FITC Cy3 TX_RED Cy5
 a    0   0      0   0
 b    1   0      0   0
 c    0   1      0   0
 d    1   1      0   0
 e    0   0      1   0
 f    1   0      1   0
 g    0   1      1   0
 h    1   1      1   0
 i    0   0      0   1
 j    1   0      0   1
 k    0   1      0   1
 l    1   1      0   1
 m    0   0      1   1
 n    1   0      1   1
 o    0   1      1   1
 p    1   1      1   1

I have my "main" data.frame with 10 columns and thousands of rows.

> main_df
  a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1    1 1   1 0      1 1   1 1
2 0 1    0 1   1 0      1 0   1 1
3 1 1    0 0   0 1      1 0   0 0
4 0 1    1 1   1 0      1 1   1 1
5 0 0    0 0   0 0      0 0   0 0
....

I want to compare each row of main_df against all 16 combinations in combinations_df and record the row name of the matching combination. The result should be a new vector that I can later cbind to main_df as column 11.

sample output

> phenotype
[1] "g" "i" "a" "p" "g"

I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.

Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.

EDIT: I forgot to mention that I want to compare combinations_df only to columns 3, 5, 7 and 9 of main_df. They have the same names as the columns of combinations_df, but it might not be that obvious.

EDIT: Changed the sample output, since no "t" should be present.

Answer 1:

The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:

#   phenotype FITC Cy3 TX_RED Cy5
#1          a    0   0      0   0
#2          b    1   0      0   0
#3          c    0   1      0   0
#4          d    1   1      0   0
# etc
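As a minimal sketch of that first step in base R: the block below rebuilds combinations_df with expand.grid (an assumption made only so the example is self-contained; it happens to reproduce the row order shown in the question) and then copies the row names into an explicit phenotype column.

```r
# Rebuild the 16 combinations; expand.grid varies the first column fastest,
# which matches the a..p ordering shown in the question (assumption for this sketch).
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

# Turn the row names into an ordinary column so they can take part in a join.
combinations_df$phenotype <- rownames(combinations_df)
combinations_df <- combinations_df[, c("phenotype", "FITC", "Cy3", "TX_RED", "Cy5")]
rownames(combinations_df) <- NULL

head(combinations_df, 4)
```

If combinations_df is already loaded with letters as row names, only the three lines after the second comment are needed.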

dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.

library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))

#  a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1    1 1   1 0      1 1   1 1         p
#2 0 1    0 1   1 0      1 0   1 1         o
#3 1 1    0 0   0 1      1 0   0 0         e
#4 0 1    1 1   1 0      1 1   1 1         p
#5 0 0    0 0   0 0      0 0   0 0         a

I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.

Answer 2:

It's not very elegant, but this method works. There are no nested loops here, so it should run reasonably fast. You might try matching on the data frame rows directly and do away with the loops altogether, but this was the quickest way I could figure out. You might also look at the plyr or data.table packages, which are very powerful for this kind of thing.

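# Build one text key per row of main_df from the marker columns (3, 5, 7, 9)
main_text <- character(nrow(main_df))
for (i in seq_len(nrow(main_df))) {
  main_text[i] <- paste(main_df[i, 3], main_df[i, 5], main_df[i, 7], main_df[i, 9], sep = "")
}
# Build the same kind of key for each row of combinations_df
comb_text <- character(nrow(combinations_df))
for (i in seq_len(nrow(combinations_df))) {
  comb_text[i] <- paste(combinations_df[i, 1], combinations_df[i, 2], combinations_df[i, 3], combinations_df[i, 4], sep = "")
}

# Look up each main_df key among the combination keys and return its row name
rownames(combinations_df)[match(main_text, comb_text)]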

Answer 3:

How about something like this? My results are different from yours, as there is no "t" in combination_df. You could do it without assigning a new column if you wanted; the key columns are mainly for illustrative purposes.

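# Read both tables (file paths as used by the answerer)
combination_df <- read.table("Documents/comb.txt.txt", header = TRUE)
main_df <- read.table("Documents/main.txt", header = TRUE)

main_df
combination_df
# Build a text key per row: marker columns 3, 5, 7, 9 of main_df, all columns of combination_df
main_df$key <- do.call(paste0, main_df[, c(3, 5, 7, 9)])
combination_df$key <- do.call(paste0, combination_df)

# Row name of the matching combination for each row of main_df (NA if there is no match)
rownames(combination_df)[match(main_df$key, combination_df$key)]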
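For anyone who wants to try the paste-and-match approach without the input files, here is a self-contained sketch. It assumes the five sample rows from the question and rebuilds the combinations with expand.grid; the `markers` vector is a name introduced only for this example. It selects the marker columns by name rather than by position and stores the result as column 11, as the question asked.

```r
# Five sample rows of main_df, typed in from the question.
main_df <- data.frame(
  a = c(0, 0, 1, 0, 0), b = c(1, 1, 1, 1, 0), FITC = c(1, 0, 0, 1, 0),
  d = c(1, 1, 0, 1, 0), Cy3 = c(1, 1, 0, 1, 0), f = c(0, 0, 1, 0, 0),
  TX_RED = c(1, 1, 1, 1, 0), h = c(1, 0, 0, 1, 0), Cy5 = c(1, 1, 0, 1, 0),
  j = c(1, 1, 0, 1, 0)
)

# The 16 combinations with row names a..p (rebuilt here as an assumption;
# expand.grid reproduces the ordering shown in the question).
combination_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combination_df) <- letters[1:16]

# Marker columns shared by both data frames (hypothetical helper name).
markers <- c("FITC", "Cy3", "TX_RED", "Cy5")

# One vectorised key per row, e.g. "0111", built without any loop.
main_key <- do.call(paste0, main_df[markers])
comb_key <- do.call(paste0, combination_df[markers])

# match() returns the position of each key among the combinations (NA if absent).
main_df$phenotype <- rownames(combination_df)[match(main_key, comb_key)]

main_df$phenotype
# [1] "p" "o" "e" "p" "a"
```

Both paste0 and match are vectorised, so this should scale to around a million rows far better than a row-by-row loop, though that has not been timed here.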

Source: https://stackoverflow.com/questions/40184661/match-combinations-of-row-values-between-2-different-data-frames

Tags

loops

dataframe

pattern-matching