Using dplyr to create new dataframe depending on thresholds

Question

   Groups Names COL1  COL2  COL3        COL4
1      G1   SP1    1 0.400 0.500   Sequence1
2      G1   SP1    1 0.004 0.005   Sequence2
3      G1   SP1    0 0.004 0.005   Sequence3
4      G1   SP2    0 0.400 0.005 Sequence123
5      G1   SP2    0 0.004 0.500  Sequence14
6      G1   SP3    0 0.005 0.006  Sequence15
7      G1   SP5    1 0.400 0.006  Sequence16
8      G1   SP6    1 0.008 0.002  Sequence20
10     G2   Sp1    0 0.004 0.005  Sequence17
11     G2   SP1    0 0.050 0.600  Sequence18
12     G2   SP1    0 0.400 0.600   Sequence3
13     G2   SP2    0 0.004 0.005  Sequence22
14     G2   SP2    0 0.004 0.005  Sequence23
15     G2   SP5    0 0.004 0.005  Sequence16
16     G2   SP6    0 0.003 0.002  Sequence21
17     G2   SP7    0 0.560 0.760  Sequence67

Here is the dput:

dput(test_df)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), 
    Names = structure(c(2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 1L, 2L, 
    2L, 3L, 3L, 5L, 6L, 7L), .Label = c("Sp1", "SP1", "SP2", 
    "SP3", "SP5", "SP6", "SP7"), class = "factor"), COL1 = c(1L, 
    1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), COL2 = c(0.4, 0.004, 0.004, 0.4, 0.004, 0.005, 0.4, 0.008, 
    0.004, 0.05, 0.4, 0.004, 0.004, 0.004, 0.003, 0.56), COL3 = c(0.5, 
    0.005, 0.005, 0.005, 0.5, 0.006, 0.006, 0.002, 0.005, 0.6, 
    0.6, 0.005, 0.005, 0.005, 0.002, 0.76), COL4 = structure(c(1L, 
    8L, 13L, 2L, 3L, 4L, 5L, 9L, 6L, 7L, 13L, 11L, 12L, 5L, 10L, 
    14L), .Label = c("Sequence1", "Sequence123", "Sequence14", 
    "Sequence15", "Sequence16", "Sequence17", "Sequence18", "Sequence2", 
    "Sequence20", "Sequence21", "Sequence22", "Sequence23", "Sequence3", 
    "Sequence67"), class = "factor")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "13", "14", 
"15", "16", "17"))

From this dataframe I want to get another dataframe such as:

    G1  G2
SP1 A   B
SP2 x   x
SP3 x   NA
SP4 NA  NA
SP5 A   X
SP6 a   x
SP7 NA  b

The idea is, for each Groups, to list the Names that are present and to put the letters A, B, X or NA in the cells. Whether a letter is uppercase or lowercase depends on whether an identical COL4 value is found for that Name in another Groups.

A is put when any row for the Name has COL1 > 0 AND at least one row of the same Name in a different Groups has the same COL4 content
a is put when any row for the Name has COL1 > 0 AND no row of the same Name in a different Groups has the same COL4 content
B is put when any row for the Name has COL1 = 0 AND COL2 AND COL3 > 0.05 AND at least one row of the same Name in a different Groups has the same COL4 content
b is put when any row for the Name has COL1 = 0 AND COL2 AND COL3 > 0.05 AND no row of the same Name in a different Groups has the same COL4 content
X is put when all rows for the Name have COL1 = 0 AND no row has both COL2 AND COL3 > 0.05 AND at least one row of the same Name in a different Groups has the same COL4 content
x is put when all rows for the Name have COL1 = 0 AND no row has both COL2 AND COL3 > 0.05 AND no row of the same Name in a different Groups has the same COL4 content
NA is put when the Name is not present in the Groups

Let's take 4 examples:

1) For G1-SP1, row 1 has COL1 > 0, so it will get the letter A or a in the new dataframe. To decide between A and a we look at COL4: row 3 of G1-SP1 holds Sequence3, and row 12 shows that Sequence3 is also present in G2 for SP1, so it will be 'A'.

2) For G2-SP1, row 12 has COL2 and COL3 > 0.05, so it will get the letter B or b in the new dataframe. It will be B because Sequence3 is also present in G1 for SP1 (row 3).

3) For G2-SP2, no row has COL1 > 0, or COL2 and COL3 > 0.05, so it will get the letter X or x in the new dataframe. It will be x because no other SP2 in another Groups has the same sequence (Sequence22 or Sequence23).

4) For G1-SP6, row 8 has COL1 > 0, so it will get the letter A or a in the new dataframe. It will be a because no other SP6 in another Groups has the same sequence (Sequence20).

For that I tried:

Env_table<-as.data.frame(test_df) %>%
  group_by(Groups,Names) %>%
  mutate(Env_variable = replace_na(COL1, "."),
         Env_variable = ifelse(any(COL1 >=1) , "A", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 >0.05) && all(COL3 >0.05) , "B", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 <0.05) && all(COL3 <0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 <0.05) && all(COL3 >0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 >0.05) && all(COL3 <0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0) && all(!is.na(COL1)) && all(COL2 >0.05) && all(COL3 >0.05) , "*", Env_variable))%>%
  slice(1) %>%
  pivot_wider(id_col = Names, names_from = Groups, values_from = Env_variable) %>%
  arrange(as.integer(str_extract(Names, "\\d+")))

where Env_variable is just an empty column that will store the A, B, X or NA values.

Thanks for your help

Answer 1:

Your question is not crystal-clear, but here is an attempt to answer:

library(dplyr)
library(tidyr)

test_df %>% 
  group_by(Groups, Names) %>% 
  summarise(
    # One letter per Groups/Names pair; the first matching condition wins
    x=case_when(
      any(COL1>=1, na.rm=TRUE) ~ "A",
      any(COL1==0 & (COL2>0.05 & COL3>0.05), na.rm=TRUE) ~ "B",
      any(COL1==0 & (COL2<0.05 | COL3<0.05), na.rm=TRUE) ~ "X",
      TRUE ~ NA_character_
    )
  ) %>% 
  # Spread the groups into columns
  pivot_wider(names_from = Groups, values_from = x)

This would give this output:

  Names G1    G2   
  <fct> <chr> <chr>
1 SP1   A     B    
2 SP2   X     X    
3 SP3   X     NA   
4 SP5   A     X    
5 SP6   A     X    
6 Sp1   NA    X    
7 SP7   NA    B
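
The output above does not yet distinguish uppercase from lowercase. One possible extension, offered as a minimal sketch rather than a definitive solution: flag a row as shared when the same Names carries the same COL4 in more than one Groups, then lowercase the letter when no row of the Groups/Names pair is shared. The `shared` and `letter` columns and the `result` name are introduced here for illustration. The data is rebuilt from the question's dput with character columns, and treating every pair that is neither A nor B as X is an assumption based on example 3 of the question.

```r
library(dplyr)
library(tidyr)

# Same values as the question's dput, rebuilt with plain character columns.
test_df <- data.frame(
  Groups = rep(c("G1", "G2"), each = 8),
  Names  = c("SP1", "SP1", "SP1", "SP2", "SP2", "SP3", "SP5", "SP6",
             "Sp1", "SP1", "SP1", "SP2", "SP2", "SP5", "SP6", "SP7"),
  COL1   = c(1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  COL2   = c(0.4, 0.004, 0.004, 0.4, 0.004, 0.005, 0.4, 0.008,
             0.004, 0.05, 0.4, 0.004, 0.004, 0.004, 0.003, 0.56),
  COL3   = c(0.5, 0.005, 0.005, 0.005, 0.5, 0.006, 0.006, 0.002,
             0.005, 0.6, 0.6, 0.005, 0.005, 0.005, 0.002, 0.76),
  COL4   = paste0("Sequence", c(1, 2, 3, 123, 14, 15, 16, 20,
                                17, 18, 3, 22, 23, 16, 21, 67))
)

result <- test_df %>%
  # A row is "shared" when its Names/COL4 pair occurs in more than one Groups.
  group_by(Names, COL4) %>%
  mutate(shared = n_distinct(Groups) > 1) %>%
  group_by(Groups, Names) %>%
  summarise(
    letter = case_when(
      any(COL1 >= 1, na.rm = TRUE) ~ "A",
      any(COL1 == 0 & COL2 > 0.05 & COL3 > 0.05, na.rm = TRUE) ~ "B",
      TRUE ~ "X"  # assumption: everything that is neither A nor B is X
    ),
    shared = any(shared),
    .groups = "drop"
  ) %>%
  # Keep uppercase only when at least one sequence is shared across Groups.
  mutate(letter = ifelse(shared, letter, tolower(letter))) %>%
  select(-shared) %>%
  pivot_wider(names_from = Groups, values_from = letter)

print(result)
```

On the question's data this should give A and B for SP1, a and x for SP6, and NA and b for SP7. `Sp1` (lowercase p) in G2 stays a separate name, and SP4 does not appear because it is absent from the data.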

Source: https://stackoverflow.com/questions/61595711/using-dplyr-to-create-new-dataframe-depending-on-thresholds

Tags

dataframe

dplyr