Gather multiple sets of columns

耶瑟儿~ 2020-11-22 03:01

I have data from an online survey where respondents go through a loop of questions 1-3 times. The survey software (Qualtrics) records this data in multiple columns—that is,

5 answers
  •  离开以前
    2020-11-22 03:24

    This can be done with base R's reshape. It is also possible with tidyr and dplyr.

      colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
      colnames(df)[2] <- "Date"
      res <- reshape(df, idvar=c("id", "Date"), varying=3:8, direction="long", sep="_")
      row.names(res) <- 1:nrow(res)
    
      head(res)
      #  id       Date time       Q3.2       Q3.3
      #1  1 2009-01-01    1  1.3709584  0.4554501
      #2  2 2009-01-02    1 -0.5646982  0.7048373
      #3  3 2009-01-03    1  0.3631284  1.0351035
      #4  4 2009-01-04    1  0.6328626 -0.6089264
      #5  5 2009-01-05    1  0.4042683  0.5049551
      #6  6 2009-01-06    1 -0.1061245 -1.7170087
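
    The reshape call above can be tried end to end on a stand-in dataset. This is a minimal sketch: the input column names (Q3.2.1. and so on, with a trailing dot), the seed value and the random values are assumptions for illustration, not the asker's actual survey export.

    ```r
    # Stand-in for the survey export; names and values are assumptions.
    set.seed(42)  # arbitrary seed so reruns give the same numbers
    df <- data.frame(
      id = 1:10,
      time = as.Date("2009-01-01") + 0:9,
      Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
      Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
    )

    # Turn the "." before the loop number into "_": "Q3.2.1." becomes "Q3.2_1."
    colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
    colnames(df)[2] <- "Date"  # free up the name "time" for reshape's loop index

    # reshape splits each name at "_" into a variable name and a loop number,
    # then stacks the three loops on top of each other
    res <- reshape(df, idvar = c("id", "Date"), varying = 3:8,
                   direction = "long", sep = "_")
    row.names(res) <- 1:nrow(res)

    dim(res)    # 10 ids x 3 loops = 30 rows; 5 columns
    names(res)  # id, Date, time, Q3.2, Q3.3
    ```

    Renaming the second column matters here because reshape writes the loop number into a column called time by default, which would otherwise clash with the date column.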
    

    Or using tidyr and dplyr:

      library(tidyr)
      library(dplyr)
      colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))
    
      df %>%
         gather(loop_number, "Q3", starts_with("Q3")) %>% 
         separate(loop_number,c("L1", "L2"), sep="_") %>% 
         spread(L1, Q3) %>%
         select(-L2) %>%
         head()
      #  id       time       Q3.2       Q3.3
      #1  1 2009-01-01  1.3709584  0.4554501
      #2  1 2009-01-01  1.3048697  0.2059986
      #3  1 2009-01-01 -0.3066386  0.3219253
      #4  2 2009-01-02 -0.5646982  0.7048373
      #5  2 2009-01-02  2.2866454 -0.3610573
      #6  2 2009-01-02 -1.7813084 -0.7838389
    

    Update

    With tidyr 1.0.0 or later (pivot_longer first appeared in the development version 0.8.3.9000), we can use pivot_longer to reshape multiple columns. This uses the column names changed by gsub above.

      library(dplyr)
      library(tidyr)
      df %>%
          pivot_longer(cols = starts_with("Q3"),
                names_to = c(".value", "Q3"), names_sep = "_") %>%
          select(-Q3)
      # A tibble: 30 x 4
      #      id time         Q3.2    Q3.3
      #   <int> <date>      <dbl>   <dbl>
      # 1     1 2009-01-01  0.974  1.47  
      # 2     1 2009-01-01 -0.849 -0.513 
      # 3     1 2009-01-01  0.894  0.0442
      # 4     2 2009-01-02  2.04  -0.553 
      # 5     2 2009-01-02  0.694  0.0972
      # 6     2 2009-01-02 -1.11   1.85  
      # 7     3 2009-01-03  0.413  0.733 
      # 8     3 2009-01-03 -0.896 -0.271 
      # 9     3 2009-01-03  0.509 -0.0512
      #10     4 2009-01-04  1.81   0.668 
      # … with 20 more rows
    

    NOTE: The values differ from the earlier output because no seed was set when creating the input dataset.
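
    To get the same numbers on every run, the input can be created right after a call to set.seed. A minimal sketch, assuming an input shaped like the one this answer works with; the column names, the seed value 42 and the result name long are assumptions for illustration.

    ```r
    library(dplyr)
    library(tidyr)

    # Fixing the seed makes rnorm() return the same draws on every run.
    set.seed(42)
    df <- data.frame(
      id = 1:10,
      time = as.Date("2009-01-01") + 0:9,
      Q3.2.1. = rnorm(10), Q3.2.2. = rnorm(10), Q3.2.3. = rnorm(10),
      Q3.3.1. = rnorm(10), Q3.3.2. = rnorm(10), Q3.3.3. = rnorm(10)
    )

    # Same renaming as above: "Q3.2.1." becomes "Q3.2_1."
    colnames(df) <- gsub("\\.(.{2})$", "_\\1", colnames(df))

    # ".value" sends the part before "_" to the output column names (Q3.2, Q3.3);
    # the part after "_" (the loop number) lands in a helper column named Q3.
    long <- df %>%
      pivot_longer(cols = starts_with("Q3"),
                   names_to = c(".value", "Q3"), names_sep = "_") %>%
      select(-Q3)

    dim(long)  # 30 rows (10 ids x 3 loops), 4 columns
    ```

    Leaving out the final select(-Q3) keeps the loop identifier as a column, which may be useful if the order of the loops matters later.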
