Tidy data.frame with repeated column names

倾然丶 夕夏残阳落幕 提交于 2019-11-26 22:01:22

问题


I have a program that gives me data in this format

toy
                file_path Condition Trial.Num A B  C  ID A B  C   ID  A B  C    ID
1     root/some.extension  Baseline         1 2 3  5 car 2 1  7 bike  4 9  0 plane
2    root/thing.extension  Baseline         2 3 6 45 car 5 4  4 bike  9 5  4 plane
3     root/else.extension  Baseline         3 4 4  6 car 7 5  4 bike 68 7 56 plane
4 root/uniquely.extension Treatment         1 5 3  7 car 1 7 37 bike  9 8  7 plane
5  root/defined.extension Treatment         2 6 7  3 car 4 6  8 bike  9 0  8 plane

My goal is to tidy the format into something that at least can be easier to finally tidy with reshape having unique column names

tidy_toy
                 file_path Condition Trial.Num  A B  C    ID
1      root/some.extension  Baseline         1  2 3  5   car
2     root/thing.extension  Baseline         2  3 6 45   car
3      root/else.extension  Baseline         3  4 4  6   car
4  root/uniquely.extension Treatment         1  5 3  7   car
5   root/defined.extension Treatment         2  6 7  3   car
6      root/some.extension  Baseline         1  2 1  7  bike
7     root/thing.extension  Baseline         2  5 4  4  bike
8      root/else.extension  Baseline         3  7 5  4  bike
9  root/uniquely.extension Treatment         1  1 7 37  bike
10  root/defined.extension Treatment         2  4 6  8  bike
11     root/some.extension  Baseline         1  4 9  0 plane
12    root/thing.extension  Baseline         2  9 5  4 plane
13     root/else.extension  Baseline         3 68 7 56 plane
14 root/uniquely.extension Treatment         1  9 8  7 plane
15  root/defined.extension Treatment         2  9 0  8 plane

If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.

Here's the dput of both tables

   structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
    "root/else.extension", "root/some.extension", "root/thing.extension", 
    "root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
    1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"), 
        Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L, 
        4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L, 
        1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L, 
        5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L, 
        4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"), 
        A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L, 
        4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path", 
    "Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C", 
    "ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
    -5L))


structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L, 
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
"root/else.extension", "root/some.extension", "root/thing.extension", 
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline", 
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L, 
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L, 
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L, 
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L, 
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike", 
"car", "plane"), class = "factor")), .Names = c("file_path", 
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
-15L))

回答1:


You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A','^B','^C','^ID'), 
                 value.name = c('A','B','C','ID'))

which gives:

 > tidy_toy
                  file_path condition Trial.Num variable  A B  C    ID
 1:     root/some.extension  Baseline         1        1  2 3  5   car
 2:    root/thing.extension  Baseline         2        1  3 6 45   car
 3:     root/else.extension  Baseline         3        1  4 4  6   car
 4: root/uniquely.extension Treatment         1        1  5 3  7   car
 5:  root/defined.extension Treatment         2        1  6 7  3   car
 6:     root/some.extension  Baseline         1        2  2 1  7  bike
 7:    root/thing.extension  Baseline         2        2  5 4  4  bike
 8:     root/else.extension  Baseline         3        2  7 5  4  bike
 9: root/uniquely.extension Treatment         1        2  1 7 37  bike
10:  root/defined.extension Treatment         2        2  4 6  8  bike
11:     root/some.extension  Baseline         1        3  4 9  0 plane
12:    root/thing.extension  Baseline         2        3  9 5  4 plane
13:     root/else.extension  Baseline         3        3 68 7 56 plane
14: root/uniquely.extension Treatment         1        3  9 8  7 plane
15:  root/defined.extension Treatment         2        3  9 0  8 plane

Another option is to use a list of column-indexes for measure.vars:

tidy_toy <- melt(setDT(toy), 
                 measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)), 
                 value.name = c('A','B','C','ID'))

Making the column-names unique isn't necessary then.


A more complicated method that creates names that are better distinguishable by the patterns argument:

# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms, 
                         rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                             each = length(names(tt)[tt > 1])), 
                         sep = '.')

# your columnnames are now unique:
> names(toy)
 [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
 [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     

# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                 value.name = c('A','B','C','ID'))

which will give the same end-result.


As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.




回答2:


with tidyverse we can do :

library(tidyverse)
toy %>%
  repair_names(sep="_") %>%
  pivot_longer(-(1:3),names_to = c(".value","id"), names_sep="_") %>%
  select(-id)

#> # A tibble: 15 x 7
#>    file_path               Condition Trial.Num     A     B     C ID   
#>    <fct>                   <fct>         <int> <int> <int> <int> <fct>
#>  1 root/some.extension     Baseline          1     2     3     5 car  
#>  2 root/some.extension     Baseline          1     2     1     7 bike 
#>  3 root/some.extension     Baseline          1     4     9     0 plane
#>  4 root/thing.extension    Baseline          2     3     6    45 car  
#>  5 root/thing.extension    Baseline          2     5     4     4 bike 
#>  6 root/thing.extension    Baseline          2     9     5     4 plane
#>  7 root/else.extension     Baseline          3     4     4     6 car  
#>  8 root/else.extension     Baseline          3     7     5     4 bike 
#>  9 root/else.extension     Baseline          3    68     7    56 plane
#> 10 root/uniquely.extension Treatment         1     5     3     7 car  
#> 11 root/uniquely.extension Treatment         1     1     7    37 bike 
#> 12 root/uniquely.extension Treatment         1     9     8     7 plane
#> 13 root/defined.extension  Treatment         2     6     7     3 car  
#> 14 root/defined.extension  Treatment         2     4     6     8 bike 
#> 15 root/defined.extension  Treatment         2     9     0     8 plane
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 4].


来源:https://stackoverflow.com/questions/41728439/tidy-data-frame-with-repeated-column-names

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!