How to convert each value within each column into a sub-column without duplication

Question

I have a data.frame like this (input):

1 200 444 444
2 310 NA  444
3 310 NA  444
4 NA  444 444
5 200 444 444
6 200 NA  444
7 310 444 444 
8 310 876 444
9 310 876 444
10 NA  876 444

I want to convert each distinct value within each column into a sub-column, and fill each row with 1 or 0 to indicate whether that value was observed in that row. Desired output data.frame:

   c1.200 c1.310 c2.444 c2.876 c3.444
1   1      0      1      0      1 
2   0      1      0      0      1
3   0      1      0      0      1
4   0      0      1      0      1
5   1      0      1      0      1
6   1      0      0      0      1
7   0      1      1      0      1
8   0      1      0      1      1
9   0      1      0      1      1
10  0      0      0      1      1

Is there a solution in R for this? Note that my real data has 117,000 rows and 10,000 columns.

Answer 1:

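We can do this with table from base R. Build a vector of column prefixes (c1., c2., ...) aligned with every cell using col(dat), unlist the dataset into a vector of values, and flag the non-NA cells with is.na. Then cross-tabulate the row index of each non-NA cell against its prefix pasted to its value, and convert the resulting table to a data.frame.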

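# Column prefix ("c1.", "c2.", ...) for every cell, in column-major order
nm1 <- paste0('c', seq_along(dat), '.')[col(dat)]
# Cell values in the same column-major order
v1 <- unlist(dat)
# TRUE for the cells that are not NA
i1 <- !is.na(v1)
# Cross-tabulate row index against "prefix + value", then convert to a data.frame
newdat <- as.data.frame.matrix(table(row(dat)[i1],
                         paste0(nm1[i1], v1[i1])))
newdat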
#     c1.200 c1.310 c2.444 c2.876 c3.444
#  1       1      0      1      0      1
#  2       0      1      0      0      1
#  3       0      1      0      0      1
#  4       0      0      1      0      1
#  5       1      0      1      0      1
#  6       1      0      0      0      1
#  7       0      1      1      0      1
#  8       0      1      0      1      1
#  9       0      1      0      1      1
#  10      0      0      0      1      1
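
At the size mentioned in the question (117,000 rows by 10,000 columns), a dense 0/1 table may not fit comfortably in memory. The following is only a sketch of one possible alternative, not part of the original answer: it assumes the Matrix package is available and stores just the non-zero entries in a sparse matrix. The names m, keep, lab, f and sp are made up for this illustration.

```r
library(Matrix)

# Same example data as used in this thread
dat <- data.frame(V2 = c(200L, 310L, 310L, NA, 200L, 200L, 310L, 310L, 310L, NA),
                  V3 = c(444L, NA, NA, 444L, 444L, NA, 444L, 876L, 876L, 876L),
                  V4 = rep(444L, 10))

m    <- as.matrix(dat)
keep <- !is.na(m)                                # non-NA cells
lab  <- paste0("c", col(m)[keep], ".", m[keep])  # labels such as "c1.200"
f    <- factor(lab)                              # one level per output column

# One entry of 1 per non-NA cell: row = original row, column = label level
sp <- sparseMatrix(i = row(m)[keep], j = as.integer(f), x = 1,
                   dims = c(nrow(m), nlevels(f)),
                   dimnames = list(NULL, levels(f)))
sp
```

If a regular data.frame is needed afterwards, as.data.frame(as.matrix(sp)) converts it back, at the cost of materialising the dense table again.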

Answer 2:

We can do this using dplyr and tidyr:

library(dplyr)
library(tidyr)
newdat <- dat %>% setNames(paste0("c", 1:ncol(.), ".")) %>%
        mutate(row = row_number(), n = 1) %>%
        gather(key, val, -row, -n) %>%
        na.omit %>%
        unite(keyval, key, val, sep = "") %>%
        spread(keyval, n, fill = 0)

   row c1.200 c1.310 c2.444 c2.876 c3.444
1    1      1      0      1      0      1
2    2      0      1      0      0      1
3    3      0      1      0      0      1
4    4      0      0      1      0      1
5    5      1      0      1      0      1
6    6      1      0      0      0      1
7    7      0      1      1      0      1
8    8      0      1      0      1      1
9    9      0      1      0      1      1
10  10      0      0      0      1      1
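
In current tidyr releases gather and spread are superseded by pivot_longer and pivot_wider. As a hedged sketch (assuming tidyr 1.1.0 or later, so that names_sort and a scalar values_fill are accepted), the same reshaping could look roughly like this; it is an adaptation, not the original answer's code:

```r
library(dplyr)
library(tidyr)

# Same example data as used in this thread
dat <- data.frame(V2 = c(200L, 310L, 310L, NA, 200L, 200L, 310L, 310L, 310L, NA),
                  V3 = c(444L, NA, NA, 444L, 444L, NA, 444L, 876L, 876L, 876L),
                  V4 = rep(444L, 10))

# Rename columns to the prefixes "c1.", "c2.", ...
names(dat) <- paste0("c", seq_along(dat), ".")

newdat <- dat %>%
  mutate(row = row_number()) %>%
  # Long format: one line per non-NA cell
  pivot_longer(-row, names_to = "key", values_to = "val",
               values_drop_na = TRUE) %>%
  # Combine prefix and value into the final column name, with indicator n = 1
  mutate(keyval = paste0(key, val), n = 1) %>%
  select(row, keyval, n) %>%
  # Back to wide format, filling unobserved combinations with 0
  pivot_wider(names_from = keyval, values_from = n,
              values_fill = 0, names_sort = TRUE) %>%
  arrange(row)
newdat
```

The result is a tibble rather than a plain data.frame; wrap it in as.data.frame() if that matters downstream.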

I used this dataset, as dat:

structure(list(V2 = c(200L, 310L, 310L, NA, 200L, 200L, 310L, 
310L, 310L, NA), V3 = c(444L, NA, NA, 444L, 444L, NA, 444L, 876L, 
876L, 876L), V4 = c(444L, 444L, 444L, 444L, 444L, 444L, 444L, 
444L, 444L, 444L)), .Names = c("V2", "V3", "V4"), class = "data.frame", row.names = c(NA, 
-10L))

To write the result to a file, use write.csv(newdat, file = "yourfilename.csv").

Source: https://stackoverflow.com/questions/32728396/how-to-covert-character-within-each-column-as-sub-column-without-duplication

Tags

reshape

tidyr