reduce row to unique items

I have the following data frame:

test <- structure(list(
     y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
     y2003 = c("freshman","junior","junior","sophomore","sophomore","senior"),
     y2004 = c("junior","sophomore","sophomore","senior","senior",NA),
     y2005 = c("senior","senior","senior",NA, NA, NA)), 
              .Names = c("2002","2003","2004","2005"),
              row.names = c(c(1:6)),
              class = "data.frame")
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   <NA>
5 sophomore sophomore    senior   <NA>
6    senior    senior      <NA>   <NA>

I would like to reshape the data so that each row keeps only its distinct steps, in order, as in

result <- structure(list(
 y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"),
 y2003 = c("junior","junior","junior","senior","senior",NA),
 y2004 = c("senior","sophomore","sophomore",NA,NA,NA),
 y2005 = c(NA,"senior","senior",NA, NA, NA)), 
               .Names = c("1","2","3","4"),
               row.names = c(c(1:6)),
               class = "data.frame")

> result
          1      2         3      4
1  freshman junior    senior   <NA>
2  freshman junior sophomore senior
3  freshman junior sophomore senior
4 sophomore senior      <NA>   <NA>
5 sophomore senior      <NA>   <NA>
6    senior   <NA>      <NA>   <NA>

I know that if I treated each row as a vector, I could do something like

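careerrow <- c(1,2,3,3,4)
# pair each element with its successor, indexing by position rather than by value
pairz <- lapply(seq_len(length(careerrow) - 1), function(i){c(careerrow[i],careerrow[i+1])})
# keep an element when it differs from its successor; the last element is always kept
uniquepairz <- careerrow[c(sapply(pairz,function(x){x[1]!=x[2]}), TRUE)]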

My difficulty is applying that row-wise to my data frame. I assume lapply is the way to go, but so far I have been unable to make it work.

If your aim is to calculate the total number of each pathway, you could use something like this. It uses data.table because of the nice way it handles lists as elements within a data.table (data.frame-like) object.

I am using !duplicated(...) to remove the duplicates, as this is slightly more efficient than unique.

library(data.table)
library(reshape2)
# make the rownames a column 
test$id <- rownames(test)
# put in long format
DT <- as.data.table(melt(test,id='id'))
# get the unique steps and concatenate into a unique identifier for each pathway
DL <- DT[!is.na(value), {
  .steps <- value[!duplicated(value)]
  stepid <- paste(.steps, collapse = '.')
  list(steps = list(.steps), stepid = stepid)
}, by = id]
##    id                            steps                           stepid
## 1:  1           freshman,junior,senior           freshman.junior.senior
## 2:  2 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 3:  3 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 4:  4                 sophomore,senior                 sophomore.senior
## 5:  5                 sophomore,senior                 sophomore.senior
## 6:  6                           senior                           senior

# count the number per path

DL[, .N, by = stepid]
##                              stepid N
## 1:           freshman.junior.senior 1
## 2: freshman.junior.sophomore.senior 2
## 3:                 sophomore.senior 2
## 4:                           senior 1
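
On the choice of !duplicated(...) over unique mentioned above: for a plain character vector the two forms return the same values in the same order, so either can be used in the step-extraction line. A minimal check in base R, where the vector steps is made up for illustration and is not part of the data above:

```r
# Hypothetical example vector, not taken from the data above
steps <- c("freshman", "freshman", "junior", "senior", "senior")

via_duplicated <- steps[!duplicated(steps)]
via_unique     <- unique(steps)

# Both keep the first occurrence of each value, in order of appearance
identical(via_duplicated, via_unique)
```

Any speed difference between the two is likely to be small at this data size, so readability is a reasonable tie-breaker.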

lapply, when passed a data.frame, operates on its columns. That's because a data.frame is a list whose elements are the columns. Instead of lapply, you can use apply with MARGIN=1:

unique.padded <- function(x) {
   uniq <- unique(x)
   # pad with NA so every row returns a vector of the same length
   c(uniq, rep(NA, length(x) - length(uniq)))
}

t(apply(test, 1, unique.padded))

#   [,1]        [,2]     [,3]        [,4]    
# 1 "freshman"  "junior" "senior"    NA      
# 2 "freshman"  "junior" "sophomore" "senior"
# 3 "freshman"  "junior" "sophomore" "senior"
# 4 "sophomore" "senior" NA          NA      
# 5 "sophomore" "senior" NA          NA      
# 6 "senior"    NA       NA          NA
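
The result above is a character matrix. If a data frame shaped like the desired result is needed, one possible follow-up, sketched here with base R only, is to convert the matrix and name the columns by step number. The helper name unique_padded is an assumption of this sketch (written without the dot so it does not look like an S3 method), and it drops NA before calling unique so that padding is the only source of NA. The data is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
test <- data.frame(
  "2002" = c("freshman", "freshman", "freshman", "sophomore", "sophomore", "senior"),
  "2003" = c("freshman", "junior", "junior", "sophomore", "sophomore", "senior"),
  "2004" = c("junior", "sophomore", "sophomore", "senior", "senior", NA),
  "2005" = c("senior", "senior", "senior", NA, NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: distinct non-NA values, padded with NA to the row length
unique_padded <- function(x) {
  uniq <- unique(x[!is.na(x)])
  c(uniq, rep(NA_character_, length(x) - length(uniq)))
}

# apply() returns one column per input row, hence the transpose
steps <- as.data.frame(t(apply(test, 1, unique_padded)), stringsAsFactors = FALSE)
names(steps) <- as.character(seq_along(steps))  # columns "1", "2", "3", "4"
steps
```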

Edit: I saw your comment about your final goal. I would do something like this:

table(sapply(apply(test, 1, function(x)unique(na.omit(x))),
             paste, collapse = "_"))

#           freshman_junior_senior freshman_junior_sophomore_senior 
#                                1                                2 
#                           senior                 sophomore_senior 
#                                1                                2
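
One caveat, given the pairwise comparison sketched in the question: unique() removes every repeat, not only consecutive ones, so a hypothetical path such as freshman, sophomore, sophomore, freshman would lose its last step. If only back-to-back repeats should collapse, a sketch using base R's rle() (run-length encoding) could look like this. The helper collapse_runs and the vector path are made up for illustration.

```r
# Hypothetical helper: keep one value per run of identical consecutive entries
collapse_runs <- function(x) {
  x <- x[!is.na(x)]  # rle() treats each NA as its own run, so drop them first
  rle(x)$values
}

# Hypothetical row in which a step recurs after a different step
path <- c("freshman", "sophomore", "sophomore", "freshman", NA)

runs_only <- collapse_runs(path)         # the later "freshman" is kept
all_dups  <- unique(path[!is.na(path)])  # the later "freshman" is dropped
```

For the sample data in the question the two approaches give the same paths, since no step recurs after a different one.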

Source: https://stackoverflow.com/questions/12417467/reduce-row-to-unique-items

Tags: dataframe, data.table, lapply