Pardon my newness to the R world, and thank you kindly in advance for your help.
I would like to analyze the data from an experiment.
The data comes in long format.
Using data.table, you can do:
library(data.table)
# assuming df is the long-format data.frame given under "Data" below
dt <- as.data.table(df)
dcast(dt, User_id + location + age ~ Item, value.var = "Resp", fill = 0L)
User_id location age A B C D E G H
1: 1 CA 22 1 -1 -1 1 -1 0 0
2: 2 MD 27 -1 1 1 0 1 -1 -1
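Note that the output above has no gender column, because gender is not on the left-hand side of the formula. A minimal sketch of keeping it, using a small made-up table (the name `long` and its values are purely illustrative, not the asker's data):

```r
library(data.table)

# Hypothetical toy data in long format (illustrative values only)
long <- data.table(
  User_id = c(1L, 1L, 2L, 2L),
  gender  = c("M", "M", "F", "F"),
  Item    = c("A", "B", "A", "C"),
  Resp    = c(1, -1, -1, 1)
)

# Every id column to keep goes on the left of ~; each distinct Item
# becomes a column, and combinations with no Resp are filled with 0
wide <- dcast(long, User_id + gender ~ Item, value.var = "Resp", fill = 0)
wide
```

Any column that appears neither in the formula nor in value.var is dropped from the result, which is why gender has to be listed explicitly.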
There’s a package called tidyr that makes melting and reshaping data frames much easier. In your case, you can use tidyr::spread straightforwardly:
result = spread(df, Item, Resp)
This will, however, fill missing entries with NA:
User_id location age gender A B C D E G H
1 1 CA 22 M 1 -1 -1 1 -1 NA NA
2 2 MD 27 F -1 1 1 NA 1 -1 -1
You can fix this by replacing them:
result[is.na(result)] = 0
result
# User_id location age gender A B C D E G H
# 1 1 CA 22 M 1 -1 -1 1 -1 0 0
# 2 2 MD 27 F -1 1 1 0 1 -1 -1
… or by using the fill argument:
result = spread(df, Item, Resp, fill = 0)
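In tidyr 1.0.0 and later, spread is superseded by pivot_wider, though spread still works. A rough equivalent of the call above might look like the following sketch, shown on a small made-up table (`long` and its values are illustrative only):

```r
library(tidyr)

# Hypothetical toy data in long format (illustrative values only)
long <- data.frame(
  User_id = c(1L, 1L, 2L, 2L),
  Item    = c("A", "B", "A", "C"),
  Resp    = c(1, -1, -1, 1)
)

# names_from supplies the new column names, values_from the cell values;
# values_fill replaces combinations that are missing from the input
wide <- pivot_wider(long, names_from = Item, values_from = Resp,
                    values_fill = list(Resp = 0))
wide
```

The named-list form of values_fill is used here because it should be accepted by both older and newer tidyr releases.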
For completeness’ sake, the other way round (i.e. reproducing the original data.frame) works via gather (this is usually known as “melting”):
gather(result, Item, Resp, A : H)
The last argument here tells gather which columns to gather (and it supports the concise range syntax).
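With newer tidyr versions, the same melting step can be written with pivot_longer, which supersedes gather. A minimal sketch on a made-up wide table (`wide` and its values are illustrative only):

```r
library(tidyr)

# Hypothetical wide table (illustrative values only)
wide <- data.frame(
  User_id = c(1L, 2L),
  A = c(1, -1),
  B = c(-1, 0),
  C = c(0, 1)
)

# cols accepts the same range syntax; the former column names go into
# Item and the cell values into Resp
long <- pivot_longer(wide, cols = A:C, names_to = "Item", values_to = "Resp")
long
```

One caveat: cells that were filled with 0 while widening come back as explicit rows, so the round trip only matches the original long data if those rows are dropped again.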
Here's the always elegant stats::reshape version:
(newdf <- reshape(df, direction = "wide", timevar = "Item", idvar = names(df)[1:4]))
# User_id location age gender Resp. A Resp. B Resp. C Resp. D Resp. E Resp. G Resp. H
# 1 1 CA 22 M 1 -1 -1 1 -1 NA NA
# 6 2 MD 27 F -1 1 1 NA 1 -1 -1
Missing values get filled with NA in reshape(), and the names are not what we want, so we'll need to do a bit more work. Here we can change the names and replace the NAs with zero in the same line to arrive at your desired result.
replace(setNames(newdf, sub(".* ", "", names(newdf))), is.na(newdf), 0)
# User_id location age gender A B C D E G H
# 1 1 CA 22 M 1 -1 -1 1 -1 0 0
# 6 2 MD 27 F -1 1 1 0 1 -1 -1
Of course, the code would be more legible if we broke this up into two separate statements. Also, note that there is no F in Item in your original data, hence the difference between this output and yours.
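For illustration, the two-statement version could look roughly like this. The small `newdf` below is a made-up stand-in for the reshape() result, so that the snippet runs on its own:

```r
# Hypothetical stand-in for the reshape() output (illustrative values only)
newdf <- data.frame(
  User_id   = 1:2,
  `Resp. A` = c(1, -1),
  `Resp. G` = c(NA, -1),
  check.names = FALSE  # keep the "Resp. A" style names as they are
)

# Step 1: drop everything up to and including the last space in each name
names(newdf) <- sub(".* ", "", names(newdf))

# Step 2: replace the remaining NAs with 0
newdf[is.na(newdf)] <- 0
newdf
```

Unlike the one-liner, this modifies newdf in place instead of returning a new object.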
Data:
df <- structure(list(User_id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c(" CA", " MD"), class = "factor"), age = c(22L,
22L, 22L, 22L, 22L, 27L, 27L, 27L, 27L, 27L, 27L), gender = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" F", " M"
), class = "factor"), Item = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 5L, 6L, 7L), .Label = c(" A", " B", " C", " D", " E",
" G", " H"), class = "factor"), Resp = c(1, -1, -1, 1, -1, -1,
1, 1, 1, -1, -1)), .Names = c("User_id", "location", "age", "gender",
"Item", "Resp"), class = "data.frame", row.names = c(NA, -11L
))