Question
I am working on a raw dataset that looks something like this:
df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha",
                          "Beta", "Beta", "Beta", "Beta"),
                 "treatment" = LETTERS[seq(from = 1, to = 8)],
                 "Year" = c(1970, 1970, 1980, 1990, 1970, 1980,
                            1980, 1990),
                 "Val" = c(0, 0, 0, 1, 0, 1, 0, 1),
                 "Val2" = c(0, 2.34, 1.3, 0, 0, 2.34, 3.2, 1.3))
The data is a bit dirty as I have multiple observations for each ID and Year identifier - e.g. I have 2 different rows for Alpha in 1970. The same holds for Beta in 1980.
The issue is that the variables of interest, Val and Val2, have different values in the duplicated rows (duplicated in terms of ID/Year).
I would like to find a concise way to produce the following final dataframe:
final <- data.frame("ID" = c("Alpha", "Alpha", "Alpha",
                             "Beta", "Beta", "Beta"),
                    "treatment" = c("B", "C", "D", "E", "G", "H"),
                    "Year" = c(1970, 1980, 1990, 1970,
                               1980, 1990),
                    "Val" = c(0, 0, 1, 0, 0, 1),
                    "Val2" = c(2.34, 1.3, 0, 0, 3.2, 1.3),
                    "del_treat" = c("A", NA, NA, NA, "F", NA),
                    "del_Val" = c(0, NA, NA, NA, 1, NA),
                    "del_Val2" = c(0, NA, NA, NA, 2.34, NA))
The logic is the following:
1) I want to have only one obs for every ID/year
2) I want to retain only the observation with the higher value in the Val2 column.
3) I would like to store the deleted rows' values in separate columns (del_treat, del_Val and del_Val2) to keep track of what I am deleting.
To illustrate: in df there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val2 takes the values 0 and 2.34, and in the final data frame only 2.34 is retained. However, the values of treatment A are reported in the newly created columns del_treat, del_Val and del_Val2.
I am able to select rows based on the Val2 value with
setDT(df)[order(-Val2)][, .SD[1, ], by = .(ID, Year)]
but I would like to find a concise way to also 'store' the deleted values in the new columns.
Answer 1:
Using data.table, a dcast based on rowid(ID, Year) after ordering by Val2 descending gets you there, with the exception of column names. The "_1" columns are the "keep" columns, and the "_2" columns are the "del" columns.
library(data.table)
setDT(df)
setorder(df, ID, Year, -Val2)
out <-
  dcast(df, ID + Year ~ rowid(ID, Year), value.var = c('treatment', 'Val', 'Val2'))
out
#       ID Year treatment_1 treatment_2 Val_1 Val_2 Val2_1 Val2_2
# 1: Alpha 1970           B           A     0     0   2.34   0.00
# 2: Alpha 1980           C        <NA>     0    NA   1.30     NA
# 3: Alpha 1990           D        <NA>     1    NA   0.00     NA
# 4:  Beta 1970           E        <NA>     0    NA   0.00     NA
# 5:  Beta 1980           G           F     0     1   3.20   2.34
# 6:  Beta 1990           H        <NA>     1    NA   1.30     NA
We can change the names to match yours; the only difference is that the del columns keep a number at the end. That number would be useful if there is a possibility of more than 2 rows per group.
setnames(out, function(x) gsub('(.*)_1', '\\1', x))
setnames(out, function(x) gsub('(.*_\\d+)', 'del_\\1', x))
out
#       ID Year treatment del_treatment_2 Val del_Val_2 Val2 del_Val2_2
# 1: Alpha 1970         B               A   0         0 2.34       0.00
# 2: Alpha 1980         C            <NA>   0        NA 1.30         NA
# 3: Alpha 1990         D            <NA>   1        NA 0.00         NA
# 4:  Beta 1970         E            <NA>   0        NA 0.00         NA
# 5:  Beta 1980         G               F   0         1 3.20       2.34
# 6:  Beta 1990         H            <NA>   1        NA 1.30         NA
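To make the remark about more than 2 rows per group concrete, here is a minimal sketch on a made-up table (df3 is hypothetical and not part of the question's data) in which Alpha/1970 has three rows. It follows the same order-then-dcast idea, but uses anchored patterns ("_1$" and a full-name match) for the renaming, which is my own variation so that a suffix such as "_10" would not be caught by the "_1" rule.

```r
library(data.table)

# Hypothetical data: Alpha/1970 appears three times, Beta/1980 once
df3 <- data.table(
  ID        = c("Alpha", "Alpha", "Alpha", "Beta"),
  treatment = c("A", "B", "C", "D"),
  Year      = c(1970, 1970, 1970, 1980),
  Val       = c(0, 0, 1, 1),
  Val2      = c(0, 2.34, 1.3, 3.2)
)

# Highest Val2 first within each ID/Year, then spread rows into columns
setorder(df3, ID, Year, -Val2)
out3 <- dcast(df3, ID + Year ~ rowid(ID, Year),
              value.var = c("treatment", "Val", "Val2"))

# Rank 1 is the kept row: drop its suffix; every other rank gets a del_ prefix
nm <- names(out3)
nm <- sub("_1$", "", nm)
nm <- sub("^(.*_[0-9]+)$", "del_\\1", nm)
setnames(out3, nm)

names(out3)
```

Each extra duplicate simply adds another set of numbered del_ columns (del_treatment_3, del_Val_3, del_Val2_3 here), so nothing is lost however many rows a group has.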
Answer 2:
Here is one option with dplyr. After grouping by 'ID' and 'Year', create a logical column ('ind') that flags the row holding the max of 'Val2'. Using that, create two columns corresponding to 'Val' and 'Val2', with 'del' as suffix (Val_del and Val2_del), for the values that are eliminated, as well as a 'del_treat' column for the 'treatment' that is dropped. Then filter the rows based on 'ind' and ungroup.
library(dplyr)
df %>%
  group_by(ID, Year) %>%
  mutate(ind = Val2 == max(Val2) & !is.na(Val2)) %>%
  mutate_at(vars(matches('Val')),
            list(del = ~ if(any(!ind)) .[!ind] else NA_real_)) %>%
  mutate(del_treat = if(any(!ind)) treatment[!ind] else NA_character_) %>%
  filter(ind) %>%
  ungroup %>%
  select(-ind)
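Note that mutate_at() with list(del = ...) names the new columns Val_del and Val2_del rather than del_Val and del_Val2. As a hedged variation (not the answerer's code), the sketch below swaps in across() with .names = "del_{.col}", which assumes dplyr 1.0.0 or later, so the column names match the requested final data frame. Like the original, it assumes at most two rows per ID/Year and no ties on Val2.

```r
library(dplyr)

# The question's example data
df <- data.frame(
  ID        = c("Alpha", "Alpha", "Alpha", "Alpha",
                "Beta", "Beta", "Beta", "Beta"),
  treatment = LETTERS[1:8],
  Year      = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
  Val       = c(0, 0, 0, 1, 0, 1, 0, 1),
  Val2      = c(0, 2.34, 1.3, 0, 0, 2.34, 3.2, 1.3)
)

res <- df %>%
  group_by(ID, Year) %>%
  # TRUE for the row to keep (highest Val2 in its ID/Year group)
  mutate(ind = Val2 == max(Val2) & !is.na(Val2)) %>%
  # Copy the dropped row's values into del_Val / del_Val2 / del_treat
  mutate(
    across(c(Val, Val2),
           ~ if (any(!ind)) .x[!ind] else NA_real_,
           .names = "del_{.col}"),
    del_treat = if (any(!ind)) treatment[!ind] else NA_character_
  ) %>%
  filter(ind) %>%
  ungroup() %>%
  select(-ind)

res
```

With three or more rows in a group, .x[!ind] would no longer have length 1 or the group size and mutate() would stop with an error, so the data.table reshape above is the safer choice in that case.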
Source: https://stackoverflow.com/questions/59077618/duplicated-rows-select-rows-based-on-criteria-and-store-duplicated-values