问题
I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:
> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
var1 var2 var3
1 a right.a correct.a
2 a <NA> correct.a
3 b right.b correct.b
4 b <NA> correct.b
5 c right.c correct.c
6 c <NA> correct.c
7 d right.d correct.d
8 d <NA> correct.d
"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:
var1 var2 var3
1 a right.a correct.a
2 b right.b correct.b
3 c right.c correct.c
4 d right.d correct.d
However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:
> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
var1 var2 var3
1 e <NA> correct.e
2 e <NA> correct.e
3 e right.e <NA>
4 f <NA> correct.f
5 f <NA> <NA>
6 g <NA> correct.g
7 g right.g wrong.g
8 g <NA> <NA>
And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.
What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?
Thanks!
回答1:
Here is a method using data.table
library(data.table)
setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
var1 V1 V2
1: a right.a correct.a
2: b right.b correct.b
3: c right.c correct.c
4: d right.d correct.d
and
setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
var1 V1 V2
1: e right.e correct.e
2: f NA correct.f
3: g right.g correct.g
The idea in var2[!is.na(var2)][1]
for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.
If you have more than two variables, you might switch to lapply
. For example, the following.
df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
var1 var2 var3
1: a right.a correct.a
2: b right.b correct.b
3: c right.c correct.c
4: d right.d correct.d
In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.
The data from the comment,
df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))
Then, with this data,
setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
var1 var2 var3
1: a right.a correct.a
2: a right.a1 correct.a1
3: b right.b correct.b
4: c right.c correct.c
5: d right.d correct.d
Here, all non-missing var2 observations by var1 are merged onto the original data set.
回答2:
If var2
and var3
have only one unique value for each level of var1
, then:
library(dplyr)
df = rbind(df1,df2)
df %>% group_by(var1) %>%
summarise_all(funs(.[!is.na(.)][1]))
var1 var2 var3 1 a right.a correct.a 2 b right.b correct.b 3 c right.c correct.c 4 d right.d correct.d 5 e right.e correct.e 6 f <NA> correct.f 7 g right.g correct.g
来源:https://stackoverflow.com/questions/43100583/aggregating-based-on-near-row-values