Aggregating based on “near” row values

僤鯓⒐⒋嵵緔 提交于 2019-12-12 17:13:10

问题


I have a very messy dataframe (webscraped) that unfortunately has many double and even triple entries in it. Most of the dataframe looks like this:

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1" is my ID variable that I need to use to aggregate. My goal is to have a dataframe that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

However, the main problem is, that not the whole dataframe looks like this. In fact, I have other parts that look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g   wrong.g
8    g    <NA>      <NA>

And other variations. In the end, every ID should have one row with the correct and right var2 and var3 in it. At this point, I get lost: My var1 is not unique. However, I know that duplicate IDs that "belong" together are grouped in the dataframe (as seen in my examples); e.g. there might be another "a" in row 4102 and 4103.

What I think would be the way to go is using aggregate with var1 as ID,but in addition telling R that aggregate should just check +-2 rows of var1 when doing so. Any ideas how to code this?

Thanks!


回答1:


Here is a method using data.table

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

and

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

The idea in var2[!is.na(var2)][1] for example, to take the first non-missing value from var2. If all values are missing, then this returns NA. This operation is performed for both variables by var1.

If you have more than two variables, you might switch to lapply. For example, the following.

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

In an instance where more than one var1 has a valid value, and this is indicated by a non-missing var2, then you can reach the intended result with a join.

The data from the comment,

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Then, with this data,

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

Here, all non-missing var2 observations by var1 are merged onto the original data set.




回答2:


If var2 and var3 have only one unique value for each level of var1, then:

library(dplyr)

df = rbind(df1,df2)

df %>% group_by(var1) %>%
  summarise_all(funs(.[!is.na(.)][1]))
   var1    var2      var3
1     a right.a correct.a
2     b right.b correct.b
3     c right.c correct.c
4     d right.d correct.d
5     e right.e correct.e
6     f    <NA> correct.f
7     g right.g correct.g


来源:https://stackoverflow.com/questions/43100583/aggregating-based-on-near-row-values

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!