How to fill NAs with LOCF by factors in data frame, split by country

谎友^ 2020-12-08 05:01

I have the following (simplified) data frame, in which the country variable is a factor and the value variable contains missing values:

country value
AUT     NA
AUT     5
AUT     NA
AUT     NA
GER     NA
GER     NA
GER     7
GER     NA
GER     NA

8 Answers
  • 2020-12-08 05:38

    I'm a little late to this conversation, but here is a data.table way, which will be much faster for larger data sets:

    library(zoo)
    library(data.table)
    
    # Convert to data table
    setDT(data)
    
    data[, value := na.locf(value, na.rm = FALSE), by = country]
    
    data
       country  value
    1:     AUT     NA
    2:     AUT      5
    3:     AUT      5
    4:     AUT      5
    5:     GER     NA
    6:     GER     NA
    7:     GER      7
    8:     GER      7
    9:     GER      7
    
    # And if you want to convert "data" back to a data frame...
    setDF(data)
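
    If you would rather not depend on zoo, data.table has its own fill function. The following is a minimal sketch, assuming a data.table version recent enough to include nafill and a numeric value column; the example data is the one used elsewhere in this thread:

    ```r
    library(data.table)

    # Example data from this thread
    data <- data.table(
      country = c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
      value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
    )

    # nafill(type = "locf") carries the last observation forward within each group;
    # leading NAs have nothing to carry forward and stay NA
    data[, value := nafill(value, type = "locf"), by = country]
    ```

    Note that nafill is documented for numeric columns, so for other column types the zoo version above may be the safer choice.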
    
  • 2020-12-08 05:38

    A combination of the packages dplyr and imputeTS can do the job.

    library(dplyr)
    library(imputeTS)
    # Note: newer imputeTS releases spell this na_locf(value, na_remaining = "keep")
    data %>% group_by(country) %>% 
      mutate(value = na.locf(value, na.remaining="keep"))
    

    With the na.remaining parameter of the imputeTS na.locf function, you can additionally choose what to do with the NAs that remain after the fill (for LOCF, the leading NAs of each group).

    These are the options:

    • "keep" - return the series with NAs
    • "rm" - remove remaining NAs
    • "mean" - replace remaining NAs by overall mean
    • "rev" - perform nocb / locf from the reverse direction

    By choosing "mean" you would for example get a result with 7 for every GER in the specific example.
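
    For comparison, a similar grouped fill can be sketched with tidyr::fill, which is a different function from the imputeTS one above. This assumes a tidyr version that supports the "downup" direction; "down" behaves like plain LOCF, while "downup" additionally fills the remaining leading NAs from the other direction, roughly what "rev" is described as doing:

    ```r
    library(dplyr)
    library(tidyr)

    # Example data from this thread
    data <- data.frame(
      country = c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
      value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
    )

    # Plain LOCF within each country; leading NAs remain
    locf_only <- data %>%
      group_by(country) %>%
      fill(value, .direction = "down") %>%
      ungroup()

    # LOCF first, then fill whatever is left from the other direction
    locf_then_nocb <- data %>%
      group_by(country) %>%
      fill(value, .direction = "downup") %>%
      ungroup()
    ```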

  • 2020-12-08 05:45

    You simply need to split by country, then apply either zoo::na.locf() or na.fill, filling to the right. Here is an example that explicitly shows the three-component argument syntax of na.fill:

    library(plyr)
    library(zoo)
    
    data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"), value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))
    
    # Matches na.locf on this data; with interior=NA, NAs between two observed values would stay NA
    na.fill.right <- function(...) { na.fill(..., list(left=NA,interior=NA,right="extend")) }
    
    ddply(data, .(country), na.fill.right)
    
      country value
    1     AUT  <NA>
    2     AUT     5
    3     AUT     5
    4     AUT     5
    5     GER  <NA>
    6     GER  <NA>
    7     GER     7
    8     GER     7
    9     GER     7
    
  • 2020-12-08 05:46

    If speed is a consideration, this unstack/stack solution is about 4 to 6 times faster than the others on my system, although it does entail a slightly longer line of code. It uses na.locf from zoo, and the result has the columns values and ind rather than value and country:

    library(zoo)
    stack(lapply(unstack(data, value ~ country), na.locf, na.rm = FALSE))
    

    Another approach is:

    transform(data, value = ave(value, country, FUN = na.locf0))
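
    If zoo is not available, the same ave() pattern also works with a small hand-written fill. The helper below, locf, is a hypothetical name and not part of any package; it is a minimal sketch that assumes an atomic vector:

    ```r
    # Hypothetical helper: carry the last non-NA value forward, keep leading NAs
    locf <- function(x) {
      seen <- cumsum(!is.na(x))        # number of non-NA values seen so far (0 = none yet)
      c(NA, x[!is.na(x)])[seen + 1]    # slot 1 is the NA used before the first observation
    }

    # Example data from this thread
    data <- data.frame(
      country = c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
      value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
    )

    # ave() applies locf to the values of each country and returns them in the original order
    data <- transform(data, value = ave(value, country, FUN = locf))
    ```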
    
  • 2020-12-08 05:51

    Here's a ddply solution. Try this:

    library(plyr)
    library(zoo)  # provides na.locf
    ddply(DF, .(country), na.locf)
      country value
    1     AUT  <NA>
    2     AUT     5
    3     AUT     5
    4     AUT     5
    5     GER  <NA>
    6     GER  <NA>
    7     GER     7
    8     GER     7
    9     GER     7
    

    Edit: The ddply help page says that

    .variables:  variables to split data frame by, 
    as quoted variables, a formula or character vector.
    

    so other ways to get what you want are:

    ddply(DF, "country", na.locf)
    ddply(DF, ~country, na.locf)
    

    Note that replacing .variables with DF$variable is not allowed, which is why you got an error when doing this.

    Here, DF is your data.frame.

  • 2020-12-08 05:55

    Split the data.frame with by and use na.locf on the subsets:

    do.call(rbind,by(data,data$country,na.locf))
    

    If you would like to remove the row names:

    do.call(rbind,unname(by(data,data$country,na.locf)))
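
    A variation on the same idea is to split only the value column rather than the whole data frame, which keeps the original row order and leaves value numeric. This is a sketch that assumes zoo is installed and uses the example data from this thread:

    ```r
    library(zoo)

    # Example data from this thread
    data <- data.frame(
      country = c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
      value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
    )

    # Split the values by country, fill each piece, then put them back in place
    pieces <- split(data$value, data$country)
    filled <- lapply(pieces, na.locf, na.rm = FALSE)
    data$value <- unsplit(filled, data$country)
    ```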
    