How to replace NA with mean by subset in R (impute with plyr?)

Anonymous (unverified), posted 2019-12-03 09:05:37

Question:

I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey type. I want to keep the dataframe and just add imputed columns (length2, width2), mainly because each row also has columns with data on the date and location where the salamander was collected. I could fill in each NA with a random selection from the measured individuals, but for the sake of argument let's assume I just want to replace each NA with the mean.

For example, imagine I have a dataframe that looks something like:

id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA

In reality I have more columns, about 25 different taxa, and ~30,000 prey items in total. It seems like the plyr package might be ideal for this, but I just can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.

Not that I know what I'm doing but I'll try to create a small dataset to play with if it helps.

exampleDF 
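The definition of exampleDF is cut off in this copy. A minimal sketch that rebuilds the six example rows from the table in the question, assuming id, length and width are numeric and taxa is a character column (the real data has more columns):

```r
# Hypothetical reconstruction of the example data from the question's table.
exampleDF <- data.frame(
  id     = 101:106,
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

# Rows 104 and 106 are the unmeasured individuals.
print(exampleDF)
```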

Here are a few things I've tried (that haven't worked):

# mean imputation to recode NA in length and width with means (could do random imputation but unnecessary here)
mean.imp 

another attempt:

imp.mean 

Any suggestions using plyr or not?

Answer 1:

This is not my own technique; I saw it on the boards a while back:

dat 
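The code for this step is truncated here, and the later snippets call a helper named impute.mean that this copy never defines. A sketch of what a plyr version could look like, assuming impute.mean simply replaces each NA in a vector with the mean of the non-missing values, and assuming dat holds the example rows from the question:

```r
library(plyr)

# Example data from the question (assumed layout).
dat <- data.frame(
  id     = 101:106,
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

# Assumed helper: replace NA with the mean of the observed values in x.
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Split by taxa, impute within each piece, then recombine.
dat2 <- ddply(dat, ~ taxa, transform,
              length = impute.mean(length),
              width  = impute.mean(width))

# ddply returns rows grouped by taxa; restore the original id order.
dat2 <- dat2[order(dat2$id), ]
print(dat2)
```

Writing the results to new names (for example length2 and width2 inside transform) would keep the original columns alongside the imputed ones, which is what the question asks for.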

Edit: A non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {     for (j in which(is.na(dat[, i]))) {         dat[j, i] 
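The loop body is cut off after the assignment target. One plausible completion, assuming the intent is to assign the mean of the same column over all rows that share the missing row's taxa (the right-hand side below is a guess at that intent, not recovered text):

```r
# Example data from the question (assumed layout).
dat <- data.frame(
  id     = 101:106,
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

# Loop over numeric columns, then over the rows that are NA in that column.
for (i in which(sapply(dat, is.numeric))) {
  for (j in which(is.na(dat[, i]))) {
    # Assumed right-hand side: mean of column i within row j's taxa.
    same_taxa <- dat$taxa == dat[j, "taxa"]
    dat[j, i] <- mean(dat[same_taxa, i], na.rm = TRUE)
  }
}

print(dat)
```

Filling rows one at a time does not shift the group mean, because each filled value equals the mean of the values already present in that group.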

Edit: Many moons later, here is a data.table and dplyr approach:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)
    )


Answer 2:

Before answering, I want to say that I am a beginner in R, so please let me know if you think my answer is wrong.

Code:

DF[is.na(DF$length), "length"] 

and apply the same for width.

DF stands for name of the data.frame.
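The right-hand side of that assignment is missing from this copy, so it is unclear whether it used an overall mean or a per-taxa mean. A sketch of how the same indexing pattern could be completed so that it respects taxa, using base R's ave() to compute a group mean for every row (the use of ave() and the loop over both columns are assumptions, not the answerer's code):

```r
# Example data from the question (assumed layout).
DF <- data.frame(
  id     = 101:106,
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

for (col in c("length", "width")) {
  # Record which rows are missing before anything is overwritten.
  missing <- is.na(DF[[col]])
  # Per-row group mean: ave() returns a vector as long as the column.
  group_mean <- ave(DF[[col]], DF$taxa,
                    FUN = function(x) mean(x, na.rm = TRUE))
  # Same pattern as in the answer: index the NA rows, assign the means.
  DF[missing, col] <- group_mean[missing]
}

print(DF)
```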

Thanks, Parthi



Answer 3:

Expanding on @Tyler Rinker's solution, suppose features are the columns to impute. In this case features . Then using data.table the solution becomes:

library(data.table)
setDT(dat)

dat[, (features) := lapply(.SD, impute.mean), by = taxa, .SDcols = features]
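For a self-contained run of this approach, the sketch below assumes features is a character vector naming the two measurement columns and that impute.mean is the NA-to-mean helper used in Answer 1; both definitions are assumptions, since neither appears in full in this copy:

```r
library(data.table)

# Example data from the question (assumed layout).
dat <- data.table(
  id     = 101:106,
  taxa   = c("collembola", "mite", "mite", "collembola", "collembola", "mite"),
  length = c(2.1, 0.9, 1.1, NA, 1.5, NA),
  width  = c(0.9, 0.7, 0.8, NA, 0.5, NA)
)

# Assumed helper: replace NA with the mean of the observed values in x.
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Assumed value: the columns to impute.
features <- c("length", "width")

# .SD holds only the .SDcols columns for each taxa group;
# := overwrites those columns in place, by reference.
dat[, (features) := lapply(.SD, impute.mean), by = taxa, .SDcols = features]

print(dat)
```

Adding more measurement columns then only requires extending features, rather than chaining one := call per column.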

