Remove NA columns in a list of dataframes

非 Y 不嫁゛ 提交于 2019-12-23 01:36:27

问题


I am having some trouble cleaning data that I imported from Excel with readxl. readxl created a large list of objects with classes = c('data.frame', tbl_df, tbl) (I would also like to know about why/how it has multiple classes assigned to it). Each of those objects is one of the sheets in the original Excel workbook. The problem is that each of those objects (sheets) may have many columns entirely filled with NAs. I have scanned through stackoverflow and found some similar problems and tried to apply the given solutions like here and here (the first one is the most like my problem). However when I try this:

lapply(x, function(y) y[, !is.na(y)])

I get the following error:

Error in `[.data.frame`(y, , !is.na(y)) : undefined columns selected

I've also tried this:

lapply(x, function(y) y[!is.na(y)]

but it reduces all of my dataframes to only the first column. I think I know it's something to do with my dataframe-within-list syntax. I've experimented with different iterations of y[[]][] and even recently found this interesting pattern in lapply: lapply(x, "[[", y), but couldn't make it work.

Here are the first two objects in my list of dataframes (any hints on how to be more efficient in dput-ing this data are also appreciated). As you can see, the first object has no NA columns, whereas the second has 5 NA columns. I would like to remove those 5 NA columns, but do so for all objects in my list.

Any help is greatly appreciated!

dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612, 
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89, 
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173, 
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level", 
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl", 
"data.frame"))

dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688, 
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031, 
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651, 
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL", 
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =    
c("tbl_df", "tbl", "data.frame"))

回答1:


How about this:

lapply(df_list, function(df) df[, colSums(is.na(df)) == 0])

Or maybe:

lapply(df_list, function(df) df[, colSums(is.na(df)) < nrow(df)])

if you want to allow some, but not all rows to be NA



来源:https://stackoverflow.com/questions/33419200/remove-na-columns-in-a-list-of-dataframes

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!