
Deduplicating/collapsing records in an R dataframe

☆樱花仙子☆ submitted on 2019-12-04 12:58:52
I have a dataset comprising various individuals, each with a unique id. An individual can appear multiple times in the dataset, but my understanding is that, apart from one or two of the roughly 80 variables per individual, the values should be identical across every entry for the same user id. I want to collapse the data if I can. My main obstacle is certain null values that I need to back-populate. I'm looking for a function that accomplishes the deduplication, looking something like this: # Build sample dataset df1 = data.frame(id
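A minimal sketch of one way to do this collapse with dplyr, assuming the asker's data really does agree across duplicate rows except where values are NA. The sample frame and column names here are hypothetical stand-ins, not the asker's actual data:

```r
library(dplyr)

# Hypothetical stand-in for the real data: two rows per id with scattered NAs
df1 <- data.frame(
  id    = c(1, 1, 2, 2),
  name  = c("a", NA, "b", "b"),
  score = c(NA, 10, 20, 20)
)

# Collapse to one row per id, taking the first non-NA value in each column;
# a column that is all NA within a group stays NA
collapsed <- df1 %>%
  group_by(id) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")
```

The indexing idiom `.x[!is.na(.x)][1]` works for any column type and returns NA when a group has no non-NA value, which keeps the back-population behaviour uniform across all 80 columns.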

R fill in NA with previous row value with condition

自闭症网瘾萝莉.ら submitted on 2019-12-04 12:30:14
Question: I need to fill NA rows with the previous row's value, but only while a criterion stays unchanged. As a simple example with days of the week, meals, and prices: Day = c("Mon", "Tues", "Wed", "Thus", "Fri", "Sat","Sun","Mon", "Tues", "Wed", "Thus", "Fri", "Sat","Sun") Meal = c("B","B","B","B","B","D","D","D","D","L","L", "L","L","L") Price = c(NA, 20, NA,NA,NA,NA,NA,15,NA,NA,10,10,NA,10) df = data.frame(Meal,Day ,Price ) df Meal Day Price 1 B Mon NA 2 B Tues 20 3 B Wed NA 4 B Thus NA 5 B Fri NA 6 D
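One way to sketch this with tidyr's `fill()`, grouping by *runs* of the criterion column (via `rle()`) so the carry-forward stops exactly when the criterion changes, even if the same value reappears later. This is an illustrative approach, not necessarily the answer given on the original thread:

```r
library(dplyr)
library(tidyr)

Day   <- c("Mon","Tues","Wed","Thus","Fri","Sat","Sun",
           "Mon","Tues","Wed","Thus","Fri","Sat","Sun")
Meal  <- c("B","B","B","B","B","D","D","D","D","L","L","L","L","L")
Price <- c(NA, 20, NA, NA, NA, NA, NA, 15, NA, NA, 10, 10, NA, 10)
df <- data.frame(Meal, Day, Price)

# Number each run of identical Meal values, so the fill is confined to a run
df$run <- with(rle(Meal), rep(seq_along(lengths), lengths))

filled <- df %>%
  group_by(run) %>%
  fill(Price, .direction = "down") %>%
  ungroup() %>%
  select(-run)
```

Leading NAs in a run stay NA, since there is no earlier value within that run to carry forward.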

data.table: do not compute NA groups in by

回眸只為那壹抹淺笑 submitted on 2019-12-04 11:54:12
Question: This question has a partial answer here, but that question is too specific and I'm not able to apply it to my own problem. I would like to skip a potentially heavy computation for the NA group when using by . library(data.table) DT = data.table(X = sample(10), Y = sample(10), g1 = sample(letters[1:2], 10, TRUE), g2 = sample(letters[1:2], 10, TRUE)) set(DT, 1L, 3L, NA) set(DT, 1L, 4L, NA) set(DT, 6L, 3L, NA) set(DT, 6L, 4L, NA) DT[, mean(X*Y), by = .(g1,g2)] Here we can see there are up to 5
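A simple sketch of the filter-first approach: subset away the rows whose grouping keys are NA in `i` before grouping, so the expensive expression in `j` is never evaluated for the NA group. (The seed and the resulting group counts below are illustrative, not from the original post.)

```r
library(data.table)
set.seed(42)
DT <- data.table(X = sample(10), Y = sample(10),
                 g1 = sample(letters[1:2], 10, TRUE),
                 g2 = sample(letters[1:2], 10, TRUE))
set(DT, 1L, 3L, NA)   # put an NA into g1
set(DT, 6L, 4L, NA)   # put an NA into g2

# Excluding NA-keyed rows up front means the computation in j
# simply never runs for the NA groups
res <- DT[!is.na(g1) & !is.na(g2), .(V1 = mean(X * Y)), by = .(g1, g2)]
```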

na.locf fill NAs up to maxgap even if gap > maxgap, with groups

99封情书 submitted on 2019-12-04 11:30:45
Question: I've seen a solution to this, but can't get it to work for groups (Fill NA in a time series only to a limited number), and thought there must be a neater way to do this too. Say I have the following dt: dt <- data.table(ID = c(rep("A", 10), rep("B", 10)), Price = c(seq(1, 10, 1), seq(11, 20, 1))) dt[c(1:2, 5:10), 2] <- NA dt[c(11:13, 15:19) ,2] <- NA dt ID Price 1: A NA 2: A NA 3: A 3 4: A 4 5: A NA 6: A NA 7: A NA 8: A NA 9: A NA 10: A NA 11: B NA 12: B NA 13: B NA 14: B 14 15: B NA 16: B
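`zoo::na.locf` with `maxgap` leaves a gap *entirely* unfilled once it exceeds `maxgap`, whereas the question asks to fill the first `maxgap` entries of every gap. A small hand-rolled helper, applied per group with `by`, sketches that behaviour (the helper name `fill_limited` is mine, not from the thread):

```r
library(data.table)

# Fill forward at most `maxgap` consecutive NAs, continuing to fill the
# first `maxgap` entries even when the whole gap is longer
fill_limited <- function(x, maxgap) {
  last <- NA
  run <- 0L
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      run <- run + 1L
      if (run <= maxgap && !is.na(last)) x[i] <- last
    } else {
      last <- x[i]
      run <- 0L
    }
  }
  x
}

dt <- data.table(ID = c(rep("A", 10), rep("B", 10)),
                 Price = as.numeric(c(1:10, 11:20)))
dt[c(1:2, 5:10), Price := NA_real_]
dt[c(11:13, 15:19), Price := NA_real_]

dt[, Price := fill_limited(Price, maxgap = 2), by = ID]
```

Leading NAs stay NA because there is no prior value to carry, and each group is handled independently thanks to `by = ID`.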

Interpolate multiple NA values with R

断了今生、忘了曾经 submitted on 2019-12-04 11:06:58
I want to interpolate multiple NA values in a matrix called tester. This is a part of tester with only one column of NA values; in the full 744x6 matrix other columns have NAs as well: ZONEID TIMESTAMP U10 V10 U100 V100 1 20121022 12:00 -1.324032e+00 -2.017107e+00 -3.278166e+00 -5.880225574 1 20121022 13:00 -1.295168e+00 NA -3.130429e+00 -6.414975148 1 20121022 14:00 -1.285004e+00 NA -3.068829e+00 -7.101699541 1 20121022 15:00 -9.605904e-01 NA -2.332645e+00 -7.478168285 1 20121022 16:00 -6.268261e-01 -3.057278e+00 -1.440209e+00 -8.026791079 I have installed the zoo package and used the
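Since the asker already has zoo installed, `zoo::na.approx()` is the natural fit: applied to a matrix, it linearly interpolates each column independently between the nearest non-NA neighbours. A small stand-in matrix (rounded from the values shown above, not the real 744x6 data):

```r
library(zoo)

# Toy stand-in for the 'tester' matrix: one column with interior NAs
tester <- cbind(U10 = c(-1.32, -1.29, -1.28, -0.96, -0.63),
                V10 = c(-2.02, NA, NA, NA, -3.06))

# na.approx() interpolates linearly, column by column
filled <- na.approx(tester)
```

The three NAs in the second column become evenly spaced steps between -2.02 and -3.06. Note that leading or trailing NAs have only one neighbour and are left NA unless you pass `rule = 2` to extend the end values.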

R data.table NA type consistency

痴心易碎 submitted on 2019-12-04 05:30:50
Question: dt = data.table(x = c(1,1,2,2,2,2,3,3,3,3)) dt[, y := if(.N > 2) .N else NA, by = x] # fail dt[, y := if(.N > 2) .N else NA_integer_, by = x] # good The first grouping fails because NA has a type, and it's not integer. Is there a way to tell data.table to ignore that and coerce all NAs to whatever type keeps the column consistent? I can manually set NA_integer_ here, but if I have lots of columns of different types, it's hard to set every NA type correctly. BTW, what NA type should I use for Date
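The typed literals are `NA_integer_`, `NA_real_`, `NA_character_`, and `as.Date(NA)` for Date columns. One generic trick that avoids naming the type at all is to subscript the value with an NA index, which yields an NA of the vector's own type. A sketch (the `z` column is my illustration of the trick, not from the thread):

```r
library(data.table)
dt <- data.table(x = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

# Explicit typed NA keeps the column type consistent across groups
dt[, y := if (.N > 2) .N else NA_integer_, by = x]

# Generic alternative: subscripting with an NA index returns an NA of the
# vector's own type, so no typed literal needs to be spelled out
dt[, z := .N[if (.N > 2) 1L else NA_integer_], by = x]
```

`.N[NA_integer_]` is `NA_integer_`, `c("a")[NA_integer_]` is `NA_character_`, and so on, which makes the pattern reusable across columns of different types.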

The difference between na.rm and na.omit in R

心不动则不痛 submitted on 2019-12-04 05:27:20
I've just started with R and I've executed these statements: library(datasets) head(airquality) s <- split(airquality,airquality$Month) sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)}) lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) }) For sapply , it returns the following: 5 6 7 8 9 Ozone 23.61538 29.44444 59.115385 59.961538 31.44828 Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 Wind 11.62258 10.26667 8.941935 8.793548 10.18000 And for lapply , it returns the following: $`5` Ozone Solar.R Wind 24.12500 182.04167
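The two results differ because the NAs are dropped at different stages. `na.rm = TRUE` removes NAs independently within each column's mean, while `na.omit()` first deletes every row containing any NA and only then computes the means, so a row with one missing value is excluded from all three columns. A minimal contrast on a toy frame:

```r
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

# na.rm = TRUE: drop NAs separately inside each column's computation
colMeans(df, na.rm = TRUE)   # a = 2.0, b = 4.5

# na.omit(): delete whole rows containing any NA first, then compute;
# here only the first row survives
colMeans(na.omit(df))        # a = 1, b = 4
```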

How to avoid NA columns in dcast() output?

折月煮酒 submitted on 2019-12-04 04:27:27
Question: How can I avoid NA columns in the dcast() output from the reshape2 package? In this dummy example the dcast() output will include an NA column: require(reshape2) data(iris) iris[ , "Species2"] <- iris[ , "Species"] iris[ 2:7, "Species2"] <- NA (x <- dcast(iris, Species ~ Species2, value.var = "Sepal.Width", fun.aggregate = length)) ## Species setosa versicolor virginica NA ##1 setosa 44 0 0 6 ##2 versicolor 0 50 0 0 ##3 virginica 0 0 50 0 For a somewhat similar use case, table() does have an
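One straightforward way, sketched below: subset the NA rows away before casting, so no NA column is ever created. (Whether you want to silently drop those six rows or count them somewhere is a design choice; this sketch drops them.)

```r
library(reshape2)
data(iris)
iris$Species2 <- iris$Species
iris[2:7, "Species2"] <- NA

# Casting only the rows with a non-NA cast key avoids the NA column entirely
x <- dcast(iris[!is.na(iris$Species2), ],
           Species ~ Species2,
           value.var = "Sepal.Width",
           fun.aggregate = length)
```

The setosa count drops from 50 to 44 because the six NA'd rows are excluded rather than gathered into an NA column.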

How to select rows by group with the minimum value when the data contains NAs in R

女生的网名这么多〃 submitted on 2019-12-04 04:19:45
Question: Here is an example: set.seed(123) data<-data.frame(X=rep(letters[1:3], each=4),Y=sample(1:12,12),Z=sample(1:100, 12)) data[data==3]<-NA What I want to achieve is to select the unique row of each X with the minimum Y, ignoring NA s: a 4 68 b 1 4 c 2 64 What's the best way to do that? Answer 1: Using the data.table package, this is trivial: library(data.table) d <- data.table(data) d[, min(Y, na.rm=TRUE), by=X] You can also use plyr and its ddply function: library(plyr) ddply(data, .(X), summarise, min(Y, na
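Note that `d[, min(Y, na.rm=TRUE), by=X]` returns only the minimum value, while the desired output keeps the whole row (including Z). A sketch that keeps the full row uses `.SD[which.min(Y)]`, filtering the NA rows first so `which.min` operates on complete values:

```r
library(data.table)
set.seed(123)
data <- data.frame(X = rep(letters[1:3], each = 4),
                   Y = sample(1:12, 12),
                   Z = sample(1:100, 12))
data[data == 3] <- NA

d <- as.data.table(data)
# .SD[which.min(Y)] returns the entire row of the per-group minimum
res <- d[!is.na(Y), .SD[which.min(Y)], by = X]
```

The exact values depend on the RNG algorithm of your R version, but the shape is always one complete row per group with a non-NA Y.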

Replacing NA's in each column of matrix with the median of that column

只谈情不闲聊 submitted on 2019-12-04 03:56:52
Question: I am trying to replace the NAs in each column of a matrix with the median of that column; however, when I try to use lapply or sapply I get an error. The code works when I use a for-loop and change one column at a time. What am I doing wrong? Example: set.seed(1928) mat <- matrix(rnorm(100*110), ncol = 110) mat[sample(1:length(mat), 700, replace = FALSE)] <- NA mat1 <- mat2 <- mat mat1 <- lapply(mat1, function(n) { mat1[is.na(mat1[,n]),n] <- median(mat1[,n], na.rm = TRUE) } ) for (n
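The likely culprit: `lapply()` on a matrix iterates over its individual *elements* (a matrix is just a vector with a dim attribute), not over its columns, so `mat1[, n]` receives element values rather than column indices. `apply()` with `MARGIN = 2` visits whole columns and is the idiomatic fix:

```r
set.seed(1928)
mat <- matrix(rnorm(100 * 110), ncol = 110)
mat[sample(seq_along(mat), 700)] <- NA

# apply() with MARGIN = 2 passes each column to the function in turn and
# reassembles the results into a matrix of the same shape
mat1 <- apply(mat, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
```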