Question:
I was using the prcomp function when I received this error:
Error in prcomp.default(x, ...) : cannot rescale a constant/zero column to unit variance
I know I can scan my data manually, but is there any function or command in R that can help me remove these constant variables? I know this is a very simple task, but I have never come across any function that does this.
Thanks,
Answer 1:
The problem here is that your column variance is equal to zero. You can check which columns of a data frame are constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
#   x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Supply names of columns that have 0 variance
# (index names(df) directly so a single match still returns the name)
names(df)[sapply(df, function(v) var(v, na.rm=TRUE) == 0)]
# [1] "y"
So if you want to exclude these columns, you can use:
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead, something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
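To tie this back to the original question, here is a minimal sketch (dat is a stand-in for your own numeric data frame): drop the zero-variance columns first, and prcomp no longer complains about rescaling a constant column.

dat <- data.frame(a = rnorm(10), b = rep(3, 10), c = runif(10))  # b is constant
dat_clean <- dat[, apply(dat, 2, var, na.rm = TRUE) != 0, drop = FALSE]
pca <- prcomp(dat_clean, scale. = TRUE)  # works: the constant column is gone
summary(pca)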
Answer 2:
This Q&A seems to be a popular Google search result, but the accepted answer is a bit slow for a large matrix, and I do not have enough reputation to comment on the first answer, so I am posting a new answer instead.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
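If an extra dependency is acceptable, the same min/max idea can be pushed into compiled code with the matrixStats package (a sketch, assuming matrixStats is installed and df is a numeric matrix):

library(matrixStats)
rng <- colRanges(df, na.rm = TRUE)        # one row per input column: min and max
df[, rng[, 1] != rng[, 2], drop = FALSE]  # keep columns whose min differs from max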
Here is the benchmark: it cuts more than 90% of the runtime of the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol)
df[, sample(1:ncol, 70, replace = FALSE)] <- rep(1, times = nrow)  # df is a large matrix

time1 <- system.time(df1 <- df[, apply(df, 2, var, na.rm=TRUE) != 0])  # the first method
time2 <- system.time(df2 <- df[, !apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))])  # my method
time3 <- system.time(df3 <- df[, apply(df, 2, function(col) { length(unique(col)) > 1 })])  # Keith's method

time1
#   user  system elapsed
# 22.267   0.194  22.626
time2
#   user  system elapsed
#  2.073   0.077   2.155
time3
#   user  system elapsed
#  6.702   0.060   6.790

all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Answer 3:
Since this Q&A is a popular Google search result, the accepted answer is a bit slow for a large matrix, and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
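For intuition, here is a rough sketch of the exponential-search idea (an illustration only, not the package's actual code, and it assumes the column contains no NAs): compare growing slices of the column against its first value and stop at the first difference, so non-constant columns are rejected after reading only a few rows.

is_constant_exp <- function(x) {
  n <- length(x)
  end <- 1L
  repeat {
    start <- end + 1L
    if (start > n) return(TRUE)    # scanned the whole column: constant
    end <- min(n, end * 10L)       # grow the window exponentially
    if (any(x[start:end] != x[1L])) return(FALSE)  # early exit on first difference
  }
}

is_constant_exp(rep(1, 1e6))  # TRUE, after scanning everything
is_constant_exp(1:1e6)        # FALSE, after looking at just a handful of rows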
First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
library(data.table)
library(dataPreparation)

ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow), ncol*nrow, replace = FALSE), ncol = ncol)
df <- apply(df, 2, function(x) {x[sample(c(1:nrow), floor(nrow/10))] <- NA; x})  # Add 10% of NAs
df[, sample(1:ncol, 70, replace = FALSE)] <- rep(1, times = nrow)  # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[, apply(df, 2, var, na.rm=TRUE) != 0, with = FALSE])  # the first method
time2 <- system.time(df2 <- df[, !apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = FALSE])  # raymkchow's method
time3 <- system.time(df3 <- df[, apply(df, 2, function(col) { length(unique(col)) > 1 }), with = FALSE])  # Keith's method
time4 <- system.time(df4 <- df[, -whichAreConstant(df, verbose = FALSE)])  # my method
The results are the following:
time1  # Variance approach
#  user  system elapsed
#  2.55    1.45    4.07
time2  # Min = max approach
#  user  system elapsed
#  2.72    1.5     4.22
time3  # length(unique()) approach
#  user  system elapsed
#  6.7     2.75    9.53
time4  # Exponential search approach
#  user  system elapsed
#  0.39    0.07    0.45

all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::whichAreConstant is about 10 times faster than the other approaches.
Moreover, the more rows you have, the more worthwhile it is to use.
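For standalone use outside the benchmark, a usage sketch (assuming the dataPreparation package is installed; the guard around the empty case is my addition):

library(dataPreparation)
constant_cols <- whichAreConstant(df, verbose = FALSE)  # indexes of constant columns
if (length(constant_cols) > 0) {
  df <- df[, -constant_cols, with = FALSE]  # guard: an empty index would select zero columns
}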