subset | 易学教程

R subset rows by criteria and by factor group

Question: I have this data.frame with a lot of NAs:

    df <- data.frame(a = rep(letters[1:3], each = 3),
                     b = c(NA, NA, NA, 1, NA, 3, NA, NA, 7))

    > df
      a  b
    1 a NA
    2 a NA
    3 a NA
    4 b  1
    5 b NA
    6 b  3
    7 c NA
    8 c NA
    9 c  7

I would like to subset this data frame to keep only the factor groups that have at least two non-NA values, such as this:

      a  b
    1 b  1
    2 b NA
    3 b  3

I have tried this call, but it doesn't work:

    subset(df, sum(!is.na(b)) < 1, by = a)
    > [1] a b
    <0 rows> (or 0-length row.names)

Any suggestions?
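One possible approach, offered as a minimal base-R sketch rather than the accepted answer: count the non-NA values per group with `ave()` and keep the rows whose group count is at least two. The names `non_na_per_group` and `kept` are illustrative only.

```r
df <- data.frame(a = rep(letters[1:3], each = 3),
                 b = c(NA, NA, NA, 1, NA, 3, NA, NA, 7))

# For every row, the number of non-NA values of b within that row's group a
non_na_per_group <- ave(as.integer(!is.na(df$b)), df$a, FUN = sum)

# Keep only the rows whose group has at least two non-NA values
kept <- df[non_na_per_group >= 2, ]
kept
```

Note that `subset()` has no `by` argument, so the condition in the original attempt is evaluated once over the whole column instead of once per group.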

Subset a dataframe based on a single condition applied to multiple columns

Question: I've had a look through the existing subset Q&As on this site and couldn't quite find what I was looking for. I want to subset a data frame based on one condition (e.g. the value is below 5). However, I only want the rows where the value in all of the columns is below 5. For example, using the iris dataset, I would like to select all the rows where columns 1-3 all have values below 5.

    subdata <- iris[which(iris[,1:3]<5),]

This doesn't do it for me. I get lots of NA rows at the bottom of
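A hedged sketch of one way to express "all of columns 1-3 are below 5" in base R: compare the three columns at once, then keep the rows where all three comparisons are TRUE. The name `all_below` is illustrative.

```r
# Logical matrix (150 x 3): TRUE where the value is below 5
below <- iris[, 1:3] < 5

# A row qualifies only if all three of its comparisons are TRUE
all_below <- rowSums(below) == 3

subdata <- iris[all_below, ]
head(subdata)
```

The NA rows in the original attempt come from `which()` being applied to a matrix: it returns positions counted through the whole matrix, and positions larger than the number of rows index nonexistent rows.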

subsetting dataframe in R using two criteria, one of them is regular expression

Question: I have a dataset something like this:

    col_a col_b    col_c
    1     abc_boy  1
    2     abc_boy  2
    1     abc_girl 1
    2     abc_girl 2

I need to pick the first row only, based on col_b and col_c, and then change the value in col_c, with something like this:

    df[grep("_boy$",df[,"col_b"]) & df[,"col_c"]=="1","col_c"] <- "yes"

But the code above is not OK, since the first criterion and the second criterion do not originate from the same set. I can do it in a dumb way by using an explicit loop, or do a "two-tier"
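A minimal sketch of one fix, assuming the data frame looks like the table in the question with col_c stored as character: `grepl()` returns a logical vector with one entry per row, so it can be combined with the second condition using `&`, whereas `grep()` returns integer positions. The name `hit` is illustrative.

```r
df <- data.frame(col_a = c(1, 2, 1, 2),
                 col_b = c("abc_boy", "abc_boy", "abc_girl", "abc_girl"),
                 col_c = c("1", "2", "1", "2"),
                 stringsAsFactors = FALSE)

# Both conditions are logical vectors of length nrow(df)
hit <- grepl("_boy$", df$col_b) & df$col_c == "1"

df[hit, "col_c"] <- "yes"
df
```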

Subsetting one matrix based on another matrix

Question: I would like to select the values of R based on the strings in G, to obtain separate outputs with equal dimensions. These are my inputs:

    R <- 'pr_id sample1 sample2 sample3
    AX-1 100 120 130
    AX-2 150 180 160
    AX-3 160 120 196'
    R <- read.table(text=R, header=T)

    G <- 'pr_id sample1 sample2 sample3
    AX-1 AB AA AA
    AX-2 BB AB NA
    AX-3 BB AB AA'
    G <- read.table(text=G, header=T)

These are my expected outputs:

    AA <- 'pr_id sample1 sample2 sample3
    AX-1 NA 120 130
    AX-2 NA NA NA
    AX-3 NA NA 196'
    AA <- read.table(text=AA,
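A hedged sketch of how the AA output could be produced, assuming R and G share the same row and column order: convert the sample columns to matrices, then blank out every value of R whose matching entry in G is not "AA". The names `vals` and `calls` are illustrative, and `stringsAsFactors = FALSE` is added so the comparison works on plain strings.

```r
R <- read.table(text = "pr_id sample1 sample2 sample3
AX-1 100 120 130
AX-2 150 180 160
AX-3 160 120 196", header = TRUE)

G <- read.table(text = "pr_id sample1 sample2 sample3
AX-1 AB AA AA
AX-2 BB AB NA
AX-3 BB AB AA", header = TRUE, stringsAsFactors = FALSE)

vals  <- as.matrix(R[, -1])   # numeric values
calls <- as.matrix(G[, -1])   # genotype strings, with NA for missing calls

# Anything that is missing in G or is not "AA" becomes NA
vals[is.na(calls) | calls != "AA"] <- NA

AA <- data.frame(pr_id = R$pr_id, vals)
AA
```

The same mask with "AB" or "BB" in place of "AA" would give the other outputs with identical dimensions.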

R error promise already under evaluation when using subset in function but no error in script

I'm getting a strange error when I run the following function:

    TypeIDs=c(18283,18284,17119,17121,17123,17125,17127,17129,17131,17133,18367,18369,18371,18373,18375,18377,18379)

    featsave<-function(featfile,TypeIDs=TypeIDs) {
      mydata1<-read.table(featfile,header=TRUE)
      mydata2<-subset(mydata1,TypeID %in% TypeIDs)
      mydata<-as.data.frame(cast(mydata2, Feat1 + Feat2 + ID ~ TypeID,value="value"))
      save(mydata,file="mydatafile.Rdata",compress=TRUE)
      return(mydata)
    }

with the following data:

    Feat1 Feat2 ID Feat3 Feat4 TypeID value
    1     1     1  6     266   18283  280.00
    1     1     1  6     266   18284  20.00
    1     1     1  6     266   18285  0.00
    1     1     1
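A likely cause, shown here as a small reproduction and not as the thread's confirmed answer: the default `TypeIDs=TypeIDs` refers to the argument itself, so evaluating the default asks for its own value. The functions `bad` and `good` and the parameter name `ids` are hypothetical stand-ins for `featsave`.

```r
TypeIDs <- c(18283, 18284, 17119)

# Self-referential default: the argument's default is the argument itself
bad <- function(x, TypeIDs = TypeIDs) x %in% TypeIDs

# Capture the error instead of stopping the script
err <- tryCatch(bad(18283), error = function(e) e)
inherits(err, "error")

# Giving the parameter a different name lets the default reach the global vector
good <- function(x, ids = TypeIDs) x %in% ids
good(18283)
```

In a script run at top level there is no such argument, which would explain why the same lines work outside the function.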

Most efficient way of subsetting dataframes

Question: Can anyone suggest a more efficient way of subsetting a data frame without using SQL/indexing/data.table options? I looked for similar questions, and this one suggests the indexing option. Here are ways to subset, with timings.

    #Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y=runif(1000000, 1, 1000))

    #Subset and time
    system.time(x <- dat[dat$x > 500, ])
    #    user  system elapsed
    #   0.092   0.000   0.090
    system.time(x <- dat[which(dat$x > 500), ])
    #    user  system elapsed
    #   0.040   0.032   0.070
    system.time

Determine which column name is causing 'undefined columns selected' error when using subset()

Question: I'm trying to subset a large data frame from a very large data frame, using

    data.new <- subset(data, select = vector)

where vector is a character vector containing the column names I'm trying to isolate. When I do this I get

    Error in `[.data.frame`(x, r, vars, drop = drop) : undefined columns selected

Is there a way to identify which specific column name in the vector is undefined? Through trial and error I've narrowed it down to about 400, but that still doesn't help.

Answer 1: Find the elements
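A minimal sketch of one way to find the offending names, using a small made-up data frame: `setdiff()` returns the requested names that are not column names. The objects `data`, `wanted`, and `missing_cols` are illustrative stand-ins for the asker's data.

```r
data   <- data.frame(a = 1:3, b = 4:6, c = 7:9)
wanted <- c("a", "b", "z")   # "z" is deliberately not a column of data

# Requested names that do not exist in the data frame
missing_cols <- setdiff(wanted, names(data))
missing_cols

# Subset with only the names that do exist
data.new <- data[, intersect(wanted, names(data)), drop = FALSE]
```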

How to assign same color to factors across plots in a nested loop for ggplot?

Question: I am trying to use scale_fill_manual to assign corresponding colors to factors across many plots in a nested for loop. However, the resulting plots end up all being black. My overall loop is as follows:

    for (i in seq(from=0, to=100, by=10)) {
      for (j in seq(from=0, to=100, by=10)) {
        print(ggplot(aes(x, y), data = df) +
                geom_point(inherit.aes = FALSE,
                           data = subset(df, factor_x==i & factor_y==j),
                           aes(x, y, size=point, color=Group)) +
                theme_bw())
      }
    }

I am trying to assign each factor in "Group" its own color

In R: subset or dplyr::filter with variable from vector

Question:

    df <- data.frame(a=LETTERS[1:4], b=rnorm(4) )
    vals <- c("B","D")

I can filter/subset df with the values in vals with:

    dplyr::filter(df, a %in% vals)
    subset(df, a %in% vals)

Both give:

      a         b
    2 B 0.4481627
    4 D 0.2916513

What if I have a variable name in a vector, e.g.:

    > names(df)[1]
    [1] "a"

Then it doesn't work, I guess because it's quoted:

    dplyr::filter(df, names(df)[1] %in% vals)
    [1] a b
    <0 rows> (or 0-length row.names)

How do you do this?

UPDATE (what if it's dplyr::tbl_df(df)): Answers below work
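One base-R way to do this, given as a sketch and not necessarily what the answers in the thread propose: fetch the column by its name with `[[`, which takes a character string, and use the resulting vector in the condition. The name `col` is illustrative.

```r
set.seed(1)
df   <- data.frame(a = LETTERS[1:4], b = rnorm(4))
vals <- c("B", "D")

col <- names(df)[1]   # the string "a"

# df[[col]] is the column itself, so the comparison runs on its values
res <- df[df[[col]] %in% vals, ]
res
```

The original call returns zero rows because `names(df)[1] %in% vals` compares the string "a" with the values, which yields a single FALSE.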

Data Frame Subset Performance

Question: I have a couple of large data frames (1 million+ rows x 6-10 columns) that I need to subset repeatedly. The subsetting section is the slowest part of my code, and I am curious whether there is a way to do this faster.

    load("https://dl.dropbox.com/u/4131944/Temp/DF_IOSTAT_ALL.rda")
    start_in <- strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M")
    end_in<- strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M")
    system.time(DF_IOSTAT_INT <- DF_IOSTAT_ALL[DF_IOSTAT_ALL$date_stamp >= start_in & DF_IOSTAT_ALL$date_stamp <= end_in
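One thing that may be worth trying, sketched here on made-up data because the original file is not available: `strptime()` returns POSIXlt, a list-based class, while POSIXct stores each time as a single number, so comparisons against POSIXct bounds are plain numeric comparisons. The data frame `dat` is a hypothetical stand-in for DF_IOSTAT_ALL, and whether this helps depends on how `date_stamp` is actually stored.

```r
set.seed(1)
day_start <- as.POSIXct("2012-08-20 00:00", tz = "UTC")

# Hypothetical stand-in for DF_IOSTAT_ALL: 1000 time stamps within one day
dat <- data.frame(date_stamp = day_start + sort(sample(0:86399, 1000)),
                  value      = rnorm(1000))

# Convert the bounds to POSIXct once, outside any repeated subsetting
start_in <- as.POSIXct(strptime("2012-08-20 13:00", "%Y-%m-%d %H:%M", tz = "UTC"))
end_in   <- as.POSIXct(strptime("2012-08-20 17:00", "%Y-%m-%d %H:%M", tz = "UTC"))

keep   <- dat$date_stamp >= start_in & dat$date_stamp <= end_in
window <- dat[keep, ]
```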