subset | 易学教程

How do I select rows by two criteria in data.table in R

Question: Let's say I have a data.table and I want to select all the rows where the variable x has a value of b. That is easy:

    library(data.table)
    DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
    setkey(DT,x) # set a 1-column key
    DT["b"]

By the way, it appears that one has to set a key; if the key is not set to x, then this does not work. By the way, what would happen if I set two columns as keys? Anyway, moving along, let's say that I want to select all the rows where the variable x was a

Set of all subsets

In Python 2 I could use

    def subsets(mySet):
        return reduce(lambda z, x: z + [y + [x] for y in z], mySet, [[]])

to find all subsets of mySet. Python 3 has removed the built-in reduce. What would be an equally concise rewrite of this for Python 3?

Here's a list of several possible implementations of the power set (the set of all subsets) algorithm in Python. Some are recursive, some are iterative, some of them don't use reduce. Plenty of options to choose from!

The function reduce() can always be replaced by a for loop. Here's a Python implementation of reduce():

    def reduce(function, iterable, start=None):
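As a side note to the excerpt above: in Python 3, reduce is still available as functools.reduce, so the original one-liner can be kept almost unchanged. The sketch below shows that variant next to a plain for-loop equivalent. The names my_set and subsets_loop are illustrative choices, not taken from the original answers.

```python
from functools import reduce  # reduce lives in functools in Python 3


def subsets(my_set):
    # Start with the empty subset; for each element x, keep every subset
    # built so far and add a copy of each one with x appended.
    return reduce(lambda z, x: z + [y + [x] for y in z], my_set, [[]])


def subsets_loop(my_set):
    # Same algorithm written as an explicit loop instead of reduce.
    result = [[]]
    for x in my_set:
        result = result + [y + [x] for y in result]
    return result


if __name__ == "__main__":
    print(subsets([1, 2, 3]))
    # [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
```

Both functions return a list of lists with 2**n entries for n input elements, so they are only practical for small inputs.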

Difference between subarray, subset & subsequence

Question: I'm a bit confused between subarray, subsequence & subset. If I have {1,2,3,4}, then a subsequence can be {1,2,4} or {2,4} etc. So basically I can omit some elements but keep the order. A subarray (say, of size 3) would be {1,2,3} or {2,3,4}. Then what would be the subset? I'm a bit confused between these 3.

Answer 1: In my opinion, if the given pattern is an array, the so-called subarray means a contiguous subsequence. For example, if given {1, 2, 3, 4}, a subarray can be {1, 2, 3}, {2, 3, 4}, etc. While the
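The distinction discussed above can be made concrete with a short Python sketch. It assumes the usual definitions: a subarray is a contiguous slice, a subsequence keeps the original order but may skip elements, and a subset ignores order entirely. The function names are illustrative, and the subset helper assumes the input elements are distinct.

```python
from itertools import combinations


def subarrays(seq, size):
    # Contiguous runs only: slide a window of the given size over seq.
    return [tuple(seq[i:i + size]) for i in range(len(seq) - size + 1)]


def subsequences(seq, size):
    # combinations() picks elements in their original order, so gaps are
    # allowed but the relative order is preserved.
    return list(combinations(seq, size))


def subsets(seq, size):
    # Order is irrelevant for a subset, so each choice becomes a frozenset.
    return [frozenset(c) for c in combinations(seq, size)]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(subarrays(data, 3))          # [(1, 2, 3), (2, 3, 4)]
    print((1, 2, 4) in subsequences(data, 3))   # True
    print(frozenset({4, 2, 1}) in subsets(data, 3))  # True
```

Every subarray is also a subsequence, and every subsequence of distinct elements corresponds to a subset, but not the other way round: (1, 2, 4) is a subsequence of the example yet not a subarray.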

Read FASTA into a dataframe and extract subsequences of FASTA file

I have a small fasta file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:

1) How can I read this fasta file into R as a dataframe where each row is a sequence record, the 1st column is the refseqID and the 2nd column is the sequence?
2) How can I extract the subsequence at a (start, end) location?

    NM_000016 1 3 #"ACA"
    NM_000775 2 6 #"TAACC"
    NM_003820 3 5 #"TTC"

You should have a look
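The question asks for R, but the underlying logic is language-neutral, so here is a rough sketch in Python using only the standard library. It assumes that the record ID is the first whitespace-separated token after ">" and that (start, end) positions are 1-based and inclusive, which matches the three expected results in the question. The helper names parse_fasta and subseq are hypothetical.

```python
FASTA_TEXT = """\
>NM_000016 700 200 234
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
>NM_000775 700 124 236
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
>NM_003820 700 111 222
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
"""


def parse_fasta(text):
    # Return a list of (record_id, sequence) rows, one per FASTA record.
    records = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id = line[1:].split()[0]  # assumption: ID is the first token
            chunks = []
        else:
            chunks.append(line)  # sequences may span several lines
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def subseq(sequence, start, end):
    # Assumption: start and end are 1-based and inclusive.
    return sequence[start - 1:end]


if __name__ == "__main__":
    table = dict(parse_fasta(FASTA_TEXT))
    print(subseq(table["NM_000016"], 1, 3))  # ACA
    print(subseq(table["NM_000775"], 2, 6))  # TAACC
    print(subseq(table["NM_003820"], 3, 5))  # TTC
```

The list of (id, sequence) tuples plays the role of the two-column data frame described in the question.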

R subsetting a data frame into multiple data frames based on multiple column values

I am trying to subset a data frame, where I get multiple data frames based on multiple column values. Here is my example:

    >df
    v1 v2 v3 v4 v5
    A  Z  1  10 12
    D  Y  10 12 8
    E  X  2  12 15
    A  Z  1  10 12
    E  X  2  14 16

The expected output is something like this, where I am splitting this data frame into multiple data frames based on columns v1 and v2:

    >df1
    v3 v4 v5
    1  10 12
    1  10 12
    >df2
    v3 v4 v5
    10 12 8
    >df3
    v3 v4 v5
    2  12 15
    2  14 16

I have written code which is working right now, but I don't think that's the best way to do it. There must be a better way to do it.

Assuming tab is the data.frame having the initial data

Difference between subset and filter from dplyr

It seems to me that subset and filter (from dplyr) give the same result. But my question is: is there at some point a potential difference, e.g. speed, data sizes it can handle, etc.? Are there occasions when it is better to use one or the other? Example:

    library(dplyr)
    df1<-subset(airquality, Temp>80 & Month > 5)
    df2<-filter(airquality, Temp>80 & Month > 5)
    summary(df1$Ozone)
    # Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
    # 9.00 39.00 64.00 64.51 84.00 168.00 14
    summary(df2$Ozone)
    # Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
    # 9.00 39.00 64.00 64.51 84.00 168.00 14

They are, indeed,

SQL: How To Select Earliest Row

I have a report that looks something like this:

    CompanyA Workflow27 June5
    CompanyA Workflow27 June8
    CompanyA Workflow27 June12
    CompanyB Workflow13 Apr4
    CompanyB Workflow13 Apr9
    CompanyB Workflow20 Dec11
    CompanyB Workflow20 Dec17

This is done with SQL (specifically, T-SQL version Server 2005):

    SELECT company , workflow , date FROM workflowTable

I would like the report to show just the earliest dates for each workflow:

    CompanyA Workflow27 June5
    CompanyB Workflow13 Apr4
    CompanyB Workflow20 Dec11

Any ideas? I can't figure this out. I've tried using a nested select that returns the earliest tray
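One common way to get the earliest row per group is GROUP BY with MIN. The sketch below tries that idea against an in-memory SQLite database driven from Python, which stands in for SQL Server here; that substitution is an assumption made so the example is self-contained. The dates are stored as ISO strings with a made-up year (2010) so that MIN orders them correctly; the original report does not show a year.

```python
import sqlite3

# Sample rows mirroring the report; the year 2010 is a placeholder assumption.
rows = [
    ("CompanyA", "Workflow27", "2010-06-05"),
    ("CompanyA", "Workflow27", "2010-06-08"),
    ("CompanyA", "Workflow27", "2010-06-12"),
    ("CompanyB", "Workflow13", "2010-04-04"),
    ("CompanyB", "Workflow13", "2010-04-09"),
    ("CompanyB", "Workflow20", "2010-12-11"),
    ("CompanyB", "Workflow20", "2010-12-17"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workflowTable (company TEXT, workflow TEXT, date TEXT)")
conn.executemany("INSERT INTO workflowTable VALUES (?, ?, ?)", rows)

# One output row per (company, workflow) pair, carrying its earliest date.
earliest = conn.execute(
    """
    SELECT company, workflow, MIN(date) AS date
    FROM workflowTable
    GROUP BY company, workflow
    ORDER BY company, workflow
    """
).fetchall()
conn.close()

for company, workflow, date in earliest:
    print(company, workflow, date)
```

The SELECT statement uses only standard aggregate syntax, so the same query shape should carry over to T-SQL, provided the date column has a type that sorts chronologically.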

R: Why is the [[ ]] approach for subsetting a list faster than using $?

I've been working on a few projects that have required me to do a lot of list subsetting and while profiling code I realised that the object[["nameHere"]] approach to subsetting lists was usually faster than the object$nameHere approach. As an example if we create a list with named components:

    a.long.list <- as.list(rep(1:1000))
    names(a.long.list) <- paste0("something",1:1000)

Why is this:

    system.time ( for (i in 1:10000) { a.long.list[["something997"]] } )
    user system elapsed
    0.15 0.00 0.16

faster than this:

    system.time ( for (i in 1:10000) { a.long.list$something997 } )
    user system elapsed
    0

How to plot a subset of a data frame in R?

Is there a simple way to do this in R:

    plot(var1,var2, for all observations in the data frame where var3 < 155)

It is possible by creating a new data frame

    newdata <- data[which( data$var3 < 155),]

but then I have to redefine all the variables, newvar1 <- newdata$var1 etc.

    with(dfr[dfr$var3 < 155,], plot(var1, var2))

should do the trick. Edit regarding multiple conditions:

    with(dfr[(dfr$var3 < 155) & (dfr$var4 > 27),], plot(var1, var2))

Most straightforward option:

    plot(var1[var3<155],var2[var3<155])

It does not look good because of code redundancy, but is OK for quick-and-dirty hacking.

This is how I

Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)

Question: This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, answered thoroughly here. The difference is that I would like to simultaneously apply a different function on another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:

    dt = data.table(grp = sample(letters[1:3],100, replace = TRUE), v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))
    sd.cols = c("v2", "v3")
    dt.out = dt[,