subset

How do I select rows by two criteria in data.table in R

落爺英雄遲暮 submitted on 2019-11-29 01:50:30
Question: Let's say I have a data.table and I want to select all the rows where the variable x has a value of b. That is easy:
    library(data.table)
    DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
    setkey(DT,x) # set a 1-column key
    DT["b"]
By the way, it appears that one has to set a key; if the key is not set to x, this does not work. And what would happen if I set two columns as keys? Anyway, moving along, let's say that I want to select all the rows where the variable x was a

Set of all subsets

纵然是瞬间 submitted on 2019-11-29 01:35:14
In Python 2 I could use
    def subsets(mySet):
        return reduce(lambda z, x: z + [y + [x] for y in z], mySet, [[]])
to find all subsets of mySet. Python 3 has removed the built-in reduce. What would be an equally concise rewrite of this for Python 3?
Here's a list of several possible implementations of the power set (the set of all subsets) algorithm in Python. Some are recursive, some are iterative, and some of them don't use reduce. Plenty of options to choose from! The function reduce() can always be replaced by a for loop. Here's a Python implementation of reduce():
    def reduce(function, iterable, start=None):
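As a hedged sketch of the two ideas in this snippet: in Python 3 the reduce function is still available as functools.reduce, so the original one-liner can be kept almost unchanged, and the same result can also be built with a plain for loop. The names subsets and subsets_loop are illustrative, not taken from the truncated answer.

```python
from functools import reduce


def subsets(items):
    """Return the power set of items as a list of lists, using functools.reduce."""
    # Start from the empty subset; each element x doubles the collection by
    # appending x to a copy of every subset found so far.
    return reduce(lambda acc, x: acc + [s + [x] for s in acc], items, [[]])


def subsets_loop(items):
    """Same result, with the reduce call unrolled into an explicit for loop."""
    result = [[]]
    for x in items:
        result = result + [s + [x] for s in result]
    return result


if __name__ == "__main__":
    print(subsets([1, 2, 3]))
    print(subsets_loop([1, 2, 3]))
```

Both versions build 2**n lists for n input elements, so they are only practical for small inputs.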

Difference between subarray, subset & subsequence

与世无争的帅哥 submitted on 2019-11-29 00:00:07
Question: I'm a bit confused about the difference between subarray, subsequence and subset. If I have {1,2,3,4}, then a subsequence can be {1,2,4} or {2,4} etc. So basically I can omit some elements but keep the order. A subarray (say, of size 3) would be {1,2,3} or {2,3,4}. Then what would be the subset? I'm a bit confused between these three.
Answer 1: In my opinion, if the given pattern is an array, the so-called subarray means a contiguous subsequence. For example, given {1, 2, 3, 4}, a subarray can be {1, 2, 3}, {2, 3, 4}, etc. While the
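The distinction drawn above can be sketched in a few lines of Python. This is only an illustration under the usual definitions (a subarray is contiguous, a subsequence keeps order but may skip elements, a subset ignores order); the helper names subarrays, subsequences and is_subset are hypothetical, and the definition of subset is assumed because the answer is cut off before it reaches it.

```python
from itertools import combinations


def subarrays(seq):
    """All non-empty contiguous slices of seq."""
    n = len(seq)
    return [seq[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def subsequences(seq):
    """All non-empty selections that keep the original order but may skip elements."""
    # combinations() emits elements in input order, which is exactly the
    # order-preserving property a subsequence needs.
    return [list(c) for r in range(1, len(seq) + 1) for c in combinations(seq, r)]


def is_subset(candidate, seq):
    """True if every element of candidate occurs in seq; order is ignored."""
    return set(candidate) <= set(seq)


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print([1, 2, 4] in subarrays(data))     # skips 3, so not contiguous
    print([1, 2, 4] in subsequences(data))  # order kept, element skipped
    print(is_subset([4, 1], data))          # order does not matter
```

Every subarray is also a subsequence, and the elements of every subsequence form a subset, but not the other way round.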

Read FASTA into a dataframe and extract subsequences of FASTA file

时间秒杀一切 submitted on 2019-11-28 23:28:36
I have a small FASTA file of DNA sequences which looks like this:
    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
Questions:
1) How can I read this FASTA file into R as a data frame where each row is a sequence record, the 1st column is the refseqID and the 2nd column is the sequence?
2) How can I extract the subsequence at a (start, end) location?
    NM_000016 1 3 #"ACA"
    NM_000775 2 6 #"TAACC"
    NM_003820 3 5 #"TTC"
You should have a look

R subsetting a data frame into multiple data frames based on multiple column values

大兔子大兔子 submitted on 2019-11-28 23:10:18
I am trying to subset a data frame so that I get multiple data frames based on multiple column values. Here is my example:
    > df
    v1 v2 v3 v4 v5
    A  Z  1  10 12
    D  Y  10 12 8
    E  X  2  12 15
    A  Z  1  10 12
    E  X  2  14 16
The expected output is something like this, where I am splitting this data frame into multiple data frames based on columns v1 and v2:
    > df1
    v3 v4 v5
    1  10 12
    1  10 12
    > df2
    v3 v4 v5
    10 12 8
    > df3
    v3 v4 v5
    2  12 15
    2  14 16
I have written code which works right now, but I don't think it's the best way to do it. There must be a better way. Assuming tab is the data.frame having the initial data

Difference between subset and filter from dplyr

我们两清 submitted on 2019-11-28 22:36:42
It seems to me that subset and filter (from dplyr) give the same result. But my question is: is there at some point a potential difference, for example in speed, the data sizes they can handle, etc.? Are there occasions when it is better to use one or the other? Example:
    library(dplyr)
    df1<-subset(airquality, Temp>80 & Month > 5)
    df2<-filter(airquality, Temp>80 & Month > 5)
    summary(df1$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14
    summary(df2$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14
They are, indeed,

SQL: How To Select Earliest Row

我怕爱的太早我们不能终老 submitted on 2019-11-28 18:39:27
I have a report that looks something like this:
    CompanyA Workflow27 June5
    CompanyA Workflow27 June8
    CompanyA Workflow27 June12
    CompanyB Workflow13 Apr4
    CompanyB Workflow13 Apr9
    CompanyB Workflow20 Dec11
    CompanyB Workflow20 Dec17
This is done with SQL (specifically, T-SQL on SQL Server 2005):
    SELECT company, workflow, date FROM workflowTable
I would like the report to show just the earliest date for each workflow:
    CompanyA Workflow27 June5
    CompanyB Workflow13 Apr4
    CompanyB Workflow20 Dec11
Any ideas? I can't figure this out. I've tried using a nested select that returns the earliest tray

R: Why is the [[ ]] approach for subsetting a list faster than using $?

二次信任 submitted on 2019-11-28 18:10:06
I've been working on a few projects that have required me to do a lot of list subsetting, and while profiling code I realised that the object[["nameHere"]] approach to subsetting lists was usually faster than the object$nameHere approach. As an example, if we create a list with named components:
    a.long.list <- as.list(rep(1:1000))
    names(a.long.list) <- paste0("something",1:1000)
Why is this:
    system.time ( for (i in 1:10000) { a.long.list[["something997"]] } )
    user system elapsed
    0.15 0.00 0.16
faster than this:
    system.time ( for (i in 1:10000) { a.long.list$something997 } )
    user system elapsed
    0

How to plot a subset of a data frame in R?

徘徊边缘 submitted on 2019-11-28 17:58:37
Is there a simple way to do this in R:
    plot(var1,var2, for all observations in the data frame where var3 < 155)
It is possible by creating a new data frame
    newdata <- data[which( data$var3 < 155),]
but then I have to redefine all the variables (newvar1 <- newdata$var1 etc.).
    with(dfr[dfr$var3 < 155,], plot(var1, var2))
should do the trick. Edit regarding multiple conditions:
    with(dfr[(dfr$var3 < 155) & (dfr$var4 > 27),], plot(var1, var2))
Most straightforward option:
    plot(var1[var3<155],var2[var3<155])
It does not look good because of the code redundancy, but it is OK for quick-and-dirty hacking. This is how I

Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)

夙愿已清 submitted on 2019-11-28 16:48:09
Question: This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, answered thoroughly here. The difference is that I would like to simultaneously apply a different function to another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:
    dt = data.table(grp = sample(letters[1:3],100, replace = TRUE), v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))
    sd.cols = c("v2", "v3")
    dt.out = dt[,