data.table

Counting overlaps as expected with R data.table foverlaps() or IRanges

Submitted by 六眼飞鱼酱① on 2020-06-27 12:44:55
Question: I'm having difficulty counting overlaps of intervals as I would expect. Here is an R data.table with intervals defined by start and end:

    > library(data.table)
    > dt1 = data.table(start=c(1, 5, 3), end=c(10, 15, 8))
    > print(dt1)
       start end
    1:     1  10
    2:     5  15
    3:     3   8

Here is how I would count overlaps for these intervals, from 0 to 20:

    [0, 1]: 0 (there are no intervals here)
    [1, 3]: 1 (there is only one interval here, [1, 10])
    [3, 5]: 2 (two intervals here, [1, 10] and [3, 8])
    [5, 8]: 3 …
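A minimal sketch of one way to get these counts with foverlaps(). The query bins are taken from the example; the epsilon shrink is an assumption, added because foverlaps() treats intervals as closed, so bins that merely touch an interval at an endpoint (e.g. [0, 1] against [1, 10]) would otherwise count as overlapping:

```r
# Count how many intervals in dt1 cover each query bin with foverlaps().
library(data.table)

dt1  <- data.table(start = c(1, 5, 3), end = c(10, 15, 8))
bins <- data.table(start = c(0, 1, 3, 5), end = c(1, 3, 5, 8))

setkey(dt1, start, end)   # foverlaps() requires y to be keyed on the interval

# foverlaps() uses closed intervals; shrinking the bin ends by a small
# epsilon gives the half-open behaviour the expected counts imply.
eps <- 1e-9
q  <- copy(bins)[, end := end - eps]
ov <- foverlaps(q, dt1, type = "any", which = TRUE)

# Unmatched bins appear with yid = NA; count real matches per bin.
counts <- ov[, .(n = sum(!is.na(yid))), by = xid]
```

IRanges::countOverlaps() from Bioconductor, which the title also mentions, uses closed integer intervals as well, so the same endpoint caveat applies there.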

Count rows matching a criteria relative to current row

Submitted by 对着背影说爱祢 on 2020-06-27 11:17:30
Question: I have a dataframe structured like this (but it actually has ~400k rows):

    library(data.table)
    df <- fread("
    id     start      end
    174095 2018-12-19 2018-12-31
    227156 2018-12-19 2018-12-31
    210610 2018-04-13 2018-09-27
    27677  2018-04-12 2018-04-26
    370474 2017-07-13 2017-08-19
    303693 2017-02-20 2017-04-09
    74744  2016-10-03 2016-11-05
    174095 2018-12-01 2018-12-20
    27677  2018-03-01 2018-05-29
    111111 2018-01-01 2018-01-31
    111111 2018-11-11 2018-12-31")

(Edited, thanks to Uwe.) For each row, I want to count how …
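The question text is cut off, so the exact criterion is unknown. As an illustrative guess based on the title, this sketch counts, for each row, how many other rows have a date range overlapping its own, using a non-equi self-join:

```r
# Hypothetical reading of the truncated question: per row, count the other
# rows whose [start, end] overlaps this row's range (across all ids; add
# `id` to `on` if the criterion should be within the same id).
library(data.table)

df <- fread("
id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")
df[, `:=`(start = as.IDate(start), end = as.IDate(end))]

# Row j overlaps row i when j.start <= i.end and j.end >= i.start.
cnt <- df[df, on = .(start <= end, end >= start), .N, by = .EACHI]
df[, n_overlap := cnt$N - 1L]   # minus 1 to exclude the row itself
```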

R data.table: what is the fastest way to intersect a data.table by multiple columns by keys and groups

Submitted by 泄露秘密 on 2020-06-26 14:27:57
Question: MAJOR EDIT to clarify, as the existing answers are wrong. I have a data.table with group columns (split_by), key columns (key_by) and trait id columns (intersect_by). Within each group defined by split_by, I want to keep only the rows whose trait ids are shared by all the keys present in that group. For example:

    dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2),
                     group_id1 = 1, group_id2 = c(1:2, 2:1, 1:2),
                     trait_id1 = 1, trait_id2 = 2:1)
    setkey(dt, group_id1, group_id2, trait_id1, trait_id2)
    dt
       id key1 key2 group_id1 …
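A sketch of one approach: count the distinct key combinations per group and per (group, trait), and keep rows where the two counts match, meaning the trait occurs with every key present in its group. Column names follow the example; real data would substitute the split_by/key_by/intersect_by sets:

```r
# Keep rows whose trait combination appears for every distinct key
# combination within its group (a set intersection done by counting).
library(data.table)

dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2),
                 group_id1 = 1, group_id2 = c(1:2, 2:1, 1:2),
                 trait_id1 = 1, trait_id2 = 2:1)

# Distinct key combinations present in each group ...
dt[, nk_group := uniqueN(.SD), by = .(group_id1, group_id2),
   .SDcols = c("key1", "key2")]
# ... and in each (group, trait) cell.
dt[, nk_trait := uniqueN(.SD),
   by = .(group_id1, group_id2, trait_id1, trait_id2),
   .SDcols = c("key1", "key2")]

res <- dt[nk_trait == nk_group]
```

Here row id 4 is dropped: its trait (1, 1) occurs for only one of the two key combinations in its group.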

Skipping rows until row with a certain value

Submitted by 主宰稳场 on 2020-06-26 11:32:22
Question: I need to read a .txt file from a URL, but I would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:

    ... (number of rows)
    ... (number of rows)
    ... (number of rows)
    DATE        VALUE
    1945-01-01  144855
    1946-01-01  138515
    1947-01-01  136405
    1948-01-01  135486
    1949-01-01  142455

I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards …
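fread() can do this directly: its skip argument also accepts a search string and starts reading at the first line that contains it. A self-contained sketch with a local stand-in for the FRED file (the preamble lines below are invented for illustration):

```r
# fread(skip = "string") jumps to the first line containing the string.
# The temp file stands in for the FRED download; its preamble is made up.
library(data.table)

tmp <- tempfile(fileext = ".txt")
writeLines(c("Title: Some Series",
             "Notes: preamble to be skipped",
             "DATE VALUE",
             "1945-01-01 144855",
             "1946-01-01 138515"), tmp)

dat <- fread(tmp, skip = "DATE")
# For the real file:
# fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", skip = "DATE")
```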

What is the fastest way and fastest format for loading large data sets into R [duplicate]

Submitted by 风流意气都作罢 on 2020-06-25 07:02:34
Question: This question already has answers here: Quickly reading very large tables as dataframes (11 answers). Closed 4 years ago. I have a large dataset (about 13 GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow, but every load after that should be as fast as possible. What is the fastest way, and the fastest format, from which to load a data set? My suspicion is that the optimal choice is something like saveRDS(obj, file = 'bigdata.Rda', …
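A small benchmarking sketch of two common candidates: uncompressed RDS versus data.table's fwrite/fread on CSV. The toy table size is an assumption (packages such as fst and qs are also frequent answers to this question and are worth adding to a real comparison):

```r
# Compare reload times of uncompressed RDS vs. CSV via fread().
# The toy table is far smaller than 13 GB; it only shows the pattern.
library(data.table)

obj <- data.table(x = runif(1e5), y = sample(letters, 1e5, replace = TRUE))

rds <- tempfile(fileext = ".rds")
csv <- tempfile(fileext = ".csv")

saveRDS(obj, rds, compress = FALSE)  # compress = FALSE trades disk for load speed
fwrite(obj, csv)

t_rds <- system.time(a <- readRDS(rds))["elapsed"]
t_csv <- system.time(b <- fread(csv))["elapsed"]
```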

R rolling join two data.tables with error margin on join

Submitted by ε祈祈猫儿з on 2020-06-25 06:33:29
Question: Note: this question is a copy of this one, but with different wording, and a suggestion to use data.table instead of dplyr. I have two datasets that contain scores for different patients at multiple measuring moments, like so:

    dt1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                      "Days" = c(0,10,25,340,100,538),
                      "Score" = c(NA,2,3,99,5,6),
                      stringsAsFactors = FALSE)
    dt2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2", …
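A sketch of the usual data.table pattern for this: a rolling join on Days within ID using roll = "nearest", followed by discarding matches outside a tolerance. The dt2 below is invented, since the excerpt cuts off mid-definition, and the 30-day margin is an assumption:

```r
# Match each dt1 measurement to the nearest dt2 measurement per patient,
# then blank out matches farther apart than `margin` days.
# dt2 and margin are hypothetical; the excerpt truncates before dt2 ends.
library(data.table)

dt1 <- data.table(ID = c("patient1", "patient1", "patient2"),
                  Days = c(0, 10, 100), Score = c(NA, 2, 5))
dt2 <- data.table(ID = c("patient1", "patient1", "patient2"),
                  Days = c(2, 12, 130), Score2 = c(1, 4, 7))

margin <- 30                       # allowed error margin, in days
dt2[, matched_day := Days]         # keep a copy: the join overwrites Days
setkey(dt2, ID, Days)

res <- dt2[dt1, roll = "nearest"]  # nearest Days within each ID
res[abs(matched_day - Days) > margin,
    `:=`(Score2 = NA_real_, matched_day = NA_real_)]
```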

data.table: How do I pass a character vector to a function get data.table to treat its contents as column names?

Submitted by 有些话、适合烂在心里 on 2020-06-24 14:18:13
Question: Here is a data.table:

    library(data.table)
    DT <- data.table(airquality)

This example produces the output I want:

    DT[, `:=`(New_Ozone = log(Ozone), New_Wind = log(Wind))]

How can I write a function log_those_columns such that the following code snippet produces the same result?

    old_names <- c("Ozone", "Wind")
    new_names <- c("New_Ozone", "New_Wind")
    log_those_columns(DT, old_names, new_names)

Note that I need old_names and new_names to be flexible enough to contain any number of columns. (I see from …
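A sketch of the standard idiom for this: wrap the character vector of target names in parentheses on the left of := and compute over .SD restricted with .SDcols, which works for any number of columns:

```r
# (new_names) := lapply(.SD, log) assigns the logged columns under the
# names held in the character vector new_names.
library(data.table)

log_those_columns <- function(dt, old_names, new_names) {
  dt[, (new_names) := lapply(.SD, log), .SDcols = old_names]
  invisible(dt[])
}

DT <- data.table(airquality)
log_those_columns(DT, c("Ozone", "Wind"), c("New_Ozone", "New_Wind"))
```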

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

Submitted by 允我心安 on 2020-06-21 21:23:27
Question: I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    DT = data.table(Loc1 = rep(c("G","T"), each=5),
                    Loc2 = c("C","A"),
                    Loc3 = c("C","G","G","G","C","G","G","G","G","G"),
                    Group = c(rep("G1",3), rep("G2",4), rep("G3",3)))
    for(i in 1:3) set(DT, sample(10, 2), i, NA)
    > DT
       Loc1 Loc2 Loc3 Group
    1:    G   NA    C    G1
    2:    G    A    G    G1
    3:    G    C    G    G1
    4:   NA …
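A sketch using melt() to reshape the loci to long form and then tabulating proportions per group and locus. The question's random NA injection is omitted here so the example stays deterministic; na.rm = TRUE in melt() handles any NAs the same way:

```r
# Melt loci to long form, drop NAs, then compute each allele's proportion
# within every (Group, Locus) cell.
library(data.table)

DT <- data.table(Loc1 = rep(c("G", "T"), each = 5),
                 Loc2 = c("C", "A"),
                 Loc3 = c("C", "G", "G", "G", "C", "G", "G", "G", "G", "G"),
                 Group = c(rep("G1", 3), rep("G2", 4), rep("G3", 3)))

long <- melt(DT, id.vars = "Group", variable.name = "Locus",
             value.name = "Allele", na.rm = TRUE)

freqs <- long[, .N, by = .(Group, Locus, Allele)]
freqs[, prop := N / sum(N), by = .(Group, Locus)]
```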