data.table

Counting overlaps as expected with R data.table foverlaps() or IRanges

Submitted by 六眼飞鱼酱① on 2020-06-27 12:44:55
Question: I'm having difficulty counting overlaps of intervals as I would expect. Here is an R data.table with intervals defined by start and end:

    > library(data.table)
    > dt1 = data.table(start=c(1, 5, 3), end=c(10, 15, 8))
    > print(dt1)
       start end
    1:     1  10
    2:     5  15
    3:     3   8

Here is how I would count overlaps for these intervals, from 0 to 20:

    [0, 1]: 0 (there are no intervals here)
    [1, 3]: 1 (there is only one interval here, [1, 10])
    [3, 5]: 2 (two intervals here, [1, 10] and [3, 8])
    [5, 8]: 3 …
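A minimal sketch of one way to get these counts with foverlaps(). The query bins are taken from the example; the epsilon shrink is an assumption, added because foverlaps() treats intervals as closed, so bins that merely touch an interval at an endpoint (e.g. [0, 1] against [1, 10]) would otherwise count as overlapping:

```r
# Count how many intervals in dt1 cover each query bin with foverlaps().
library(data.table)

dt1  <- data.table(start = c(1, 5, 3), end = c(10, 15, 8))
bins <- data.table(start = c(0, 1, 3, 5), end = c(1, 3, 5, 8))

setkey(dt1, start, end)   # foverlaps() requires y to be keyed on the interval

# foverlaps() uses closed intervals; shrinking the bin ends by a small
# epsilon gives the half-open behaviour the expected counts imply.
eps <- 1e-9
q  <- copy(bins)[, end := end - eps]
ov <- foverlaps(q, dt1, type = "any", which = TRUE)

# Unmatched bins appear with yid = NA; count real matches per bin.
counts <- ov[, .(n = sum(!is.na(yid))), by = xid]
```

IRanges::countOverlaps() from Bioconductor, which the title also mentions, uses closed integer intervals as well, so the same endpoint caveat applies there.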

Count rows matching a criteria relative to current row

Submitted by 对着背影说爱祢 on 2020-06-27 11:17:30
Question: I have a dataframe structured like this (but it actually has ~400k rows):

    library(data.table)
    df <- fread("
    id     start      end
    174095 2018-12-19 2018-12-31
    227156 2018-12-19 2018-12-31
    210610 2018-04-13 2018-09-27
    27677  2018-04-12 2018-04-26
    370474 2017-07-13 2017-08-19
    303693 2017-02-20 2017-04-09
    74744  2016-10-03 2016-11-05
    174095 2018-12-01 2018-12-20
    27677  2018-03-01 2018-05-29
    111111 2018-01-01 2018-01-31
    111111 2018-11-11 2018-12-31")

(Edited, thanks to Uwe.) For each row, I want to count how …
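The question text is cut off, so the exact criterion is unknown. As an illustrative guess based on the title, this sketch counts, for each row, how many other rows have a date range overlapping its own, using a non-equi self-join:

```r
# Hypothetical reading of the truncated question: per row, count the other
# rows whose [start, end] overlaps this row's range (across all ids; add
# `id` to `on` if the criterion should be within the same id).
library(data.table)

df <- fread("
id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")
df[, `:=`(start = as.IDate(start), end = as.IDate(end))]

# Row j overlaps row i when j.start <= i.end and j.end >= i.start.
cnt <- df[df, on = .(start <= end, end >= start), .N, by = .EACHI]
df[, n_overlap := cnt$N - 1L]   # minus 1 to exclude the row itself
```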

R data.table: what is the fastest way to intersect a data.table by multiple columns by keys and groups

Submitted by 泄露秘密 on 2020-06-26 14:27:57
Question: MAJOR EDIT to clarify, as the existing answers are wrong. I have a data.table with group columns (split_by), key columns (key_by) and trait id columns (intersect_by). Within each group defined by split_by, I want to keep only the rows whose trait ids are shared by all the keys present in that group. For example:

    dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2),
                     group_id1 = 1, group_id2 = c(1:2, 2:1, 1:2),
                     trait_id1 = 1, trait_id2 = 2:1)
    setkey(dt, group_id1, group_id2, trait_id1, trait_id2)
    dt
       id key1 key2 group_id1 …
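A sketch of one approach: count the distinct key combinations per group and per (group, trait), and keep rows where the two counts match, meaning the trait occurs with every key present in its group. Column names follow the example; real data would substitute the split_by/key_by/intersect_by sets:

```r
# Keep rows whose trait combination appears for every distinct key
# combination within its group (a set intersection done by counting).
library(data.table)

dt <- data.table(id = 1:6, key1 = 1, key2 = c(1:2, 2),
                 group_id1 = 1, group_id2 = c(1:2, 2:1, 1:2),
                 trait_id1 = 1, trait_id2 = 2:1)

# Distinct key combinations present in each group ...
dt[, nk_group := uniqueN(.SD), by = .(group_id1, group_id2),
   .SDcols = c("key1", "key2")]
# ... and in each (group, trait) cell.
dt[, nk_trait := uniqueN(.SD),
   by = .(group_id1, group_id2, trait_id1, trait_id2),
   .SDcols = c("key1", "key2")]

res <- dt[nk_trait == nk_group]
```

Here row id 4 is dropped: its trait (1, 1) occurs for only one of the two key combinations in its group.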

Skipping rows until row with a certain value

Submitted by 主宰稳场 on 2020-06-26 11:32:22
Question: I need to read a .txt file from a URL, but I would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:

    ... (number of rows)
    ... (number of rows)
    ... (number of rows)
    DATE        VALUE
    1945-01-01  144855
    1946-01-01  138515
    1947-01-01  136405
    1948-01-01  135486
    1949-01-01  142455

I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards …
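fread() can do this directly: its skip argument also accepts a search string and starts reading at the first line that contains it. A self-contained sketch with a local stand-in for the FRED file (the preamble lines below are invented for illustration):

```r
# fread(skip = "string") jumps to the first line containing the string.
# The temp file stands in for the FRED download; its preamble is made up.
library(data.table)

tmp <- tempfile(fileext = ".txt")
writeLines(c("Title: Some Series",
             "Notes: preamble to be skipped",
             "DATE VALUE",
             "1945-01-01 144855",
             "1946-01-01 138515"), tmp)

dat <- fread(tmp, skip = "DATE")
# For the real file:
# fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", skip = "DATE")
```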

What is the fastest way and fastest format for loading large data sets into R [duplicate]

Submitted by 风流意气都作罢 on 2020-06-25 07:02:34
Question: This question already has answers here: Quickly reading very large tables as dataframes (11 answers). Closed 4 years ago. I have a large dataset (about 13 GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow, but every load after that should be as fast as possible. What is the fastest way, and the fastest format, from which to load a data set? My suspicion is that the optimal choice is something like saveRDS(obj, file = 'bigdata.Rda', …
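A small benchmarking sketch of two common candidates: uncompressed RDS versus data.table's fwrite/fread on CSV. The toy table size is an assumption (packages such as fst and qs are also frequent answers to this question and are worth adding to a real comparison):

```r
# Compare reload times of uncompressed RDS vs. CSV via fread().
# The toy table is far smaller than 13 GB; it only shows the pattern.
library(data.table)

obj <- data.table(x = runif(1e5), y = sample(letters, 1e5, replace = TRUE))

rds <- tempfile(fileext = ".rds")
csv <- tempfile(fileext = ".csv")

saveRDS(obj, rds, compress = FALSE)  # compress = FALSE trades disk for load speed
fwrite(obj, csv)

t_rds <- system.time(a <- readRDS(rds))["elapsed"]
t_csv <- system.time(b <- fread(csv))["elapsed"]
```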

R rolling join two data.tables with error margin on join

Submitted by ε祈祈猫儿з on 2020-06-25 06:33:29
Question: Note: this question is a copy of this one, but with different wording, and a suggestion to use data.table instead of dplyr. I have two datasets that contain scores for different patients at multiple measuring moments, like so:

    dt1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                      "Days" = c(0,10,25,340,100,538),
                      "Score" = c(NA,2,3,99,5,6),
                      stringsAsFactors = FALSE)
    dt2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2", …
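A sketch of the usual data.table pattern for this: a rolling join on Days within ID using roll = "nearest", followed by discarding matches outside a tolerance. The dt2 below is invented, since the excerpt cuts off mid-definition, and the 30-day margin is an assumption:

```r
# Match each dt1 measurement to the nearest dt2 measurement per patient,
# then blank out matches farther apart than `margin` days.
# dt2 and margin are hypothetical; the excerpt truncates before dt2 ends.
library(data.table)

dt1 <- data.table(ID = c("patient1", "patient1", "patient2"),
                  Days = c(0, 10, 100), Score = c(NA, 2, 5))
dt2 <- data.table(ID = c("patient1", "patient1", "patient2"),
                  Days = c(2, 12, 130), Score2 = c(1, 4, 7))

margin <- 30                       # allowed error margin, in days
dt2[, matched_day := Days]         # keep a copy: the join overwrites Days
setkey(dt2, ID, Days)

res <- dt2[dt1, roll = "nearest"]  # nearest Days within each ID
res[abs(matched_day - Days) > margin,
    `:=`(Score2 = NA_real_, matched_day = NA_real_)]
```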

data.table: How do I pass a character vector to a function get data.table to treat its contents as column names?

Submitted by 有些话、适合烂在心里 on 2020-06-24 14:18:13
Question: Here is a data.table:

    library(data.table)
    DT <- data.table(airquality)

This example produces the output I want:

    DT[, `:=`(New_Ozone = log(Ozone), New_Wind = log(Wind))]

How can I write a function log_those_columns such that the following code snippet produces the same result?

    old_names <- c("Ozone", "Wind")
    new_names <- c("New_Ozone", "New_Wind")
    log_those_columns(DT, old_names, new_names)

Note that I need old_names and new_names to be flexible enough to contain any number of columns. (I see from …
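A sketch of the standard idiom for this: wrap the character vector of target names in parentheses on the left of := and compute over .SD restricted with .SDcols, which works for any number of columns:

```r
# (new_names) := lapply(.SD, log) assigns the logged columns under the
# names held in the character vector new_names.
library(data.table)

log_those_columns <- function(dt, old_names, new_names) {
  dt[, (new_names) := lapply(.SD, log), .SDcols = old_names]
  invisible(dt[])
}

DT <- data.table(airquality)
log_those_columns(DT, c("Ozone", "Wind"), c("New_Ozone", "New_Wind"))
```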

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

Submitted by 允我心安 on 2020-06-21 21:23:27
Question: I have a data.table of allele identities (rows are individuals, columns are loci), grouped by a separate column. I want to calculate allele frequencies (proportions) for each locus efficiently, by group. An example data table:

    DT = data.table(Loc1 = rep(c("G","T"), each=5),
                    Loc2 = c("C","A"),
                    Loc3 = c("C","G","G","G","C","G","G","G","G","G"),
                    Group = c(rep("G1",3), rep("G2",4), rep("G3",3)))
    for(i in 1:3) set(DT, sample(10, 2), i, NA)
    > DT
       Loc1 Loc2 Loc3 Group
    1:    G   NA    C    G1
    2:    G    A    G    G1
    3:    G    C    G    G1
    4:   NA …
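A sketch using melt() to reshape the loci to long form and then tabulating proportions per group and locus. The question's random NA injection is omitted here so the example stays deterministic; na.rm = TRUE in melt() handles any NAs the same way:

```r
# Melt loci to long form, drop NAs, then compute each allele's proportion
# within every (Group, Locus) cell.
library(data.table)

DT <- data.table(Loc1 = rep(c("G", "T"), each = 5),
                 Loc2 = c("C", "A"),
                 Loc3 = c("C", "G", "G", "G", "C", "G", "G", "G", "G", "G"),
                 Group = c(rep("G1", 3), rep("G2", 4), rep("G3", 3)))

long <- melt(DT, id.vars = "Group", variable.name = "Locus",
             value.name = "Allele", na.rm = TRUE)

freqs <- long[, .N, by = .(Group, Locus, Allele)]
freqs[, prop := N / sum(N), by = .(Group, Locus)]
```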