data.table

fread() reads big numbers as 4.076092e-309

好久不见 · Submitted on 2020-01-05 01:35:24
Question: The original numbers are integers ranging from 825010211307012 to 825010304926185. fread() turns all of them into 4.076092e-309. read.table() works correctly, but I need to read large data, so I can't use it. How can I correct this error?

Answer 1: If you install the bit64 package, then fread() will use it to read these large integers.

Before:

    > fread("./bignums.txt")
                  V1
    1: 4.076092e-309
    2: 4.076092e-309

Do the magic:

    > install.packages("bit64")

Then:

    > fread("./bignums.txt")
                    V1
    1: 825010211307012
    2:
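A minimal sketch of the bit64 behaviour described in the answer, using a temporary file instead of the asker's ./bignums.txt so it is self-contained:

```r
library(data.table)
library(bit64)  # when bit64 is available, fread() reads 15+ digit integers as integer64

# Write two of the large integers from the question to a temp file
tmp <- tempfile(fileext = ".txt")
writeLines(c("825010211307012", "825010304926185"), tmp)

dt <- fread(tmp)
class(dt$V1)  # "integer64" once bit64 is installed, instead of a corrupted double
```

fread() also has an `integer64` argument ("integer64", "double", or "character") to control this explicitly.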

Fast read different types of data with same command, better separator guessing [duplicate]

怎甘沉沦 · Submitted on 2020-01-04 14:18:20
Question: This question already has answers here: Reading aligned column data with fread (2 answers). Closed last year.

I have LD data, sometimes the raw output file from PLINK, as below (notice the spaces used to make the output pretty, and notice the leading and trailing spaces, too):

    write.table(read.table(text="
     CHR_A       BP_A      SNP_A  CHR_B       BP_B       SNP_B         R2
         1  154834183  rs1218582      1  154794318   rs9970364  0.0929391
         1  154834183  rs1218582      1  154795033  rs56744813    0.10075
         1  154834183  rs1218582      1  154797272  rs16836414   0.106455
         1
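One workable sketch for such space-padded, aligned output: collapse runs of spaces to a single separator and trim the padding before handing the text to fread(). The sample data below is a shortened, assumed version of the PLINK output in the question:

```r
library(data.table)

txt <- " CHR_A       BP_A     SNP_A CHR_B       BP_B      SNP_B        R2
     1  154834183 rs1218582     1  154794318  rs9970364 0.0929391
     1  154834183 rs1218582     1  154795033 rs56744813   0.10075"

# Squeeze runs of spaces into one separator and strip the leading/trailing
# padding, then let fread() guess the rest from the cleaned text.
clean <- gsub(" +", " ", trimws(strsplit(txt, "\n")[[1]]))
dt <- fread(text = paste(clean, collapse = "\n"))
```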

efficiently finding first nonzero element (corresponding column) of a data table

[亡魂溺海] · Submitted on 2020-01-04 11:11:48
Question: There are some answers on Stack Overflow to this type of question, but they are all inefficient and do not scale well. To reproduce it, suppose I have data that look like this:

    tempmat = matrix(c(1,1,0,4, 1,0,0,4, 0,1,0,4, 0,1,1,4, 0,1,0,5), 5, 4, byrow=T)
    tempmat = rbind(rep(0,4), tempmat)
    tempmat = data.table(tempmat)
    names(tempmat) = paste0('prod1vint', 1:4)

This is what the data look like, although the real data are MUCH bigger, so the solution cannot be an "apply" or row-wise based approach.

    > tempmat
       prod1vint1
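One fully vectorized approach (not from the truncated answer; an assumed solution) is max.col(), which runs in C and avoids any row-wise apply:

```r
library(data.table)

tempmat <- matrix(c(1,1,0,4, 1,0,0,4, 0,1,0,4, 0,1,1,4, 0,1,0,5), 5, 4, byrow = TRUE)
tempmat <- rbind(rep(0, 4), tempmat)
tempmat <- data.table(tempmat)
setnames(tempmat, paste0("prod1vint", 1:4))

# On a 0/1 "is nonzero" matrix, max.col(ties.method = "first") returns
# the first column holding the row maximum, i.e. the first nonzero column.
nz <- as.matrix(tempmat) != 0
first_nz <- max.col(1L * nz, ties.method = "first")
# All-zero rows tie everywhere and would report column 1, so mask them:
first_nz[rowSums(nz) == 0] <- NA_integer_
first_nz
```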

Aggregate and Weighted Mean for multiple columns in R

断了今生、忘了曾经 · Submitted on 2020-01-04 06:50:33
Question: The question is basically the same as this one: Aggregate and Weighted Mean in R. But I want to compute it on several columns, using data.table, as I have millions of rows. So something like this:

    set.seed(42) # fix seed so that you get the same results
    dat <- data.frame(assetclass=sample(LETTERS[1:5], 20, replace=TRUE),
                      tax=rnorm(20), tax2=rnorm(20),
                      assets=1e7+1e7*runif(20), assets2=1e6+1e7*runif(20))
    DT <- data.table(dat)

I can compute the weighted mean on one column, assets, like this:

    DT[
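A sketch of one way to extend this to several columns at once, assuming the pairing tax/assets and tax2/assets2 (the question does not state which weight goes with which value column):

```r
library(data.table)

set.seed(42)
dat <- data.frame(assetclass = sample(LETTERS[1:5], 20, replace = TRUE),
                  tax = rnorm(20), tax2 = rnorm(20),
                  assets = 1e7 + 1e7 * runif(20),
                  assets2 = 1e6 + 1e7 * runif(20))
DT <- data.table(dat)

# Map() pairs each value column with its weight column inside every group,
# so weighted.mean() is called once per pair per assetclass.
res <- DT[, Map(weighted.mean,
                .SD[, .(tax, tax2)],
                .SD[, .(assets, assets2)]),
          by = assetclass]
```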

H2O running slower than data.table R

人盡茶涼 · Submitted on 2020-01-04 05:31:29
Question: How is it possible that storing data into an H2O frame is slower than into a data.table?

    # Packages used: "h2o" and "data.table"
    library(h2o)
    library(data.table)
    # create the matrices
    matrix1 <- data.table(matrix(rnorm(1000*1000), ncol=1000, nrow=1000))
    matrix2 <- h2o.createFrame(1000, 1000)
    h2o.init(nthreads=-1)
    # data.table element-wise store
    for(i in 1:1000){ matrix1[i,1] <- 3 }
    # H2O frame element-wise store
    for(i in 1:1000){ matrix2[i,1] <- 3 }

Thanks!

Answer 1: H2O is a client/server architecture. (See http://docs.h2o.ai
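Element-wise loops are slow in both systems: each H2O assignment is a round-trip to the server, and each `matrix1[i, 1] <- 3` re-dispatches the full `[.data.table` machinery. On the data.table side, a sketch of the idiomatic fix is one vectorized set() call instead of the loop:

```r
library(data.table)

matrix1 <- data.table(matrix(rnorm(1000 * 1000), ncol = 1000, nrow = 1000))

# set() updates the column by reference in a single call, avoiding 1000
# dispatches of `[.data.table` (value is recycled across all rows).
set(matrix1, j = 1L, value = 3)
```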

How to do row wise operations on .SD columns in data.table

早过忘川 · Submitted on 2020-01-04 05:28:10
Question: Although I've figured this out before, I still find myself searching (and unable to find) this syntax on Stack Overflow, so...

I want to do row-wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember whether the operation needs an sapply, an lapply, or whether it belongs inside the brackets of .SD. As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average
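A sketch of the row-wise .SD pattern, on a hypothetical gradebook shaped like the one the question describes (column names are assumptions):

```r
library(data.table)

set.seed(1)
scores <- data.table(student  = 1:10,
                     q1_exam1 = runif(10, 60, 100),
                     q1_exam2 = runif(10, 60, 100),
                     q1_final = runif(10, 60, 100))

# No sapply/lapply needed for a row-wise mean: .SD is itself a data.table
# restricted to the .SDcols columns, and rowMeans() accepts it directly.
scores[, q1_avg := rowMeans(.SD), .SDcols = patterns("^q1_")]
```

lapply(.SD, ...) is the column-wise counterpart; the row-wise case hands all of .SD to one function such as rowMeans() or rowSums().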

Memory and Performance using grepl on large data.table [duplicate]

你。 · Submitted on 2020-01-04 05:27:12
Question: This question already has answers here: Using grep to subset rows from a data.table, comparing row content (2 answers). Closed 4 years ago.

I'm performing a simple command in R over a large dataset, and the result is slow and uses too much memory. Here's an example using two rows, although my real dataset has 154 million rows:

    library(data.table)
    Dt <- data.table(title1=c("The coolest song ever", "The greatest music in the world"),
                     title2=c("coolest song", "greatest music"))
    Dt$Match <- sapply
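A sketch of a lighter-weight row-by-row match (an assumed approach, since the answer is truncated): grepl() is not vectorized over patterns, so mapply() pairs them up, and `fixed = TRUE` skips regex compilation, which cuts both time and memory for plain substring tests:

```r
library(data.table)

Dt <- data.table(title1 = c("The coolest song ever", "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

# Pairwise substring test: does title2[i] occur literally inside title1[i]?
Dt[, Match := mapply(grepl, title2, title1,
                     MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)]
```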

create a filter expression (i) dynamically in data.table

不问归期 · Submitted on 2020-01-04 02:43:07
Question: Given a data.table

    library(data.table)
    dd <- data.table(x=1:10, y=10:1, z=20:20)

I can filter it using

    dd[x %in% c(1, 3) & z %in% c(12, 20)]
       x  y  z
    1: 1 10 20
    2: 3  8 20

Now I would like to create the same filter dynamically. This is what I have tried so far:

    cond <- list(x=c(1,3), z=c(12,20))
    vars <- names(cond)
    ## dd[get(vars[[1]]) %in% cond[[1]] & get(vars[[2]]) %in% cond[[2]]]
    EVAL = function(...){
      expr <- parse(text=paste0(...))
      print(expr)
      eval(expr)
    }
    dd[ EVAL(vars, " %in% ", cond, collapse="
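One sketch that avoids string-pasting entirely: build one logical vector per condition and AND them together, for any number of entries in `cond`:

```r
library(data.table)

dd <- data.table(x = 1:10, y = 10:1, z = 20:20)
cond <- list(x = c(1, 3), z = c(12, 20))

# mget() looks up the named columns inside j; Map() applies %in% per
# condition; Reduce(`&`, ...) combines them into one filter vector.
keep <- dd[, Reduce(`&`, Map(`%in%`, mget(names(cond)), cond))]
dd[keep]
```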

Split Data Every N Columns and rbind Using R

和自甴很熟 · Submitted on 2020-01-04 02:08:46
Question: I have a data frame. In my dataset, a, a1 and a2 are the exact same variable; when you reuse a name in R, it automatically appends a number to the end to keep the names unique.

    df = data.frame(a = rnorm(4), b = rnorm(4), c = rnorm(4),
                    a1 = rnorm(4), b1 = rnorm(4), c1 = rnorm(4),
                    a2 = rnorm(4), b2 = rnorm(4), c2 = rnorm(4),
                    date = seq(as.Date("2019-05-05"), as.Date("2019-05-08"), 1))
    print(df)
               a         b         c        a1        b1        a2         b2 c2 date
    1 -1.0938097 1.3948486 1.2805904 1.6187439 1.0200681 -1.4335761 -0.4583526 0
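A sketch of the usual data.table answer to "stack every Nth block of columns": melt() with patterns() collects each triplet (a/a1/a2, b/b1/b2, c/c1/c2) into a single long column, which is equivalent to splitting every 3 columns and rbind-ing:

```r
library(data.table)

set.seed(7)
df <- data.frame(a = rnorm(4), b = rnorm(4), c = rnorm(4),
                 a1 = rnorm(4), b1 = rnorm(4), c1 = rnorm(4),
                 a2 = rnorm(4), b2 = rnorm(4), c2 = rnorm(4),
                 date = seq(as.Date("2019-05-05"), as.Date("2019-05-08"), 1))

# Named patterns: each regex selects one family of repeated columns, and
# the name becomes the stacked value column in the result.
long <- melt(as.data.table(df),
             id.vars = "date",
             measure.vars = patterns(a = "^a", b = "^b", c = "^c"))
```

Here `variable` records which block (1, 2, or 3) each row came from.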

Conditional keyed join/update _and_ update a flag column for matches

*爱你&永不变心* · Submitted on 2020-01-03 19:03:13
Question: This is very similar to the question @DavidArenburg asked about conditional keyed joins, with an additional bugbear that I can't seem to suss out. Basically, in addition to a conditional join, I want to define a flag saying at which step of the matching process the match occurred; my problem is that I can only get the flag to be defined for all rows, not just the matched ones. Here's what I hope is a minimal working example:

    DT = data.table(
      name = c("Joe", "Joe", "Jim", "Carol", "Joe",
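A sketch of the core mechanic (with hypothetical, simplified tables, since the question's example is truncated): in an update join, `:=` assigns only to the rows of DT that matched, so a step flag set in the same call stays NA on unmatched rows:

```r
library(data.table)

DT <- data.table(name = c("Joe", "Jim", "Carol"),
                 val = NA_integer_, step = NA_integer_)
lookup <- data.table(name = c("Joe", "Carol"), val = c(10L, 20L))

# Update join: only matched rows receive i.val and the flag for this pass.
DT[lookup, on = "name", `:=`(val = i.val, step = 1L)]
```

Repeating the call with a second lookup table and `step = 2L` on the rows where `is.na(step)` extends this to a multi-pass match.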