data.table

Creating new columns, based on selected columns, that calculate the ratio by group

血红的双手。 Submitted on 2020-06-17 13:23:05
Question: My data looks as follows:

DF <- structure(list(
  No_Adjusted_Gross_Income = structure(c(1L, 1L, 2L, 2L, 3L, 3L),
                                       .Label = c("A", "B", "C"), class = "factor"),
  NoR_from_1_to_5000 = c(1035373, 4272260, 1124098, 1035373, 4272260, 1124098),
  NoR_from_5000_to_10000 = c(319540, 4826042, 1959866, 319540, 4826042, 1959866),
  AGI_from_1_to_5000 = c(2588950186.5, 10682786130, 2810807049, 2588950186.5, 10682786130, 2810807049),
  AGI_from_5000_to_10000 = c(2396550000, 36195315000, 14698995000, 2396550000,
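The excerpt above is cut off, so the exact target is a guess; assuming the goal is a ratio_* column for each matching NoR_*/AGI_* bracket pair, one data.table sketch (the one-bracket toy table below is illustrative, not the asker's full data):

```r
library(data.table)

# Illustrative data following the question's naming pattern (not the full table)
DF <- data.table(
  No_Adjusted_Gross_Income = factor(c("A", "B", "C")),
  NoR_from_1_to_5000       = c(1035373, 4272260, 1124098),
  AGI_from_1_to_5000       = c(2588950186.5, 10682786130, 2810807049)
)

# Pair each NoR_* column with its AGI_* counterpart and add a ratio_* column
nor_cols   <- grep("^NoR_",  names(DF), value = TRUE)
agi_cols   <- sub("^NoR_", "AGI_",   nor_cols)
ratio_cols <- sub("^NoR_", "ratio_", nor_cols)
DF[, (ratio_cols) := Map(`/`, mget(agi_cols), mget(nor_cols))]
```

If the ratio should instead be computed within groups, wrap the columns in sum() and add by = No_Adjusted_Gross_Income.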

R - multiple criteria search

你。 Submitted on 2020-06-17 11:05:25
Question: I have the following issue and I hope you can help me. I have a huge database (which I cannot disclose), structured as follows: 5 million observations and 7 variables, of which three are of interest here: Code_ID_Buy, Code_ID_Sell, and Date. I would like another variable, called new, which takes the value 0 in line i if there exists an observation k with Code_ID_Buy[i] = Code_ID_Buy[k], Code_ID_Sell[i] = Code_ID_Sell[k], and Date[i] after Date[k]; if not, I would like new[i] = 1.
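Under that spec, a row gets new = 1 exactly when no strictly earlier row shares the same buy/sell pair, i.e. when its Date equals the group minimum. A sketch on made-up data (column names taken from the question):

```r
library(data.table)

# Hypothetical toy version of the described table
dt <- data.table(
  Code_ID_Buy  = c("X", "X", "X", "Y"),
  Code_ID_Sell = c("Z", "Z", "W", "Z"),
  Date         = as.Date(c("2020-01-01", "2020-03-01", "2020-02-01", "2020-01-15"))
)

# new = 1 when the row's Date is the earliest for its buy/sell pair,
# so no observation k with the same codes precedes it
dt[, new := as.integer(Date == min(Date)), by = .(Code_ID_Buy, Code_ID_Sell)]
```

This also handles ties: rows sharing the earliest Date all get 1, since neither is strictly after the other.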

data.table replace NA with mean for multiple columns and by id

一曲冷凌霜 Submitted on 2020-06-14 06:45:09
Question: If I have the following data.table:

dat <- data.table("id"   = c(1,1,1,1,2,2,2,2),
                  "var1" = c(NA,1,2,2,1,1,2,2),
                  "var2" = c(4,4,4,4,5,5,NA,4),
                  "var3" = c(4,4,4,NA,5,5,5,4))

   id var1 var2 var3
1:  1   NA    4    4
2:  1    1    4    4
3:  1    2    4    4
4:  1    2    4   NA
5:  2    1    5    5
6:  2    1    5    5
7:  2    2   NA    5
8:  2    2    4    4

How can I replace the missing values with the mean of each column within id? In my actual data I have many variables, of which only some should be replaced, so how could this be done in a general way so that, for example, it is not
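One way, assuming the columns to fill are listed explicitly in cols (so every other variable is left untouched): group by id and fill NAs with the per-group mean via .SDcols.

```r
library(data.table)

dat <- data.table(id   = c(1,1,1,1,2,2,2,2),
                  var1 = c(NA,1,2,2,1,1,2,2),
                  var2 = c(4,4,4,4,5,5,NA,4),
                  var3 = c(4,4,4,NA,5,5,5,4))

# Only the columns named here are modified
cols <- c("var1", "var2", "var3")
dat[, (cols) := lapply(.SD, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))),
    by = id, .SDcols = cols]
```

For example, the NA in var1 for id 1 becomes mean(c(1, 2, 2)) = 5/3.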

Set R data.table row order by chaining 2 columns

送分小仙女□ Submitted on 2020-06-11 21:32:15
Question: I'm trying to figure out how to order an R data.table based on the chaining of 2 columns. Here's my sample data.table:

dt <- data.table(id   = c('A', 'A', 'A', 'A', 'A'),
                 col1 = c(7521, 0, 7915, 5222, 5703),
                 col2 = c(7907, 5703, 8004, 7521, 5222))

   id col1 col2
1:  A 7521 7907
2:  A    0 5703
3:  A 7915 8004
4:  A 5222 7521
5:  A 5703 5222

I need the row order to start with col1 = 0. The col1 value in row 2 should be equal to the value of col2 in the preceding row, and so on. Additionally, there
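The excerpt is truncated, but the described ordering is a linked-list walk: start at col1 == 0 and repeatedly jump to the row whose col1 equals the current row's col2. A sketch with a hypothetical helper, written for a single id; rows that never link into the chain (like the 7915 row here) are appended at the end:

```r
library(data.table)

dt <- data.table(id   = c('A', 'A', 'A', 'A', 'A'),
                 col1 = c(7521, 0, 7915, 5222, 5703),
                 col2 = c(7907, 5703, 8004, 7521, 5222))

# Walk the chain: each step finds the row whose col1 matches the last col2
chain_order <- function(d, start = 0) {
  idx <- integer(0)
  remaining <- seq_len(nrow(d))
  cur <- start
  repeat {
    j <- remaining[match(cur, d$col1[remaining])]
    if (is.na(j)) break            # chain broken: no row continues it
    idx <- c(idx, j)
    remaining <- setdiff(remaining, j)
    cur <- d$col2[j]
  }
  c(idx, remaining)                # unlinked rows go last
}

dt_ordered <- dt[chain_order(dt)]
```

On the sample data this yields the col1 sequence 0, 5703, 5222, 7521, with the unlinked 7915 row last.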

R combining duplicate rows by ID with different column types in a dataframe

こ雲淡風輕ζ Submitted on 2020-06-01 05:59:27
Question: I have a dataframe with a column id as an identifier and some other columns of different types (factors and numerics). It looks like this:

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean
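The question cuts off, but a common reading is: collapse numeric columns to their mean and factors to the first non-missing level. A sketch under that assumption (collapse_one is a hypothetical helper, not from the question):

```r
library(data.table)

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine",
                                     "keine", NA, NA, NA, "rot", "rot")),
                 gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

# Numerics collapse to their mean (NAs dropped);
# factors collapse to the first non-NA level, or NA if none exists
collapse_one <- function(x) {
  if (is.numeric(x)) return(mean(x, na.rm = TRUE))
  v <- x[!is.na(x)]
  if (length(v) > 0) v[1] else x[NA_integer_]
}
res <- as.data.table(df)[, lapply(.SD, collapse_one), by = id]
```

Dispatching on type inside one helper is what lets a single lapply(.SD, ...) handle the mixed factor/numeric columns.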

How to conditionally count and record if a sample appears in rows of another dataset?

家住魔仙堡 Submitted on 2020-05-30 09:44:36
Question: I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count the IDs in dataset1 which appear in either of the 2 interaction columns in dataset2, and also to record the interacting/matching IDs in a 3rd column.

Dataset1:

ID
1
2
3

Dataset2:

Interactor1 Interactor2
1           5
2           3
1           10

Output:

ID InteractionCount Interactors
1  2                5, 10
2  1                3
3  1                2

So the output contains all IDs of dataset1 and a count of those IDs that also appear in either
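One approach, assuming both columns should be searched symmetrically: stack dataset2 in both directions so every ID is paired with its partner, then join to dataset1 and aggregate.

```r
library(data.table)

dataset1 <- data.table(ID = c(1, 2, 3))
dataset2 <- data.table(Interactor1 = c(1, 2, 1),
                       Interactor2 = c(5, 3, 10))

# Stack the pairs in both directions so each ID sees its partner
long <- rbind(dataset2[, .(ID = Interactor1, partner = Interactor2)],
              dataset2[, .(ID = Interactor2, partner = Interactor1)])

# Join to dataset1 and aggregate; unmatched IDs get a count of 0
out <- long[dataset1, on = "ID"][
  , .(InteractionCount = sum(!is.na(partner)),
      Interactors      = toString(partner[!is.na(partner)])),
  by = ID]
```

On the sample data this reproduces the desired output, including "5, 10" for ID 1.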

Join results in more than 2^31 rows (internal vecseq reached physical limit)

风流意气都作罢 Submitted on 2020-05-27 06:19:22
Question: I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.) Here is what my code looks like:

library(parallel)
library(data.table)

data1: several million rows and 3 columns. The columns are tag, prod and v. There are 750K unique values of tag, anywhere from 1 to 1000 prods per tag, and 5000 possible values for prod. v takes any positive real value.

setkey(data1, tag)
merge(data1, data1, allow
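The error means the join result would exceed 2^31 rows, the vector-length ceiling this data.table version can address. A self-join on tag produces sum(N^2) rows over the per-tag group sizes, so the blow-up can be predicted before calling merge(); data1 below is a tiny stand-in for the real table:

```r
library(data.table)

# Tiny stand-in for data1; a self-join on tag grows quadratically per tag
data1 <- data.table(tag  = rep(c("a", "b"), times = c(3, 2)),
                    prod = c(1, 2, 3, 1, 2),
                    v    = c(0.5, 1.2, 0.7, 2.0, 0.9))

# Predicted size of merge(data1, data1, by = "tag"): sum of squared group sizes
sizes <- data1[, .N, by = tag]
predicted_rows <- sizes[, sum(as.numeric(N)^2)]   # 3^2 + 2^2 = 13 here
```

If predicted_rows lands above 2^31, aggregating before the join or looping over tags in chunks is the usual way around the vecseq limit.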

How to retrieve column for row-wise maximum value in an R data.table?

我们两清 Submitted on 2020-05-26 19:50:50
Question: I have the following R data.table:

library(data.table)
iris = as.data.table(iris)

> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
...

Let's say I wanted to find the row-wise maximum value in each row, only over a subset of the data.table columns: Sepal.Length, Sepal.Width, Petal
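A sketch using base R's max.col(), which returns the per-row index of the largest value; indexing the column-name vector with it yields the winning column's name, and pmax() gives the value itself:

```r
library(data.table)

dt_iris <- as.data.table(iris)
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# max.col() gives, per row, the position of the largest value among cols;
# ties.method = "first" makes tie handling deterministic
dt_iris[, max_col := cols[max.col(.SD, ties.method = "first")], .SDcols = cols]
dt_iris[, max_val := do.call(pmax, .SD), .SDcols = cols]
```

For the first iris row the result is max_col = "Sepal.Length" with max_val = 5.1.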