data.table

Creating new columns, based on selected columns, that calculate the ratio by group

血红的双手。 Submitted on 2020-06-17 13:23:05
Question: My data looks as follows:

DF <- structure(list(
  No_Adjusted_Gross_Income = structure(c(1L, 1L, 2L, 2L, 3L, 3L),
                                       .Label = c("A", "B", "C"), class = "factor"),
  NoR_from_1_to_5000 = c(1035373, 4272260, 1124098, 1035373, 4272260, 1124098),
  NoR_from_5000_to_10000 = c(319540, 4826042, 1959866, 319540, 4826042, 1959866),
  AGI_from_1_to_5000 = c(2588950186.5, 10682786130, 2810807049, 2588950186.5, 10682786130, 2810807049),
  AGI_from_5000_to_10000 = c(2396550000, 36195315000, 14698995000, 2396550000,
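The excerpt above is cut off, so the exact target is a guess; assuming the goal is a ratio_* column for each matching NoR_*/AGI_* bracket pair, one data.table sketch (the one-bracket toy table below is illustrative, not the asker's full data):

```r
library(data.table)

# Illustrative data following the question's naming pattern (not the full table)
DF <- data.table(
  No_Adjusted_Gross_Income = factor(c("A", "B", "C")),
  NoR_from_1_to_5000       = c(1035373, 4272260, 1124098),
  AGI_from_1_to_5000       = c(2588950186.5, 10682786130, 2810807049)
)

# Pair each NoR_* column with its AGI_* counterpart and add a ratio_* column
nor_cols   <- grep("^NoR_",  names(DF), value = TRUE)
agi_cols   <- sub("^NoR_", "AGI_",   nor_cols)
ratio_cols <- sub("^NoR_", "ratio_", nor_cols)
DF[, (ratio_cols) := Map(`/`, mget(agi_cols), mget(nor_cols))]
```

If the ratio should instead be computed within groups, wrap the columns in sum() and add by = No_Adjusted_Gross_Income.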

R - multiple criteria search

你。 Submitted on 2020-06-17 11:05:25
Question: I have the following issue and I hope you can help me. I have a huge database (which I cannot disclose), structured as follows: 5 million observations and 7 variables, of which three are of interest here: Code_ID_Buy, Code_ID_Sell, and Date. I would like another variable, called new, which takes the value 0 in line i if there exists an observation k with Code_ID_Buy[i] = Code_ID_Buy[k], Code_ID_Sell[i] = Code_ID_Sell[k], and Date[i] after Date[k]; if not, I would like new[i] = 1.
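Under that spec, a row gets new = 1 exactly when no strictly earlier row shares the same buy/sell pair, i.e. when its Date equals the group minimum. A sketch on made-up data (column names taken from the question):

```r
library(data.table)

# Hypothetical toy version of the described table
dt <- data.table(
  Code_ID_Buy  = c("X", "X", "X", "Y"),
  Code_ID_Sell = c("Z", "Z", "W", "Z"),
  Date         = as.Date(c("2020-01-01", "2020-03-01", "2020-02-01", "2020-01-15"))
)

# new = 1 when the row's Date is the earliest for its buy/sell pair,
# so no observation k with the same codes precedes it
dt[, new := as.integer(Date == min(Date)), by = .(Code_ID_Buy, Code_ID_Sell)]
```

This also handles ties: rows sharing the earliest Date all get 1, since neither is strictly after the other.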

data.table replace NA with mean for multiple columns and by id

一曲冷凌霜 Submitted on 2020-06-14 06:45:09
Question: If I have the following data.table:

dat <- data.table("id"   = c(1,1,1,1,2,2,2,2),
                  "var1" = c(NA,1,2,2,1,1,2,2),
                  "var2" = c(4,4,4,4,5,5,NA,4),
                  "var3" = c(4,4,4,NA,5,5,5,4))

   id var1 var2 var3
1:  1   NA    4    4
2:  1    1    4    4
3:  1    2    4    4
4:  1    2    4   NA
5:  2    1    5    5
6:  2    1    5    5
7:  2    2   NA    5
8:  2    2    4    4

How can I replace the missing values with the mean of each column within id? In my actual data I have many variables, of which only some should be replaced, so how could this be done in a general way so that, for example, it is not
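One way, assuming the columns to fill are listed explicitly in cols (so every other variable is left untouched): group by id and fill NAs with the per-group mean via .SDcols.

```r
library(data.table)

dat <- data.table(id   = c(1,1,1,1,2,2,2,2),
                  var1 = c(NA,1,2,2,1,1,2,2),
                  var2 = c(4,4,4,4,5,5,NA,4),
                  var3 = c(4,4,4,NA,5,5,5,4))

# Only the columns named here are modified
cols <- c("var1", "var2", "var3")
dat[, (cols) := lapply(.SD, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))),
    by = id, .SDcols = cols]
```

For example, the NA in var1 for id 1 becomes mean(c(1, 2, 2)) = 5/3.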

Set R data.table row order by chaining 2 columns

送分小仙女□ Submitted on 2020-06-11 21:32:15
Question: I'm trying to figure out how to order an R data.table based on the chaining of 2 columns. Here's my sample data.table:

dt <- data.table(id   = c('A', 'A', 'A', 'A', 'A'),
                 col1 = c(7521, 0, 7915, 5222, 5703),
                 col2 = c(7907, 5703, 8004, 7521, 5222))

   id col1 col2
1:  A 7521 7907
2:  A    0 5703
3:  A 7915 8004
4:  A 5222 7521
5:  A 5703 5222

I need the row order to start with col1 = 0. The col1 value in row 2 should be equal to the value of col2 in the preceding row, and so on. Additionally, there
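The excerpt is truncated, but the described ordering is a linked-list walk: start at col1 == 0 and repeatedly jump to the row whose col1 equals the current row's col2. A sketch with a hypothetical helper, written for a single id; rows that never link into the chain (like the 7915 row here) are appended at the end:

```r
library(data.table)

dt <- data.table(id   = c('A', 'A', 'A', 'A', 'A'),
                 col1 = c(7521, 0, 7915, 5222, 5703),
                 col2 = c(7907, 5703, 8004, 7521, 5222))

# Walk the chain: each step finds the row whose col1 matches the last col2
chain_order <- function(d, start = 0) {
  idx <- integer(0)
  remaining <- seq_len(nrow(d))
  cur <- start
  repeat {
    j <- remaining[match(cur, d$col1[remaining])]
    if (is.na(j)) break            # chain broken: no row continues it
    idx <- c(idx, j)
    remaining <- setdiff(remaining, j)
    cur <- d$col2[j]
  }
  c(idx, remaining)                # unlinked rows go last
}

dt_ordered <- dt[chain_order(dt)]
```

On the sample data this yields the col1 sequence 0, 5703, 5222, 7521, with the unlinked 7915 row last.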

R combining duplicate rows by ID with different column types in a dataframe

こ雲淡風輕ζ Submitted on 2020-06-01 05:59:27
Question: I have a dataframe with a column id as an identifier and some other columns of different types (factors and numerics). It looks like this:

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean
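The question cuts off, but a common reading is: collapse numeric columns to their mean and factors to the first non-missing level. A sketch under that assumption (collapse_one is a hypothetical helper, not from the question):

```r
library(data.table)

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine",
                                     "keine", NA, NA, NA, "rot", "rot")),
                 gier = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

# Numerics collapse to their mean (NAs dropped);
# factors collapse to the first non-NA level, or NA if none exists
collapse_one <- function(x) {
  if (is.numeric(x)) return(mean(x, na.rm = TRUE))
  v <- x[!is.na(x)]
  if (length(v) > 0) v[1] else x[NA_integer_]
}
res <- as.data.table(df)[, lapply(.SD, collapse_one), by = id]
```

Dispatching on type inside one helper is what lets a single lapply(.SD, ...) handle the mixed factor/numeric columns.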

How to conditionally count and record if a sample appears in rows of another dataset?

家住魔仙堡 Submitted on 2020-05-30 09:44:36
Question: I have a genetic dataset of IDs (dataset1) and a dataset of IDs which interact with each other (dataset2). I am trying to count the IDs in dataset1 which appear in either of the 2 interaction columns in dataset2, and also to record the interacting/matching IDs in a 3rd column.

Dataset1:

ID
1
2
3

Dataset2:

Interactor1 Interactor2
1           5
2           3
1           10

Output:

ID InteractionCount Interactors
1  2                5, 10
2  1                3
3  1                2

So the output contains all IDs of dataset1 and a count of those IDs that also appear in either
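One approach, assuming both columns should be searched symmetrically: stack dataset2 in both directions so every ID is paired with its partner, then join to dataset1 and aggregate.

```r
library(data.table)

dataset1 <- data.table(ID = c(1, 2, 3))
dataset2 <- data.table(Interactor1 = c(1, 2, 1),
                       Interactor2 = c(5, 3, 10))

# Stack the pairs in both directions so each ID sees its partner
long <- rbind(dataset2[, .(ID = Interactor1, partner = Interactor2)],
              dataset2[, .(ID = Interactor2, partner = Interactor1)])

# Join to dataset1 and aggregate; unmatched IDs get a count of 0
out <- long[dataset1, on = "ID"][
  , .(InteractionCount = sum(!is.na(partner)),
      Interactors      = toString(partner[!is.na(partner)])),
  by = ID]
```

On the sample data this reproduces the desired output, including "5, 10" for ID 1.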

Join results in more than 2^31 rows (internal vecseq reached physical limit)

风流意气都作罢 Submitted on 2020-05-27 06:19:22
Question: I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.) Here is what my code looks like:

library(parallel)
library(data.table)

data1: several million rows and 3 columns. The columns are tag, prod and v. There are 750K unique values of tag, anywhere from 1 to 1000 prods per tag, and 5000 possible values for prod. v takes any positive real value.

setkey(data1, tag)
merge(data1, data1, allow
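The error means the join result would exceed 2^31 rows, the vector-length ceiling this data.table version can address. A self-join on tag produces sum(N^2) rows over the per-tag group sizes, so the blow-up can be predicted before calling merge(); data1 below is a tiny stand-in for the real table:

```r
library(data.table)

# Tiny stand-in for data1; a self-join on tag grows quadratically per tag
data1 <- data.table(tag  = rep(c("a", "b"), times = c(3, 2)),
                    prod = c(1, 2, 3, 1, 2),
                    v    = c(0.5, 1.2, 0.7, 2.0, 0.9))

# Predicted size of merge(data1, data1, by = "tag"): sum of squared group sizes
sizes <- data1[, .N, by = tag]
predicted_rows <- sizes[, sum(as.numeric(N)^2)]   # 3^2 + 2^2 = 13 here
```

If predicted_rows lands above 2^31, aggregating before the join or looping over tags in chunks is the usual way around the vecseq limit.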

How to retrieve column for row-wise maximum value in an R data.table?

我们两清 Submitted on 2020-05-26 19:50:50
Question: I have the following R data.table:

library(data.table)
iris = as.data.table(iris)

> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
...

Let's say I wanted to find the row-wise maximum value in each row, only over a subset of the data.table columns: Sepal.Length, Sepal.Width, Petal
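A sketch using base R's max.col(), which returns the per-row index of the largest value; indexing the column-name vector with it yields the winning column's name, and pmax() gives the value itself:

```r
library(data.table)

dt_iris <- as.data.table(iris)
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# max.col() gives, per row, the position of the largest value among cols;
# ties.method = "first" makes tie handling deterministic
dt_iris[, max_col := cols[max.col(.SD, ties.method = "first")], .SDcols = cols]
dt_iris[, max_val := do.call(pmax, .SD), .SDcols = cols]
```

For the first iris row the result is max_col = "Sepal.Length" with max_val = 5.1.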