plyr | 易学教程

How to strsplit different number of strings in certain column by do function

Question: I have a problem splitting a column's values when the elements contain different numbers of strings. I can do it in plyr, e.g.:

    library(plyr)
    column <- c("jake", "jane jane", "john john john")
    df <- data.frame(1:3, name = column)
    df$name <- as.character(df$name)
    df2 <- ldply(strsplit(df$name, " "), rbind)
    View(df2)

As a result, we get a data frame whose number of columns equals the maximum number of strings in any element. When I tried to do it in dplyr, I used the do function:

    library(dplyr)
    df2 <
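As a point of comparison, the same padded split can be sketched in base R without plyr or dplyr. This is only an illustration under assumptions: the `id` column and the `part1`, `part2`, ... column names are hypothetical and do not come from the question.

```r
# Split each name on spaces and pad shorter rows with NA (base R only).
column <- c("jake", "jane jane", "john john john")
df <- data.frame(id = 1:3, name = column, stringsAsFactors = FALSE)

parts <- strsplit(df$name, " ")   # list of character vectors, one per row
width <- max(lengths(parts))      # the widest row decides the column count

# `length<-` extends each vector to `width`, filling the tail with NA
padded <- lapply(parts, `length<-`, width)

df2 <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df2) <- paste0("part", seq_len(width))  # hypothetical column names
print(df2)
```

Padding every piece to a common length before `rbind` is what keeps the rows rectangular.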

Idiomatic R code for partitioning a vector by an index and performing an operation on that partition

I'm trying to find the idiomatic way in R to partition a numerical vector by some index vector, find the sum of all numbers in each partition, and then divide each individual entry by its partition sum. In other words, if I start with this:

    df <- data.frame(x = c(1,2,3,4,5,6), index = c('a', 'a', 'b', 'b', 'c', 'c'))

I want the output to create a vector (let's call it z):

    c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))

If I were doing this in SQL and could use window functions, I would do this:

    select x / sum(x) over (partition by index) as z from df

and if I were using plyr, I would
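One way to sketch the windowed division described above is base R's `ave()`, which applies a function within each group and returns a vector of the original length. This is offered as one possible approach, not necessarily the one the original thread settled on.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 index = c("a", "a", "b", "b", "c", "c"))

# Within each index group, divide every entry by that group's sum
df$z <- ave(df$x, df$index, FUN = function(v) v / sum(v))
print(df)
```

Because `ave()` keeps the input order, `z` lines up row by row with `x`, much like the SQL window function.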

How to create a lag variable within each group?

Question: I have a data.table:

    set.seed(1)
    data <- data.table(time = c(1:3, 1:4), groups = c(rep(c("b", "a"), c(3, 4))), value = rnorm(7))
    data
    #    groups time      value
    # 1:      b    1 -0.6264538
    # 2:      b    2  0.1836433
    # 3:      b    3 -0.8356286
    # 4:      a    1  1.5952808
    # 5:      a    2  0.3295078
    # 6:      a    3 -0.8204684
    # 7:      a    4  0.4874291

I want to compute a lagged version of the "value" column, within each level of "groups". The result should look like

    #   groups time     value lag.value
    # 1      a    1 1.5952808        NA
    # 2      a    2 0.3295078 1.5952808
    # 3
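A grouped lag of this kind can be sketched in base R with `ave()`. Note the assumption: a plain data.frame stands in for the data.table so the example runs without extra packages. With data.table itself, `shift()` combined with `by = groups` is a common alternative.

```r
set.seed(1)
# A plain data.frame stands in for the question's data.table (assumption)
data <- data.frame(time = c(1:3, 1:4),
                   groups = rep(c("b", "a"), c(3, 4)),
                   value = rnorm(7))

# Sort so that "previous row" means "previous time step" within each group
data <- data[order(data$groups, data$time), ]

# Shift each group's values down by one, putting NA in the first slot
data$lag.value <- ave(data$value, data$groups,
                      FUN = function(v) c(NA, head(v, -1)))
print(data)
```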

Aggregate a data frame based on unordered pairs of columns

Question: I have a data set that looks something like this:

       id1  id2 size
    1 5400 5505    7
    2 5033 5458    1
    3 5452 2873   24
    4 5452 5213    2
    5 5452 4242   26
    6 4823 4823    4
    7 5505 5400   11

Here id1 and id2 are unique nodes in a graph, and size is a value assigned to the directed edge connecting them from id1 to id2. This data set is fairly large (a little over 2 million rows). What I would like to do is sum the size column, grouped by unordered node pairs of id1 and id2. For example, in the first row, we have id1
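One way to sketch the unordered-pair grouping is to put each pair into a canonical order with `pmin()` and `pmax()` and then aggregate on the ordered columns. The helper column names `lo` and `hi` are hypothetical. `aggregate()` is used here only for clarity and may be slow on millions of rows.

```r
edges <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Canonical order: the smaller id goes in `lo`, the larger in `hi`,
# so (5400, 5505) and (5505, 5400) become the same key
edges$lo <- pmin(edges$id1, edges$id2)
edges$hi <- pmax(edges$id1, edges$id2)

totals <- aggregate(size ~ lo + hi, data = edges, FUN = sum)
print(totals)
```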

Aggregate a dataframe on a given column and display another column

Question: I have a dataframe in R of the following form:

    > head(data)
      Group Score Info
    1     1     1    a
    2     1     2    b
    3     1     3    c
    4     2     4    d
    5     2     3    e
    6     2     1    f

I would like to aggregate it on the Score column using the max function:

    > aggregate(data$Score, list(data$Group), max)
      Group.1 x
    1       1 3
    2       2 4

But I would also like to display the Info column associated with the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:

      Group.1 x y
    1       1 3 c
    2       2 4 d

Any hint? Answer 1: First
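As an illustration of the desired result (not a reconstruction of the truncated answer), one base R sketch splits the data by group and keeps the row with the highest score in each piece. `which.max()` returns only the first maximum, so ties are resolved in favour of the earliest row.

```r
data <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                   Score = c(1, 2, 3, 4, 3, 1),
                   Info  = c("a", "b", "c", "d", "e", "f"),
                   stringsAsFactors = FALSE)

# For each group, keep the whole row where Score is largest
best <- do.call(rbind, lapply(split(data, data$Group),
                              function(d) d[which.max(d$Score), ]))
rownames(best) <- NULL
print(best)
```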

dplyr summarise: Equivalent of “.drop=FALSE” to keep groups with zero length in output

Question: When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result? Here's an example with fake data.

    library(dplyr)
    df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
    # Now add an extra level to df$b that has no corresponding value in df$a
    df$b = factor(df$b, levels=1:3)
    # Summarise with plyr,

Applying a function to every row of a table using dplyr?

Question: When working with plyr I often found it useful to use adply for scalar functions that I have to apply to each and every row, e.g.:

    data(iris)
    library(plyr)
    head(adply(iris, 1, transform, Max.Len = max(Sepal.Length, Petal.Length)))
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
    1          5.1         3.5          1.4         0.2  setosa     5.1
    2          4.9         3.0          1.4         0.2  setosa     4.9
    3          4.7         3.2          1.3         0.2  setosa     4.7
    4          4.6         3.1          1.5         0.2  setosa     4.6
    5          5.0         3.6          1.4         0.2  setosa     5.0
    6          5.4         3.9          1.7         0.4  setosa     5.4

Now I'm using dplyr more,
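For this particular example, a vectorised function can stand in for the row-by-row call. The sketch below swaps the scalar `max()` for `pmax()`, which compares two columns element by element. It is a base R alternative, not the dplyr row-wise approach the question goes on to ask about.

```r
data(iris)

# pmax() takes the larger of the two columns for every row at once
out <- transform(iris, Max.Len = pmax(Sepal.Length, Petal.Length))
print(head(out))
```

This only works when a vectorised counterpart of the scalar function exists. Arbitrary per-row functions still need a row-wise mechanism.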

Fastest way to add rows for missing time steps?

Question: I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes there might be missing time periods for any given group. I'd like to fill in those rows with NA. Below is example data for one group (of several thousand).

    structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, -0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", "Time", "Value"), row.names = c(NA, 4L), class = "data.frame")

    Id Time
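The gap filling described above can be sketched in base R by building the full Id and Time grid per group and left-joining the observed rows onto it. This is a readable sketch, not a claim about the fastest method. The names `dat`, `grid`, and `filled` are hypothetical.

```r
dat <- data.frame(Id = c(1, 1, 1, 1),
                  Time = c(1, 2, 4, 5),
                  Value = c(0.568780482159894, -0.7207749516298,
                            1.24258192959273, 0.682123081696789))

# For each Id, list every time step between its first and last observation
grid <- do.call(rbind, lapply(split(dat, dat$Id), function(d) {
  data.frame(Id = d$Id[1], Time = seq(min(d$Time), max(d$Time)))
}))

# Left join: time steps with no observation get NA in Value
filled <- merge(grid, dat, by = c("Id", "Time"), all.x = TRUE)
print(filled)
```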

How to select the rows with maximum values in each group with dplyr? [duplicate]

Question: This question already has answers here: How to select the row with the maximum value in each group (10 answers). Closed 7 months ago. I would like to select the row with the maximum value in each group with dplyr. First, I generate some random data to illustrate my question:

    set.seed(1)
    df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
    df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row.

    library(plyr)
    ddply(df, .(A, B), function(x) x[which.max(x$value),])

In dplyr, I

Convert data from long format to wide format with multiple measure columns

Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 7 years ago. I am having trouble figuring out the most elegant and flexible way to switch data from long format to wide format when I have more than one measure variable I want to bring along. For example, here's a simple data frame in long format. ID is the subject, TIME is a time variable, and X and Y are measurements made of ID at TIME:

    > my.df <- data.frame(ID=rep(c("A","B","C")
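Since the example data frame is cut off, the sketch below uses a small hypothetical `my.df` with the same column roles (ID, TIME, X, Y). The values are invented purely for illustration. Base R's `reshape()` can widen several measure columns at once.

```r
# Hypothetical stand-in for the truncated example data
my.df <- data.frame(ID   = rep(c("A", "B"), each = 2),
                    TIME = rep(1:2, times = 2),
                    X    = c(10, 11, 20, 21),
                    Y    = c(0.1, 0.2, 0.3, 0.4))

# One row per ID; every measure column is spread across the TIME values
wide <- reshape(my.df, idvar = "ID", timevar = "TIME", direction = "wide")
print(wide)
```

The widened columns are named by joining the measure name and the time value, for example `X.1` and `Y.2`.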