plyr | 易学教程

Finding the column number and value of the second highest value in a row

I am trying to write some code which identifies the greatest two values for each row and provides their column number and value.

    df = data.frame(
      car  = c(2, 1, 1, 1, 0),
      bus  = c(0, 2, 0, 1, 0),
      walk = c(0, 3, 2, 0, 0),
      bike = c(0, 4, 0, 0, 1))

I've managed to get it to do this for the maximum value using the max and max.col functions.

    df$max = max.col(df, ties.method = "first")
    df$val = apply(df[, 1:4], 1, max)

As far as I know there are no equivalent functions for the second highest value, so doing this has made things a little trickier. Using this code provides the second highest value but (importantly) not
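One possible approach to the second-highest problem is to order each row in decreasing order and take the second position, which gives both the column number and the value. This is a minimal base R sketch on the question's data, not the accepted answer; the helper name `second_highest` and the output columns `max2` and `val2` are made up for illustration, and ties are left to whatever order `order()` returns.

```r
df <- data.frame(
  car  = c(2, 1, 1, 1, 0),
  bus  = c(0, 2, 0, 1, 0),
  walk = c(0, 3, 2, 0, 0),
  bike = c(0, 4, 0, 0, 1)
)

# Hypothetical helper: for one row, return the column index and the value
# of the second largest entry.
second_highest <- function(row) {
  idx <- order(row, decreasing = TRUE)[2]
  c(col = idx, val = row[[idx]])
}

# Work on a numeric matrix so apply() hands each row over as a vector.
m <- as.matrix(df[, 1:4])
res <- t(apply(m, 1, second_highest))

df$max2 <- res[, "col"]  # column number of the second highest value
df$val2 <- res[, "val"]  # the second highest value itself
df
```

When a row has tied values, the second highest value is still well defined, but which of the tied columns is reported depends on the ordering, so a tie-breaking rule would need to be chosen explicitly.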

R Dynamically build “list” in data.table (or ddply)

My aggregation needs vary among columns / data.frames. I would like to pass the "list" argument to the data.table dynamically. As a minimal example:

    require(data.table)
    type <- c(rep("hello", 3), rep("bye", 3), rep("ok", 3))
    a <- rep(1:3, 3)
    b <- runif(9)
    c <- runif(9)
    df <- data.frame(cbind(type, a, b, c), stringsAsFactors = FALSE)
    DT <- data.table(df)

This call:

    DT[, list(suma = sum(as.numeric(a)), meanb = mean(as.numeric(b)), minc = min(as.numeric(c))), by = type]

will produce a result similar to this:

        type suma     meanb      minc
    1: hello    6 0.1332210 0.4265579
    2:   bye    6 0.5680839 0.2993667
    3:    ok    6 0.5694532 0

Population pyramid plot with ggplot2 and dplyr (instead of plyr)

Question: I am trying to reproduce the simple population pyramid from the post "Simpler population pyramid in ggplot2" using ggplot2 and dplyr (instead of plyr). Here is the original example with plyr and a seed:

    set.seed(321)
    test <- data.frame(v = sample(1:20, 1000, replace = TRUE), g = c('M', 'F'))
    require(ggplot2)
    require(plyr)
    ggplot(data = test, aes(x = as.factor(v), fill = g)) +
      geom_bar(subset = .(g == "F")) +
      geom_bar(subset = .(g == "M"), aes(y = ..count.. * (-1))) +
      scale_y_continuous(breaks = seq(-40, 40, 10), labels = abs(seq(-40

Subtract pairs of columns based on matching column

I'll apologise in advance - I know this has likely been answered elsewhere, but I don't seem to be able to find the answer I need, and can't manage to adapt other code I have found to my needs. I have a data frame:

    FILE | TECHNIQUE | COUNT
    ------------------------
    A    | ONE       | 10
    A    | TWO       | 25
    B    | ONE       | 5
    B    | TWO       | 30
    C    | ONE       | 30
    C    | TWO       | 50

I would like to produce a data frame of the difference of the COUNT values between ONE and TWO, with a row for each FILE, i.e.

    FILE | DIFFERENCE
    -----------------
    A    | 15
    B    | 25
    C    | 20

I'm convinced I should be able to do this fairly easily with base R or
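The pairing described in the question can be sketched in base R by splitting the rows by TECHNIQUE and merging the two halves on FILE. This is only one way to do it, assuming every FILE has exactly one ONE row and one TWO row; the intermediate names `one`, `two`, `wide` and `result` are made up for illustration.

```r
df <- data.frame(
  FILE      = c("A", "A", "B", "B", "C", "C"),
  TECHNIQUE = c("ONE", "TWO", "ONE", "TWO", "ONE", "TWO"),
  COUNT     = c(10, 25, 5, 30, 30, 50)
)

# Split into the two techniques, then match the rows up by FILE.
one <- df[df$TECHNIQUE == "ONE", c("FILE", "COUNT")]
two <- df[df$TECHNIQUE == "TWO", c("FILE", "COUNT")]
wide <- merge(one, two, by = "FILE", suffixes = c("_ONE", "_TWO"))

# One row per FILE, holding TWO minus ONE.
result <- data.frame(
  FILE       = wide$FILE,
  DIFFERENCE = wide$COUNT_TWO - wide$COUNT_ONE
)
result
```

Merging on FILE rather than relying on row order means the result does not depend on how the input happens to be sorted.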

Efficient multiplication of columns in a data frame

I have a large data frame in which I am multiplying two columns together to get another column. At first I was running a for-loop, like so:

    for(i in 1:nrow(df)){
      df$new_column[i] <- df$column1[i] * df$column2[i]
    }

but this takes like 9 days. Another alternative was plyr, and I actually might be using the variables incorrectly:

    new_df <- ddply(df, .(column1,column2), transform, new_column = column1 * column2)

but this is taking forever.

As Blue Magister said in comments,

    df$new_column <- df$column1 * df$column2

should work just fine. Of course we can never know for sure if we don't have an
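To illustrate the point about vectorisation, here is a small sketch that runs the loop from the question and the single vectorised multiplication on the same made-up data and checks that they agree. The data frame, its size and the column name `loop_result` are assumptions for the example only.

```r
set.seed(1)
df <- data.frame(column1 = runif(1000), column2 = runif(1000))

# Row-by-row loop, as in the question (slow on large data).
df$loop_result <- NA_real_
for (i in seq_len(nrow(df))) {
  df$loop_result[i] <- df$column1[i] * df$column2[i]
}

# Vectorised form: one multiplication over the whole columns.
df$new_column <- df$column1 * df$column2

# Both approaches should give the same numbers.
all.equal(df$loop_result, df$new_column)
```

Arithmetic operators in R already work element-wise on whole vectors, so no grouping or looping construct is needed for a plain column product.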

Grouped correlation with dplyr (works only on console)

Question: I'm trying to use dplyr to calculate grouped correlations, but something is clearly wrong since the code below works only in the console:

    require(dplyr)
    set.seed(123)
    xx = data.frame(group = rep(1:4, 100), a = rnorm(400), b = rnorm(400))
    gp = group_by(xx, group)
    summarize(gp, cor(a, b))

      group   cor(a, b)
    1     1 -0.02073084
    2     2  0.12803353
    3     3  0.06236264
    4     4 -0.06181904

If I use the same code in RStudio, I get:

       cor(a, b)
    1 0.02739193

What's happening?

Answer 1: What you experience is related to having

Convert R list to dataframe with missing/NULL elements

Given a list:

    alist = list(
      list(name="Foo",age=22),
      list(name="Bar"),
      list(name="Baz",age=NULL)
    )

what's the best way to convert this into a dataframe with name and age columns, with missing values (I'll accept NA or "" in that order of preference)? Simple methods using ldply fail because it tries to convert each list element into a data frame, but the one with the NULL barfs because the lengths don't match. Best I have at the moment is:

    > ldply(alist,function(s){t(data.frame(unlist(s)))})
      name  age
    1  Foo   22
    2  Bar <NA>
    3  Baz <NA>

but that's pretty icky and the numeric variable becomes a factor
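One way around the NULL problem, sketched here in base R rather than with ldply, is to pull each field out separately and substitute a typed NA whenever the field is absent or NULL. The helper `field_or_na` is a made-up name for this example, and the sketch assumes every field holds at most one value per element.

```r
alist <- list(
  list(name = "Foo", age = 22),
  list(name = "Bar"),
  list(name = "Baz", age = NULL)
)

# Hypothetical helper: return the field, or the supplied NA when the field
# is missing from the element or is NULL.
field_or_na <- function(s, field, na) {
  value <- s[[field]]
  if (is.null(value)) na else value
}

# vapply() enforces one value of the stated type per list element,
# so age stays numeric instead of turning into a factor.
result <- data.frame(
  name = vapply(alist, field_or_na, character(1),
                field = "name", na = NA_character_),
  age  = vapply(alist, field_or_na, numeric(1),
                field = "age", na = NA_real_),
  stringsAsFactors = FALSE
)
result
```

Extracting column by column keeps the types under control, at the cost of having to list the expected fields up front.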

R resetting a cumsum to zero at the start of each year

Question: I have a dataframe with a bunch of donations data. I take the data and arrange it in time order from oldest to most recent gifts. Next I add a column containing a cumulative sum of the gifts over time. The data spans multiple years, and I was looking for a good way to reset the cumsum to 0 at the start of each year (the year starts and ends July 1st for fiscal purposes). This is how it currently is:

    id  date       giftamt cumsum()
    005 01-05-2001 20.00   20.00
    007 06-05-2001 25.00   45.00
    009 12-05
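A restart at each fiscal year can be sketched by first deriving a fiscal-year label from the date and then taking the cumulative sum within each label. The data below is entirely made up (the question's table is cut off), the dates are assumed to be parsed as `Date` already, and the column names `fiscal_year` and `fy_cumsum` are illustrative.

```r
# Hypothetical donations, already in Date form.
gifts <- data.frame(
  id      = c("005", "007", "009", "012", "015"),
  date    = as.Date(c("2001-01-05", "2001-06-05", "2001-12-05",
                      "2002-03-10", "2002-08-01")),
  giftamt = c(20, 25, 10, 15, 30)
)
gifts <- gifts[order(gifts$date), ]

# Fiscal year runs 1 July to 30 June; label it by the year it ends in.
month <- as.integer(format(gifts$date, "%m"))
year  <- as.integer(format(gifts$date, "%Y"))
gifts$fiscal_year <- ifelse(month >= 7, year + 1L, year)

# ave() applies cumsum within each fiscal year, so the total restarts.
gifts$fy_cumsum <- ave(gifts$giftamt, gifts$fiscal_year, FUN = cumsum)
gifts
```

Labelling the fiscal year by its end year is just a convention chosen here; any label works as long as gifts on either side of July 1st fall into different groups.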

adding text to ggplot geom_jitter points that match a condition

How can I add text to points rendered with geom_jitter to label them? geom_text will not work because I don't know the coordinates of the jittered dots. Is it possible to capture the position of the jittered points so I can pass it to geom_text? My practical use case is a boxplot with geom_jitter drawn over it to show the data distribution, where I would like to label the outlier dots or the ones that match a certain condition (for example, the lowest 10% of the values used to color the plot). One solution would be to capture the xy positions of the jittered points and use them later in another

round_any equivalent for dplyr?

Question: I am trying to switch to the "new" tidyverse ecosystem and avoid loading the older packages from Wickham et al. that I previously relied on. I found the round_any function from plyr useful in many cases where I needed custom rounding for plots, tables, etc. E.g.

    x <- c(1.1, 1.0, 0.99, 0.1, 0.01, 0.001)
    library(plyr)
    round_any(x, 0.1, floor)
    # [1] 1.1 1.0 0.9 0.1 0.0 0.0

Is there an equivalent of the round_any function from the plyr package in the tidyverse?

Answer 1: ggplot::cut_width as
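Independently of any package, the rounding shown in the question can be reproduced with a few lines of base R: divide by the accuracy, apply the rounding function, and multiply back. The sketch below is an assumption about what such a helper could look like, not plyr's or the tidyverse's API; the name `round_to` is made up.

```r
# Sketch of a round_any-style helper: scale, apply the rounding function,
# then scale back. f defaults to round but can be floor or ceiling.
round_to <- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

x <- c(1.1, 1.0, 0.99, 0.1, 0.01, 0.001)
round_to(x, 0.1, floor)    # same call shape as round_any(x, 0.1, floor)
round_to(x, 0.5, ceiling)  # round up to the next multiple of 0.5
```

Because the result is a product of floating-point numbers, values such as 0.9 may carry tiny representation error, so comparisons are safer with `all.equal()` than with `==`.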