plyr | 易学教程

Elegant way to solve ddply task with aggregate (hoping for better performance)

Question: I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:

      chromosome probeset               ensg symbol   XXA_00   XXA_36   XXB_00
    1          X  4938842 ENSMUSG00000000003   Pbsn 4.796123 4.737717 5.326664

I want to compute the mean of each numeric column over rows with the same ensg value. The problem is that I would like to leave the other identity variables, chromosome and symbol, untouched, since they are also identical for rows with the same ensg. In the end I would like to have a ...
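One way to approach the task described in this question is base R's `aggregate` with the constant identity columns included in the grouping formula. The following is only a sketch: the sample rows, the second identifier `ENSG_HYPOTHETICAL`, and the symbol `GeneB` are invented for illustration, and the desired output is truncated in the snippet, so its exact shape is an assumption.

```r
# Hypothetical sample data modelled on the columns shown in the question.
# Only the first ensg/symbol pair comes from the snippet; the rest is made up.
df <- data.frame(
  chromosome = c("X", "X", "1"),
  probeset   = c(4938842, 4938843, 5000001),
  ensg       = c("ENSMUSG00000000003", "ENSMUSG00000000003", "ENSG_HYPOTHETICAL"),
  symbol     = c("Pbsn", "Pbsn", "GeneB"),
  XXA_00     = c(4.0, 6.0, 7.0),
  XXA_36     = c(4.5, 5.5, 8.0),
  XXB_00     = c(5.0, 5.0, 9.0),
  stringsAsFactors = FALSE
)

# Group by ensg together with the columns that are constant within each ensg,
# so they are carried through unchanged. probeset is not in the formula and
# is therefore dropped from the result.
res <- aggregate(
  cbind(XXA_00, XXA_36, XXB_00) ~ chromosome + ensg + symbol,
  data = df,
  FUN  = mean
)
print(res)
```

Because chromosome and symbol do not vary within an ensg, adding them to the grouping does not create extra groups. Note that the formula interface of `aggregate` omits rows containing NA by default.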

Create an “index” for each element of a group with data.table

Question: My data is grouped by the IDs in V6 and ordered by position (V1:V3):

    dt
         V1      V2      V3 V4 V5                 V6
    1: chr1 3054233 3054733  .  + ENSMUSG00000090025
    2: chr1 3102016 3102125  .  + ENSMUSG00000064842
    3: chr1 3205901 3207317  .  - ENSMUSG00000051951
    4: chr1 3206523 3207317  .  - ENSMUSG00000051951
    5: chr1 3213439 3215632  .  - ENSMUSG00000051951
    6: chr1 3213609 3216344  .  - ENSMUSG00000051951
    7: chr1 3214482 3216968  .  - ENSMUSG00000051951
    8: chr1 3421702 3421901  .  - ENSMUSG00000051951
    9: chr1 3466587 3466687  .  + ...

R Plyr - Ordering results from DDPLY?

Question: Does anyone know a slick way to order the results coming out of a ddply summarise operation? This is what I'm doing to get the output ordered by descending depth:

    ddims <- ddply(diamonds, .(color), summarise, depth = mean(depth), table = mean(table))
    ddims <- ddims[order(-ddims$depth),]

With output:

    > ddims
      color    depth    table
    7     J 61.88722 57.81239
    6     I 61.84639 57.57728
    5     H 61.83685 57.51781
    4     G 61.75711 57.28863
    1     D 61.69813 57.40459
    3     F 61.69458 57.43354
    2     E 61.66209 57.49120

Not too ugly, ...

Calculate correlation by aggregating columns of data frame

Question: I have the following data frame:

    y <- data.frame(group = letters[1:5], a = rnorm(5), b = rnorm(5), c = rnorm(5), d = rnorm(5))

How can I get a data frame that gives me the correlation between columns a,b and c,d for each row? Something like:

    sapply(y, function(x) {cor(x[2:3],x[4:5])})

Thank you, S

Answer 1: You could use apply:

    > apply(y[,-1],1,function(x) cor(x[1:2],x[3:4]))
    [1] -1 -1  1 -1  1

Or ddply (although this might be overkill, and if two rows have the same group it will do the correlation ...

How to populate parameters values present in rows of one dataframe(df1) to dataframe(df2) under same parameter field in R

Question: New to R, please guide! Dataframe1 contains:

    df1
    Col1 Col2 Col3 Col4 Col5
    A=5  C=1  E=5  F=4  G=2   --Row1
    A=6  B=3  D=6  E=4  F=4   --Row2
    B=2  C=3  D=3  E=3  F=7   --Row3

Dataframe2 contains one row with each parameter as a field name:

    df2 = A B C D E F G .....'n'

Example output (if a value is not found, null should be printed):

    df2:
    A B C D E F G
    5   1   5 4 2
    6 3   6 4 4
      2 3 3 3 7

How can I populate the values of each parameter from df1 into df2, under the matching parameter names that appear in its first row as fields?

Answer 1: Create a row ...

Summary data tables from wide data.frames

Question: I am trying to find lazy/easy ways of creating summary tables/data.frames from wide data.frames. Assume the following data.frame, but with many more columns, so that specifying the column names takes a long time:

    set.seed(2)
    x <- data.frame(Rep = rep(1:3, 4),
                    Temp = c(rep(10,6), rep(20,6)),
                    pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
                    Var1 = rnorm(12, 5,2),
                    Var2 = c(rnorm(6,4,1), rnorm(6,3,5)),
                    Var3 = rt(12, 20))
    x[1:3] <- as.data.frame(apply(x[1:3], 2, function(x) as.factor(x)))

Now I can ...

Normalize data by use of ratios based on a changing dataset in R

Question: I am trying to normalize a Y scale by converting all values to percentages. To do that, I need to divide every number in a column by the first number in that column. In Excel, this would be equivalent to locking a cell: A1/$A1, B1/$A1, C1/$A1, then D1/$D1, E1/$D1... The data first needs to meet four criteria (Time, Treatment, Concentration and Type), and the reference value changes at every new treatment. Each treatment has 4 concentrations (0, 0.1, 2 and 50). I would like for the values ...
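The "divide by the first value of each group" idea in this question can be sketched with base R's `ave`. This is an illustration under assumptions: the data frame `d`, its values, and the column names `Value` and `Percent` are hypothetical, only Treatment is used as the grouping variable, and the rows are assumed to be sorted so that the reference row (concentration 0) comes first within each treatment.

```r
# Hypothetical data: two treatments, each measured at the four
# concentrations named in the question (0, 0.1, 2 and 50).
d <- data.frame(
  Treatment     = rep(c("T1", "T2"), each = 4),
  Concentration = rep(c(0, 0.1, 2, 50), times = 2),
  Value         = c(200, 150, 100, 50, 80, 60, 40, 20)
)

# ave() applies FUN within each group and returns a vector in the original
# row order; v[1] is the first row of the group, used here as the reference.
d$Percent <- 100 * ave(d$Value, d$Treatment, FUN = function(v) v / v[1])
print(d)
```

Further grouping variables such as Time or Type could be passed to `ave` as additional arguments before `FUN`, so that the reference value resets for each combination.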

Calculate the monthly returns with data.frames in R

Question: I want to calculate the monthly returns for a list of securities over a period of time. The data I have has the following structure:

    date         name value
    "2014-01-31" a    10.0
    "2014-02-28" a    11.1
    "2014-03-31" a    12.1
    "2014-04-30" a    11.9
    "2014-05-31" a    11.5
    "2014-06-30" a    11.88
    "2014-01-31" b    6.0
    "2014-02-28" b    8.5
    "2014-03-31" b    8.2
    "2014-04-30" b    8.8
    "2014-05-31" b    8.3
    "2014-06-30" b    8.9

The code I tried:

    database$date=as.Date(database$date)
    monthlyReturn<- function(df) { (df$value[2] - df$value[1]) ...
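A per-security monthly return can be sketched in base R with `ave` and `diff`. The question's own function is cut off, so the definition used here, the simple return (current - previous) / previous, is an assumption, as is the new column name `ret`. The data are the rows shown in the question.

```r
# Data as listed in the question.
database <- data.frame(
  date  = rep(c("2014-01-31", "2014-02-28", "2014-03-31",
                "2014-04-30", "2014-05-31", "2014-06-30"), times = 2),
  name  = rep(c("a", "b"), each = 6),
  value = c(10.0, 11.1, 12.1, 11.9, 11.5, 11.88,
            6.0, 8.5, 8.2, 8.8, 8.3, 8.9)
)
database$date <- as.Date(database$date)

# Make sure rows are in date order within each security.
database <- database[order(database$name, database$date), ]

# Simple return within each security: (current - previous) / previous.
# The first month of each security has no previous value, hence NA.
database$ret <- ave(
  database$value, database$name,
  FUN = function(v) c(NA, diff(v) / head(v, -1))
)
print(database)
```

With this definition, security a moves from 10.0 to 11.1 in February, a return of 0.11.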

Find proportion across categories, grouped by a second category using ddply

Question: I want to find the percentage distribution of a numerical value across a given category, but grouped by a second category. For example, suppose I have a data frame with region, line_of_business, and sales, and I want to find the percentage of sales by line_of_business, grouped by region. I could do this with R's built-in aggregate and merge functions, but I was curious whether there is a shorter way to do this with plyr's ddply function that avoids an explicit call to merge.

Answer 1: How ...
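The within-group percentage described in this question can be sketched without `merge` by dividing each value by its group total. The example below uses base R's `ave` rather than plyr so that it runs without extra packages. The data frame `sales_df`, its values, and the column `pct` are hypothetical, and it assumes one row per combination of region and line_of_business.

```r
# Hypothetical sales data using the column names from the question.
sales_df <- data.frame(
  region           = c("East", "East", "West", "West", "West"),
  line_of_business = c("Retail", "Online", "Retail", "Online", "Wholesale"),
  sales            = c(30, 70, 20, 30, 50)
)

# ave() returns each row's regional total, aligned with the original rows,
# so the share is a plain element-wise division.
sales_df$pct <- 100 * sales_df$sales /
  ave(sales_df$sales, sales_df$region, FUN = sum)
print(sales_df)
```

With plyr loaded, the same idea is commonly written as `ddply(sales_df, .(region), transform, pct = 100 * sales / sum(sales))`, which also avoids an explicit merge.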

Lag in dataframe

Question: I have a data frame like this:

    ID_CASE     Month
    CS00000026A 201301
    CS00000026A 201302
    CS00000026A 201303
    CS00000026A 201304
    CS00000026A 201305
    CS00000026A 201306
    CS00000026A 201307
    CS00000026A 201308
    CS00000026A 201309
    CS00000026A 201310
    CS00000191C 201302
    CS00000191C 201303
    CS00000191C 201304
    CS00000191C 201305
    CS00000191C 201306
    CS00000191C 201307
    CS00000191C 201308
    CS00000191C 201309
    CS00000191C 201310

I want the final data frame to have three additional columns, like:

    ID_CASE Month Lag_1 Lag_2 Lag_3 ...