plyr | 易学教程

Co-occurrence matrix using SAC?

Question: I have the following data frame 'x':

    id,item,volume
    a,c1,2
    a,c2,3
    a,c3,2
    a,c4,1
    a,c5,4
    b,c6,6
    b,c1,2
    b,c3,1
    b,c2,6
    b,c4,4
    c,c2,5
    c,c8,6
    c,c9,2
    d,c1,1
    e,c3,7
    e,c2,3
    e,c1,2
    e,c9,5
    e,c4,1
    f,c1,7
    f,c3,1

The first column is the id of a customer, the second column is the id of an item that the customer bought, and the third column is the number of those items bought. I'm trying to create a co-occurrence matrix, which is a square matrix with 8 rows and columns, 8 being the number of distinct items. n =

Apply common function to all data frames and return data frames with same name

Question: I'm trying to apply a function to all similarly named data frames in my global environment in R. I want to apply this function to all of these data frames, but I can't figure out how to do it without specifying them one by one. I want to return each data frame to the global environment under the same name it had before.

    mtcars_test = mtcars
    iris_test = iris
    #....etc......could be 2 of them or 88 of them...but they will all end in "_test"
    # figure out what data frames I am working with
    list_of
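One possible approach, sketched in base R: collect the matching names with `ls()`, fetch the objects with `mget()`, apply the function, and write the results back under the same names with `list2env()`. The function `first_rows` and the variable `test_names` are hypothetical stand-ins for illustration; they are not from the question.

```r
mtcars_test <- mtcars
iris_test <- iris

# Hypothetical function to apply to every data frame: keep the first three rows.
first_rows <- function(d) head(d, 3)

# Names of all global objects ending in "_test" ...
test_names <- ls(envir = globalenv(), pattern = "_test$")
# ... restricted to those that really are data frames.
test_names <- Filter(
  function(n) is.data.frame(get(n, envir = globalenv())),
  test_names
)

# Fetch the data frames as a named list, transform each one,
# and assign the results back under the original names.
transformed <- lapply(mget(test_names, envir = globalenv()), first_rows)
invisible(list2env(transformed, envir = globalenv()))

print(nrow(mtcars_test))
print(nrow(iris_test))
```

Because the list returned by `mget()` is named, `list2env()` overwrites each object in place, so the same code should work whether there are 2 or 88 matching data frames.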

Return rows establishing a “closest value to” in R

Question: I have a data frame with different IDs, and I want to make a subset in which, for each ID, I obtain only the one row whose value in variable Y is closest to 0.5. This is my data frame:

    df <- data.frame(ID=c("DB1", "BD1", "DB2", "DB2", "DB3", "DB3", "DB4", "DB4", "DB4"),
                     X=c(0.04, 0.10, 0.10, 0.20, 0.02, 0.30, 0.01, 0.20, 0.30),
                     Y=c(0.34, 0.49, 0.51, 0.53, 0.48, 0.49, 0.49, 0.50, 1.0)
    )

This is what I want to get:

    ID   X    Y
    DB1  0.10 0.49
    DB2  0.10 0.51
    DB3  0.30 0.49
    DB4  0.20 0.50

I know I can add
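One way to pick the row nearest 0.5 within each ID, sketched here in base R. The helper name `closest_row` is hypothetical, and the second ID is entered as "DB1" on the assumption that "BD1" in the question is a typo, since the desired output lists DB1 with Y = 0.49.

```r
# Data from the question, with the second ID assumed to be "DB1".
df <- data.frame(
  ID = c("DB1", "DB1", "DB2", "DB2", "DB3", "DB3", "DB4", "DB4", "DB4"),
  X  = c(0.04, 0.10, 0.10, 0.20, 0.02, 0.30, 0.01, 0.20, 0.30),
  Y  = c(0.34, 0.49, 0.51, 0.53, 0.48, 0.49, 0.49, 0.50, 1.0)
)

# Hypothetical helper: return the single row whose Y is nearest to the target.
# which.min() keeps the first row when two rows are equally close.
closest_row <- function(d, target = 0.5) {
  d[which.min(abs(d$Y - target)), , drop = FALSE]
}

# Split by ID, keep one row per group, and stack the pieces again.
result <- do.call(rbind, lapply(split(df, df$ID), closest_row))
rownames(result) <- NULL
print(result)
```

With plyr loaded, the same helper could be passed to `ddply(df, "ID", closest_row)`, which performs the split, apply, and recombine steps in one call.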

ddply and spaces in quoted variables

Question: Is it possible to use spaces in ddply? I'm using data from a spreadsheet with a lot of spaces in the column names, and I would like to keep those names because later on I want to export this data with the same column names as the original. There are 200+ columns, and using make.names will of course give me proper names, but then I lose the original column names. However, ddply doesn't seem to like spaces. Is there a workaround?

    lev=gl(2, 3, labels=c("low", "high"))
    df=data.frame(factor=lev, "fac tor"

Regression by subset in R [duplicate]

Question: I am new to R and am trying to run a linear regression on multiple subsets ("Cases") of data in a single file. I have 50 different cases, so I don't want to have to run 50 different regressions; it would be nice to automate this. I have found and experimented with the ddply method, but for some reason it returns the same coefficients for each case. The code I'm using is as follows:

How to create histogram in R with CSV time data?

Question: I have CSV data of a log for 24 hours that looks like this:

    svr01,07:17:14,'u1@user.de','8.3.1.35'
    svr03,07:17:21,'u2@sr.de','82.15.1.35'
    svr02,07:17:30,'u3@fr.de','2.15.1.35'
    svr04,07:17:40,'u2@for.de','2.1.1.35'

I read the data with

    tbl <- read.csv("logs.csv")

How can I plot this data in a histogram to see the number of hits per hour? Ideally, I would get 4 bars representing hits per hour per svr01, svr02, svr03, svr04. Thank you for helping me here!

Answer 1: An example dataset:

    dat = data

Transfer large MongoDB collections to data.frame in R with rmongodb and plyr

Question: I get some strange results with very large collections when trying to transfer them as data frames from MongoDB to R with the rmongodb and plyr packages. I picked up this code from various GitHub repositories and forums on the subject and adapted it for my purposes:

    ## load both packages
    library(rmongodb)
    library(plyr)
    ## connect to MongoDB
    mongo <- mongo.create(host="localhost")
    # [1] TRUE
    ## get the list of the databases
    mongo.get.databases(mongo)
    # list of databases (with mydatabase)
    ## get the list of the

Use ddply within a function and include variable of interest as an argument

Question: I am relatively new to R and am trying to use ddply and summarise from the plyr package. This post almost, but not quite, answers my question. I could use some additional explanation and clarification. My problem: I want to create a simple function to summarize descriptive statistics, by group, for a given variable. Unlike the linked post, I would like to include the variable of interest as an argument to the function. As has already been discussed on this site, this works:

    require(plyr)
    ddply(mtcars

R ggplot and facet grid: how to control x-axis breaks

Question: I am trying to plot the change in a time series for each calendar year using ggplot, and I am having problems with the fine control of the x-axis. If I do not use scale="free_x", then I end up with an x-axis that shows several years as well as the year in question, like this: If I do use scale="free_x", then, as one would expect, I end up with tick labels for each plot that in some cases vary by plot, which I do not want: I have made various attempts to define the x-axis using scale_x_date

How does one aggregate and summarize data quickly?

Question: I have a dataset whose headers look like so:

    PID Time Site Rep Count

I want to sum the Count by Rep for each PID x Time x Site combo. Then, on the resulting data.frame, I want to get the mean value of Count for each PID x Time x Site combo. The current function is as follows:

    dummy <- function (data) {
      A <- aggregate(Count~PID+Time+Site+Rep, data=data, function(x){sum(na.omit(x))})
      B <- aggregate(Count~PID+Time+Site, data=A, mean)
      return (B)
    }

This is painfully slow (the original data.frame is 510000 x 20). Is there a way
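A minimal sketch of the same two-step aggregation using the data.table package, whose grouped operations are often considerably faster than `aggregate()` on data of this size. The package choice, the function name `fast_dummy`, and the toy data are assumptions for illustration and do not come from the question. Rows with a missing Count are dropped before summing, which mirrors the default behaviour of the formula interface of `aggregate()`.

```r
library(data.table)

# Hypothetical toy data using the column names from the question.
dat <- data.frame(
  PID   = c(1, 1, 1, 1, 2, 2),
  Time  = c(1, 1, 1, 1, 1, 1),
  Site  = c("s1", "s1", "s1", "s1", "s1", "s1"),
  Rep   = c(1, 1, 2, 2, 1, 2),
  Count = c(2, 3, 4, NA, 5, 7)
)

fast_dummy <- function(data) {
  dt <- as.data.table(data)
  # Step 1: total Count per replicate, ignoring rows with a missing Count.
  per_rep <- dt[!is.na(Count), .(Count = sum(Count)),
                by = .(PID, Time, Site, Rep)]
  # Step 2: mean of the replicate totals per PID x Time x Site combination.
  per_rep[, .(Count = mean(Count)), by = .(PID, Time, Site)]
}

res <- fast_dummy(dat)
print(res)
```

On the toy data, PID 1 has replicate totals of 5 and 4, and PID 2 has totals of 5 and 7, so the two group means are 4.5 and 6.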