plyr

Web scraping looping through list of IDs and years in R

Submitted by 末鹿安然 on 2019-12-11 06:25:10
Question: I'm trying to scrape game logs for every MLB player dating back to 2000 from baseball-reference.com using R. I've read a lot of helpful material, but nothing extensive enough for my purposes. The URL for, say, Curtis Granderson's 2016 game logs is https://www.baseball-reference.com/players/gl.fcgi?id=grandcu01&t=b&year=2016. If I have a list of player IDs and years, I know I should be able to loop through them somehow with a function similar to this one that grabs attendance by year:
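A minimal sketch of such a loop, assuming the rvest package and a hypothetical list of player IDs; the CSS id "#batting_gamelogs" for the game-log table is also an assumption and should be checked against the live page:

```r
# URL builder (base R only): player_id and year slot into the pattern
# shown in the question.
gamelog_url <- function(player_id, year) {
  sprintf("https://www.baseball-reference.com/players/gl.fcgi?id=%s&t=b&year=%d",
          player_id, year)
}

# One player-season; needs the rvest package when actually run.
# "#batting_gamelogs" is an assumed table id -- inspect the page and
# adjust the selector if needed.
scrape_one <- function(player_id, year) {
  page <- rvest::read_html(gamelog_url(player_id, year))
  tbl  <- rvest::html_table(rvest::html_element(page, "#batting_gamelogs"))
  tbl$player_id <- player_id
  tbl$year <- year
  tbl
}

player_ids <- c("grandcu01", "troutmi01")   # hypothetical ID list
years <- 2015:2016

# Uncomment to run (hits the live site):
# all_logs <- do.call(rbind, lapply(player_ids, function(pid)
#   do.call(rbind, lapply(years, function(y) scrape_one(pid, y)))))

gamelog_url("grandcu01", 2016)
```

Stacking per-season data frames with do.call(rbind, ...) keeps one long data frame with player_id and year columns, which is easy to filter later.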

Aggregating factor level counts - by factor

Submitted by 限于喜欢 on 2019-12-11 06:15:02
Question: I have been trying to make a table displaying the counts of one factor's levels by another factor. I have looked at dozens of pages and questions, trying functions from several packages (dplyr, reshape), without managing to use them correctly. This is my data:

    var1 <- c("red","blue","red","blue","red","red","red","red","red","red","red","red","blue","red","blue")
    var2 <- c("0","1","0","0","0","0","0","0","0","0","1","0","0","0","0")
    var3 <- c("2","2","1",
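For a plain cross-tabulation, base R's table() already does this; a small sketch using the vectors from the question (var3 is omitted since it arrives truncated):

```r
var1 <- c("red","blue","red","blue","red","red","red","red",
          "red","red","red","red","blue","red","blue")
var2 <- c("0","1","0","0","0","0","0","0","0","0","1","0","0","0","0")

# Counts of var1 levels within each level of var2:
# blue appears 3 times with var2 == "0" and once with var2 == "1";
# red appears 10 times with "0" and once with "1".
table(var1, var2)
```

The same counts can be produced in dplyr with count(var1, var2), but table() needs no packages and returns a matrix-like object that is easy to index.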

How to get r.squared for each regression?

Submitted by 狂风中的少年 on 2019-12-11 05:27:56
Question: I'm working with a huge data frame with a structure similar to the following. I use output_reg to store the slope and intercept for each treatment, but I need to add the r.squared of each lm(y ~ x) and store it in another column beside the other two. Any hint on that?

    library(plyr)
    field <- c('t1','t1','t1','t2','t2','t2','t3','t3','t3')
    predictor <- c(4.2, 5.3, 5.4, 6, 7, 8.5, 9, 10.1, 11)
    response <- c(5.1, 5.1, 2.4, 6.1, 7.7, 5.5, 1.99, 5.42, 2.5)
    my_df <- data.frame(field, predictor, response,
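One way to sketch this, assuming the data shown above and plyr available: the per-group function handed to ddply can return a one-row data frame with as many columns as you like, so r.squared can ride along with the coefficients:

```r
library(plyr)

field <- c('t1','t1','t1','t2','t2','t2','t3','t3','t3')
predictor <- c(4.2, 5.3, 5.4, 6, 7, 8.5, 9, 10.1, 11)
response  <- c(5.1, 5.1, 2.4, 6.1, 7.7, 5.5, 1.99, 5.42, 2.5)
my_df <- data.frame(field, predictor, response)

# One row per field: intercept, slope, and r.squared all come from the
# same fitted model, via coef() and summary().
output_reg <- ddply(my_df, .(field), function(d) {
  fit <- lm(response ~ predictor, data = d)
  data.frame(intercept = coef(fit)[1],
             slope     = coef(fit)[2],
             r.squared = summary(fit)$r.squared)
})
output_reg
</imports> is not needed because summary.lm already carries r.squared
```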

Why is group_by -> filter -> summarise faster in R than pandas?

Submitted by 五迷三道 on 2019-12-11 05:13:21
Question: I am converting some of our older code from R to Python. In the process, I have found pandas to be a bit slower than R. I am interested in knowing whether there is anything I am doing wrong. R code (taking around 2 ms on my system):

    df = data.frame(col_a = sample(letters[1:3],20,T),
                    col_b = sample(1:2,20,T),
                    col_c = sample(letters[1:2],20,T),
                    col_d = sample(c(4,2),20,T))
    microbenchmark::microbenchmark(
      a = df %>% group_by(col_a, col_b) %>% summarise(
            a = sum(col_c == 'a'),
            b = sum(col_c == 'b'),
            c = a/b
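For reference, a self-contained version of the dplyr pipeline the question truncates (the continuation after c = a/b is a guess). One reason this reads compactly in dplyr is that later summarise() expressions can reuse columns created earlier in the same call:

```r
library(dplyr)

set.seed(1)
df <- data.frame(col_a = sample(letters[1:3], 20, TRUE),
                 col_b = sample(1:2, 20, TRUE),
                 col_c = sample(letters[1:2], 20, TRUE),
                 col_d = sample(c(4, 2), 20, TRUE))

# a and b are defined and then immediately reused by c = a / b
# within the same summarise() call.
out <- df %>%
  group_by(col_a, col_b) %>%
  summarise(a = sum(col_c == 'a'),
            b = sum(col_c == 'b'),
            c = a / b,
            .groups = "drop")
out
```

In pandas the equivalent usually needs a groupby().apply() or a two-step agg-then-assign, which is part of why small grouped summaries can benchmark slower there.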

Total Mean & Mean by groups in R with dplyr

Submitted by 試著忘記壹切 on 2019-12-11 04:38:36
Question: Assume I have a dataset something like

    df <- data.frame(dive = factor(sample(c("dive1","dive2"), 10, replace = TRUE)),
                     speed = runif(10))

My goal is to get both the total mean of the data and the mean by subgroup in the same result, something like

    #    dive Total_Mean     speed
    # 1 dive1       0.52 0.5790946
    # 2 dive2       0.52 0.4864489

I am using

    df %>%
      summarise(avg = mean(speed)) %>%
      group_by(dive) %>%
      summarise(Avg_group = mean(dive))

which I know is wrong. So all I am seeking is how can I
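A sketch of one way to get both numbers in one result: compute the overall mean with mutate() before grouping, so every row carries it into the grouped summarise (dplyr assumed loaded):

```r
library(dplyr)

set.seed(42)
df <- data.frame(dive  = factor(sample(c("dive1", "dive2"), 10, replace = TRUE)),
                 speed = runif(10))

# mutate() runs on the ungrouped data, so Total_Mean is the grand mean;
# after group_by(), summarise() collapses to one row per dive.
res <- df %>%
  mutate(Total_Mean = mean(speed)) %>%
  group_by(dive) %>%
  summarise(Total_Mean = first(Total_Mean),
            Avg_group  = mean(speed),
            .groups = "drop")
res
```

The original attempt fails because the first summarise() collapses the data frame to a single row, so nothing is left to group.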

Going from multi-core to multi-node in R

Submitted by 与世无争的帅哥 on 2019-12-11 03:54:53
Question: I've gotten accustomed to running R jobs on a cluster with 32 cores per node. I am now on a cluster with 16 cores per node. I'd like to maintain (or improve) performance by using more than one node at a time (as I had been doing). As can be seen from my dummy shell script and dummy function (below), parallelization on a single node is really easy. Is it similarly easy to extend this to multiple nodes? If so, how would I modify my scripts? R script:

    library(plyr)
    library(doMC)
    registerDoMC(16)
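A hedged sketch of the multi-node idea using a PSOCK cluster with foreach/doParallel (doMC is fork-based and so single-node only). The hostnames below are hypothetical; on a real scheduler you would build the host list from your allocation, e.g. from PBS_NODEFILE or the SLURM node list:

```r
library(parallel)
library(foreach)
library(doParallel)

# A PSOCK cluster launches one R worker per entry in the host vector,
# so listing each hypothetical node 16 times uses 16 cores per node:
#   hosts <- rep(c("node01", "node02"), each = 16)
# "localhost" repeated here gives single-machine behaviour for testing.
hosts <- rep("localhost", 2)
cl <- makePSOCKcluster(hosts)
registerDoParallel(cl)

# %dopar% now fans out across all workers, local or remote alike.
res <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)
res
```

Remote workers need passwordless SSH and the same R/package versions on every node; MPI-backed alternatives (doMPI) exist for clusters where SSH between nodes is not allowed.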

split a dataframe column by regular expression on characters separated by a “.”

Submitted by a 夏天 on 2019-12-11 03:53:23
Question: In R, I have the following dataframe:

       Name Category
    1 Beans   1.12.5
    2 Pears    5.7.9
    3  Eggs   10.6.5

What I would like to have is the following:

       Name Cat1 Cat2 Cat3
    1 Beans    1   12    5
    2 Pears    5    7    9
    3  Eggs   10    6    5

Ideally some expression built inside plyr would be nice... I will investigate on my side, but as searching this might take me a lot of time, I was just wondering whether some of you have hints on how to perform this...

Answer 1: I've written a function, concat.split (a "family" of functions, actually), as
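Without extra packages, this particular split can be sketched in base R with strsplit (fixed = TRUE so the "." is taken literally rather than as the regex wildcard that matches any character):

```r
df <- data.frame(Name = c("Beans", "Pears", "Eggs"),
                 Category = c("1.12.5", "5.7.9", "10.6.5"),
                 stringsAsFactors = FALSE)

# strsplit returns a list of character vectors; rbind-ing them gives a
# character matrix with one column per "."-separated piece.
parts <- do.call(rbind, strsplit(df$Category, ".", fixed = TRUE))
df[paste0("Cat", 1:3)] <- lapply(1:3, function(i) as.integer(parts[, i]))
df$Category <- NULL
df
```

tidyr::separate(df, Category, into = paste0("Cat", 1:3), convert = TRUE) expresses the same split in one call if tidyr is available.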

do.call to build and execute data.table commands

Submitted by 馋奶兔 on 2019-12-11 03:28:00
Question: I have a small data.table representing one record per test cell (A/B testing results), and I want to add several more columns that compare each test cell against each other test cell. In other words, the number of columns I want to add depends on how many test cells are in the A/B test in question. My data.table looks like:

    Group   Delta   SD.diff
    Control 0       0
    Cell1   0.00200 0.001096139
    Cell2   0.00196 0.001095797
    Cell3   0.00210 0.001096992
    Cell4   0.00160 0.001092716

And I want to add the
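A sketch of the dynamic-column idea with data.table's set(), assuming the table shown above; the vs.* column names are made up for illustration:

```r
library(data.table)

dt <- data.table(Group   = c("Control", paste0("Cell", 1:4)),
                 Delta   = c(0, 0.00200, 0.00196, 0.00210, 0.00160),
                 SD.diff = c(0, 0.001096139, 0.001095797,
                             0.001096992, 0.001092716))

# One new column per test cell: each row's Delta minus that cell's
# Delta. Column names are built as strings, so the number of added
# columns automatically tracks the number of cells present.
cells <- dt[Group != "Control", Group]
for (g in cells) {
  set(dt, j = paste0("vs.", g), value = dt$Delta - dt[Group == g, Delta])
}
dt
```

set() modifies the table by reference, which avoids the copy-per-column cost that a do.call-built expression would also be trying to avoid.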

Computing multiple variance of a dataset in R

Submitted by 老子叫甜甜 on 2019-12-11 03:07:33
Question: My problem is somewhat related to this question. I have data as below:

    V1 V2
    .. 1
    .. 2
    .. 1
    .. 3

I need to calculate the variance of the data in V1 for each value of V2, cumulatively (this means that for a particular value of V2, say n, all the rows of V1 whose corresponding V2 is less than n need to be included). Will ddply help in such a case?

Answer 1: I don't think ddply will help, since it is built on the concept of taking non-overlapping subsets of a data frame.

    d <- data.frame(V1=runif(1000), V2=sample(1
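The answer's point is that these subsets overlap, so a plain loop over thresholds is the simpler sketch (toy numbers stand in for the ".." values elided in the question):

```r
d <- data.frame(V1 = c(2, 4, 4, 6, 1, 9),
                V2 = c(1, 2, 1, 3, 2, 3))

# For each value n of V2, take every V1 whose V2 is strictly less
# than n. The subsets overlap, which is why ddply's disjoint
# splitting does not fit this problem.
thresholds <- sort(unique(d$V2))
cum_var <- sapply(thresholds, function(n) {
  x <- d$V1[d$V2 < n]
  if (length(x) < 2) NA_real_ else var(x)   # variance needs >= 2 points
})
data.frame(V2 = thresholds, cum_var = cum_var)
```

With these toy values the result is NA for n = 1 (nothing below it), var(c(2, 4)) = 2 for n = 2, and var(c(2, 4, 4, 1)) = 2.25 for n = 3.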

how to use aaply and retain order of dimensions in array?

Submitted by 亡梦爱人 on 2019-12-11 02:46:50
Question: I have an array with three dimensions. I would like to apply a function to the third dimension and get an array back. I was very pleased to find that plyr::aaply does nearly what I want; however, it swaps around the dimensions of my array. The documentation told me that it is idempotent, which (after I'd looked the word up) makes me think the structure should remain the same. Here's a reproducible example with the identity function. Can I modify it to retain the order of the array dimensions?

    nRow <- 10
    nCol
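A reproducible sketch of the reordering fix, assuming plyr is available: aaply moves the split margin to the front of the result, and aperm() puts it back:

```r
library(plyr)

a <- array(1:24, dim = c(2, 3, 4))

# aaply splits on margin 3, so that margin becomes the FIRST dimension
# of the result: dim(b) is 4 2 3, not 2 3 4.
b <- aaply(a, 3, identity)
dim(b)

# aperm() permutes the dimensions back to the original order:
# old dims 2 and 3 move to the front, old dim 1 goes last.
b2 <- aperm(b, c(2, 3, 1))
all(b2 == a)   # element-for-element identical to the original
```

aperm is base R's generalized transpose for arrays, so this fix adds no further dependencies beyond plyr itself.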