tapply | 易学教程

Replace NA values with median by group

Question: I used the tapply call below to get the median of Age for each Pclass. How can I now impute those medians into the NA values of Age, by Pclass?

    tapply(titan_train$Age, titan_train$Pclass, median, na.rm = TRUE)

Answer 1: Here is another base R approach that uses replace and ave.

    df1 <- transform(df1, Age = ave(Age, Pclass, FUN = function(x)
      replace(x, is.na(x), median(x, na.rm = TRUE))))
    df1
    #   Pclass Age
    # 1      A   1
    # 2      A   2
    # 3      A   3
    # 4      B   4
    # 5      B   5
    # 6      B   6
    # 7      C   7
    # 8      C   8
    # 9      C   9

Same idea but using
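
For readers who want to reuse the tapply result from the question directly, here is a minimal sketch. The data frame df1 and its values are a made-up stand-in for titan_train, and the name group_medians is hypothetical. It looks up each missing row's group median by name and fills only the NA positions.

```r
# Hypothetical toy data standing in for titan_train (not from the original post)
df1 <- data.frame(
  Pclass = c("A", "A", "A", "B", "B", "B"),
  Age    = c(1, NA, 3, 4, 6, NA)
)

# Group medians, computed exactly as in the question
group_medians <- tapply(df1$Age, df1$Pclass, median, na.rm = TRUE)

# Index the medians by group name and assign them to the NA rows only
missing <- is.na(df1$Age)
df1$Age[missing] <- group_medians[as.character(df1$Pclass[missing])]

df1
```

Indexing by as.character(Pclass) rather than by position keeps the lookup correct even when Pclass is a factor whose level order differs from the order of the tapply result.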

What does the t in tapply stand for?

Question: There seems to be general agreement that the l in "lapply" stands for list, the s in "sapply" stands for simplify and the r in "rapply" stands for recursively. But I could not find anything on the t in "tapply". I am now very curious.

Answer 1: It stands for table, since tapply is the generic form of the table function. You can see this by comparing the following calls:

    x <- sample(letters, 100, replace = TRUE)
    table(x)
    tapply(x, x, length)

although obviously tapply can do more than counting. Also, some
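
The comparison in the answer can be checked programmatically. This is a small sketch: the seed, the restriction to five letters, and the variable names counts_table and counts_tapply are assumptions made only to keep the run short and repeatable.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
x <- sample(letters[1:5], 100, replace = TRUE)

counts_table  <- table(x)              # object of class "table"
counts_tapply <- tapply(x, x, length)  # plain 1-d array of counts

# Same counts in the same level order; only the class of the result differs
same_counts <- all(as.vector(counts_table) == as.vector(counts_tapply))
same_names  <- identical(names(counts_table), names(counts_tapply))
c(same_counts, same_names)
```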

R quantile by groups with assignments

Question: I have the following df:

    group = rep(seq(1,3),30)
    variable = runif(90, 5.0, 7.5)
    df = data.frame(group,variable)

I need to (i) define quantiles by group, and (ii) assign each person to her quantile with respect to her group. Thus, the output would look like:

    id group variable quantile_with_respect_to_the_group
    1  1     6.430002 1
    2  2     6.198008 3
    .......

There is a complicated way to do it with loops and the cut function over each group, but it is not efficient at all. Does someone know a better solution?
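
One loop-free possibility is to combine ave with cut, sketched below. Two things here are assumptions rather than part of the question: "quantile" is read as quartile (four bins), and the helper name quartile_of and the column name quantile_in_group are made up for illustration.

```r
set.seed(42)  # arbitrary seed for a repeatable example
group    <- rep(seq(1, 3), 30)
variable <- runif(90, 5.0, 7.5)
df <- data.frame(group, variable)

# Assumption: "quantile" means quartile; change probs for other splits
quartile_of <- function(x) {
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
  # include.lowest = TRUE keeps the group minimum in the first bin
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

# ave applies the helper within each group and returns values in row order
df$quantile_in_group <- ave(df$variable, df$group, FUN = quartile_of)
head(df)
```

If the data contained many tied values, the breaks could stop being unique and cut would fail, so this sketch suits continuous variables like the runif draws in the question.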

Is it necessary to use factor to INDEX argument for tapply in r? [duplicate]

Question: This question already has answers here: Grouping functions (tapply, by, aggregate) and the *apply family (9 answers). Closed 3 years ago.

    x
    #    X Income Commute Job.Growth Physicians
    #1   A  26000    49.2       10.8       1987
    #2   B  29300    45.3        9.5        517
    #3   C  24800    39.8        8.2        592
    #4   D  27900    46.8        7.6       3310
    #5   E  37500    39.9       12.2        975
    #6   A  26058    47.8       10.3        647
    #7   B  33479    48.1       12.2        714
    #8   C  28869    39.6       12.7        803
    #9   D  37567    47.9       10.1        888
    #10  E  30215    39.0       10.8        672
    #11  A  38772    47.5       10.2        975
    #12  B  34577    44.4       10.2        519
    #13  C  39978    46
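
On the question in the title: the documentation for tapply describes the elements of INDEX as being coerced with as.factor, so a character column should behave the same as an explicit factor. The sketch below checks this on the first six rows of the data above; only the X and Income columns are reproduced, and the names by_character and by_factor are made up for illustration.

```r
# First six rows of the question's data (X and Income only)
x <- data.frame(
  X      = c("A", "B", "C", "D", "E", "A"),
  Income = c(26000, 29300, 24800, 27900, 37500, 26058),
  stringsAsFactors = FALSE
)

by_character <- tapply(x$Income, x$X, mean)          # INDEX is a character vector
by_factor    <- tapply(x$Income, factor(x$X), mean)  # INDEX is an explicit factor

all.equal(by_character, by_factor)
```

An explicit factor is still useful when you want to control the level order or keep levels that have no observations.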

Relative frequency in r by factor

Question: I would like to get a table of the top 10 absolute and relative frequencies of a variable across another factor variable. I have a data frame with 3 columns: the first is a factor variable, the second is the variable I need to count, and the third is a logical variable used as a constraint. (The real database has more than 4 million observations.)

    dtf<-data.frame(c("a","a","b","c","b"),c("aaa","bbb","aaa","aaa","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE))
    colnames(dtf)<-c("factor","var","log")
    dtf
      factor var   log
    1      a aaa  TRUE
    2      a bbb FALSE
    3
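
A possible starting point with base R is table plus prop.table, sketched below on the five rows given in the question. Two readings are assumed here and may not match the asker's intent: the constraint means keeping rows where log is TRUE, and "relative frequency" means the share within each level of factor. The names kept, abs_freq and rel_freq are hypothetical.

```r
dtf <- data.frame(
  factor = c("a", "a", "b", "c", "b"),
  var    = c("aaa", "bbb", "aaa", "aaa", "bbb"),
  log    = c(TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Assumption: the logical column acts as a row filter
kept <- dtf[dtf$log, ]

# Absolute frequencies of var within each level of factor
abs_freq <- table(kept$factor, kept$var)

# Relative frequencies: each row (factor level) sums to 1
rel_freq <- prop.table(abs_freq, margin = 1)

abs_freq
rel_freq
```

For a top 10 per level on real data, one option would be to sort each row of abs_freq in decreasing order and keep the first ten entries with head.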

How to perform t-tests for each level of a factor with tapply

Question: My data and code are like this:

    my_vector <- rnorm(150)
    my_factor1 <- gl(3,50)
    my_factor2 <- gl(2,75)
    tapply(my_vector, my_factor1, function(x) t.test(my_vector~my_factor2, paired=T))

I want to do a separate t-test for each level of my_factor1, to test my_vector for both levels of my_factor2. However, with my code the t-test is not splitting the levels of my_factor1, and the results are equal for each level because my_vector is entirely included in each t.test. This is the output of my code:
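
The problem described is that the anonymous function ignores its argument x and refers to the full my_vector. One common workaround is to let tapply split row indices instead of values, so the function can subset both the vector and the second factor. The sketch below makes two changes that are assumptions, not the asker's setup: my_factor2 is redefined so that both of its levels occur inside every level of my_factor1 (with gl(2,75), levels 1 and 3 of my_factor1 would each contain only one group), and an unpaired Welch t-test is used instead of paired=T.

```r
set.seed(123)  # arbitrary seed for a repeatable example
my_vector  <- rnorm(150)
my_factor1 <- gl(3, 50)
# Assumption: crossed design, 25 + 25 observations per level of my_factor1
my_factor2 <- gl(2, 25, length = 150)

# Split the row indices by my_factor1; each call sees only its own rows
tests <- tapply(seq_along(my_vector), my_factor1, function(i) {
  d <- data.frame(y = my_vector[i], g = my_factor2[i])
  t.test(y ~ g, data = d)  # unpaired Welch test (assumption, see lead-in)
})

# One p-value per level of my_factor1
p_values <- sapply(tests, function(tt) tt$p.value)
p_values
```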

does the by( ) function make growing list

Question: Does the by function make a list that grows one element at a time? I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:

    > # Make 4M rows of data
    > x = data.frame(col1=1:4000000, col2=10000001:14000000)
    > # Make a factor
    > x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
    >
    > head(x)
      col1     col2 f
    1    1 10000001 0
    2    2 10000002 0
    3    3 10000003 0
    4    4 10000004 0
    5    5 10000005 5
    6    6 10000006 5

Now, a tapply on one of the columns takes
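
To compare the two functions on your own machine, a scaled-down version of the example can be timed with system.time. This is only a sketch: the row count n is an assumption chosen so the run finishes quickly, the grouped sum is a placeholder for whatever per-group work is needed, and no particular timing result is implied.

```r
n <- 20000  # assumption: scaled down from the 4M rows in the question
x <- data.frame(col1 = 1:n, col2 = (10000000 + 1):(10000000 + n))
x$f <- x$col1 - x$col1 %% 5

# Same grouped sum computed two ways, each wrapped in system.time
t_tapply <- system.time(sums_tapply <- tapply(x$col2, x$f, sum))
t_by     <- system.time(sums_by <- by(x, x$f, function(d) sum(d$col2)))

rbind(tapply = t_tapply, by = t_by)
```

by passes a data frame subset to the function for every group, while tapply passes only a vector, so the two calls do different amounts of work per group even when the results agree.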

Summarizing Latitude, Longitude, and Counts Data for ggplot Usage

Question: I have been provided with some customer data in Latitude, Longitude, and Counts format. All the data I need to create a ggplot heatmap is present, but I do not know how to put it into the format ggplot requires. I am trying to aggregate the data by total counts within 0.01 Lat and 0.01 Lon blocks (typical heatmap), and I instinctively thought "tapply". This creates a nice summary by block size, as desired, but the format is wrong. Furthermore, I would really like to have empty Lat or Lon
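
One way to get from the tapply matrix to the long, one-row-per-block layout that ggplot expects is as.data.frame(as.table(...)). The sketch below uses made-up data: the data frame cust, its column names and its values are all assumptions, as are the helper columns LatBin and LonBin.

```r
# Hypothetical customer data; names and values are assumptions
cust <- data.frame(
  Lat    = c(40.001, 40.004, 40.012, 40.018),
  Lon    = c(-75.001, -75.003, -75.001, -75.015),
  Counts = c(2, 3, 5, 1)
)

# Snap each point to the lower edge of its 0.01-degree block
cust$LatBin <- floor(cust$Lat / 0.01) * 0.01
cust$LonBin <- floor(cust$Lon / 0.01) * 0.01

# Wide summary: rows are Lat blocks, columns are Lon blocks
wide <- tapply(cust$Counts, list(cust$LatBin, cust$LonBin), sum)

# Long format: one row per (Lat, Lon) block, NA where a block has no data
long <- as.data.frame(as.table(wide), responseName = "Counts")
names(long)[1:2] <- c("Lat", "Lon")

# The block labels come back as factors; convert them to numbers for plotting
long$Lat <- as.numeric(as.character(long$Lat))
long$Lon <- as.numeric(as.character(long$Lon))
long
```

Blocks with no observations appear as rows with NA in Counts, which can be left as NA or set to 0 depending on how the heatmap should treat them.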

Mean of variable by two factors

Question: I have the following data:

    a <- c(1,1,1,1,2,2,2,2)
    b <- c(2,4,6,8,2,3,4,1)
    c <- factor(c("A","B","A","B","A","B","A","B"))
    df <- data.frame( sp=a, length=b, method=c)

I can use the following to get a count of the number of samples of each species by method:

    n <- with(df,tapply(sp,method,function(x) count(x)))

How do I also get the mean length by method for each species?

Answer 1: Personally I would use aggregate:

    aggregate(length ~ sp, data = df, FUN = "mean")  # by species only
    #   sp length
    # 1  1      5
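
For comparison, tapply itself accepts a list of grouping vectors, which gives the mean for every species and method combination as a matrix. The sketch below rebuilds the question's data; the names mean_by_both and mean_long are made up for illustration.

```r
df <- data.frame(
  sp     = c(1, 1, 1, 1, 2, 2, 2, 2),
  length = c(2, 4, 6, 8, 2, 3, 4, 1),
  method = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)

# Two grouping vectors: rows are species, columns are methods
mean_by_both <- tapply(df$length, list(df$sp, df$method), mean)
mean_by_both

# The same summary in long form, using the formula interface of aggregate
mean_long <- aggregate(length ~ sp + method, data = df, FUN = mean)
mean_long
```

The matrix form is convenient for reading off one cell, while the aggregate result is a data frame that is easier to merge or plot.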