Question
I have a huge dataframe df1, whose oversimplified version consists of 3 columns, "Words", "Frequency" and "Letters":
Words          Frequency  Letters
flower/tree    0.15       a(0.1)
tree           0.67       a(0.4)
planet         0.85       b(0.4)
tree/planet    0.42       c(0.5)
tree           0.89       a(0.6)
flower         0.21       b(0.4)
flower/planet  0.53       b
planet         0.07       a
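For anyone who wants to reproduce this, the sample above can be built roughly as follows. This is only a sketch: the real df1 is much larger, and storing the text columns as character rather than factor is an assumption.

```r
# Sample data copied from the table above.
# stringsAsFactors = FALSE keeps Words and Letters as character (an assumption).
df1 <- data.frame(
  Words     = c("flower/tree", "tree", "planet", "tree/planet",
                "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters   = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
                "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)
```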
Using R (dplyr, the apply family of functions, etc.), I would like to count how many times each letter (a, b, c) in the "Letters" column is associated with each individual word (flower, tree, planet) in the "Words" column, separately for each frequency bin of the "Frequency" column. There are 4 bins: [0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1].
I expect an output dataframe df2 that looks something like this:
Bin       Word    Letters  count_letters
0-0.25    flower  a        1
0-0.25    flower  b        1
0-0.25    tree    a        1
0-0.25    planet  a        1
0.25-0.5  tree    c        1
0.25-0.5  planet  c        1
0.5-0.75  flower  b        1
0.5-0.75  tree    a        1
0.5-0.75  planet  b        1
0.75-1    tree    a        1
0.75-1    planet  b        1
Answer 1:
You can use cut to bin Frequency, substr to clean Letters, and tidyr::separate_rows to unnest Words. Aggregate with dplyr::count, and you're set:
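library(tidyverse)

# df1 is the data frame from the question
df1 %>%
  separate_rows(Words) %>%  # one row per word; the default sep splits on non-alphanumeric characters such as "/"
  count(Words,
        Letters = substr(Letters, 1, 1),  # use regex if more than one letter
        # right-closed bins (0,0.25], ..., (0.75,1]; a Frequency of exactly 0 becomes NA
        Frequency = cut(Frequency, breaks = seq(0, 1, .25)))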
## Source: local data frame [11 x 4]
## Groups: Frequency, Words [?]
##
## Frequency Words Letters n
## <fctr> <chr> <chr> <int>
## 1 (0,0.25] flower a 1
## 2 (0,0.25] flower b 1
## 3 (0,0.25] planet a 1
## 4 (0,0.25] tree a 1
## 5 (0.25,0.5] planet c 1
## 6 (0.25,0.5] tree c 1
## 7 (0.5,0.75] flower b 1
## 8 (0.5,0.75] planet b 1
## 9 (0.5,0.75] tree a 1
## 10 (0.75,1] planet b 1
## 11 (0.75,1] tree a 1
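If the letter codes can be longer than one character, or you want the bin labels and column names from the question, a variant along these lines might work. It is a sketch under a few assumptions: the sample df1 is rebuilt here so the snippet runs on its own, the sub() pattern assumes anything from the first "(" onward in Letters can be dropped, include.lowest = TRUE is added so that a Frequency of exactly 0 falls into the first bin, and the label strings are simply copied from the desired df2.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, rebuilt so this snippet is self-contained
df1 <- data.frame(
  Words     = c("flower/tree", "tree", "planet", "tree/planet",
                "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters   = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
                "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  separate_rows(Words, sep = "/") %>%          # one row per word
  mutate(
    # drop everything from the first "(" onward, e.g. "a(0.1)" becomes "a"
    Letters = sub("\\(.*$", "", Letters),
    # four right-closed bins with the labels used in the question
    Bin = cut(Frequency,
              breaks = seq(0, 1, 0.25),
              labels = c("0-0.25", "0.25-0.5", "0.5-0.75", "0.75-1"),
              include.lowest = TRUE)
  ) %>%
  count(Bin, Words, Letters, name = "count_letters")

print(df2)
```

Passing sep = "/" explicitly avoids splitting on other punctuation that might appear in Words in the full data.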
Source: https://stackoverflow.com/questions/42237800/count-specific-characters-from-column-associated-with-dual-categories-of-other-c