Question
I would like to count the number of times the words from a string in column animals.1 occur in the column animals.2 within the past five years:
> df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
> df
       animals.1       animals.2 year
1 cat; dog; bird  cat; dog; bird 2001
2      dog; bird dog; bird; seal 2005
3           bird            bird 2010
4            dog                 2018
Desired Output
> df
       animals.1       animals.2 year count
1 cat; dog; bird  cat; dog; bird 2001     3
2      dog; bird dog; bird; seal 2005     4
3           bird            bird 2010     1
4            dog                 2018     0
Edit
In row 2, animals.1 = dog; bird. Its appearances in column animals.2 within the previous five years are dog; bird (in 2005) and dog; bird (in 2001). Total count = 4.
In row 3, animals.1 = bird. Its appearances in column animals.2 within the previous five years are bird (in 2010); the year 2005 is outside my five-year range. Total count = 1.
I have asked a similar question, only without the year condition, in a previous post. However, the year condition cannot be added to the solutions provided.
Any help would be appreciated :)
Answer 1:
Your data is not yet in a machine-friendly shape. Machines are much better at reading data that is "long" and at performing grouping and joining operations on it.
When you test x %in% y row by row, you are performing lots of comparisons. The string operations slow you down as well (splitting a string means first finding where to split it). I would suggest converting all your data to long format and leaving it in long format until you need it in wide format for a human to look at. I'm giving you the output in your format because the question asks for it.
Most of the code below converts your data into a long format. I've put a few extra steps in the code to show what the data looks like going into the computation.
library(dplyr)
library(tidyr)
library(stringr)
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
# Convert the animals.1 column to long data (one animal per row)
animals_1_long <- df %>%
  mutate(animals_1 = str_split(animals.1, "; ")) %>%  # list column of animal names
  select(year, animals_1) %>%
  unnest(animals_1)
# # A tibble: 7 x 2
# year animals_1
# <chr> <chr>
# 1 2001 cat
# 2 2001 dog
# 3 2001 bird
# 4 2005 dog
# 5 2005 bird
# 6 2010 bird
# 7 2018 dog
# Similarly convert the animals.2 column to long data
animals_2_long <- df %>%
  mutate(animals_2 = str_split(animals.2, "; ")) %>%  # list column of animal names
  select(year, animals_2) %>%
  unnest(animals_2)
# Since we want to match the last 5 years, create a match index from year-4 to year.
animals_2_long_extend_5yrs <- animals_2_long %>%
  rename(index_year = year) %>%
  rowwise() %>%
  mutate(match_year = list(as.character((as.numeric(index_year) - 4):as.numeric(index_year)))) %>%
  ungroup() %>%
  unnest(match_year)
# # A tibble: 40 x 3
# index_year animals_2 match_year
# <chr> <chr> <chr>
# 1 2001 cat 1997
# 2 2001 cat 1998
# 3 2001 cat 1999
# 4 2001 cat 2000
# 5 2001 cat 2001
# 6 2001 dog 1997
# 7 2001 dog 1998
# 8 2001 dog 1999
# 9 2001 dog 2000
# 10 2001 dog 2001
At this point the animals_1 data is in long format with one animal/year per row. The animals_2 data is in long format with one animal/match_year/index_year per row. This allows the second dataset to cover all of the last 5 years in a single join, but then be summed up to the year we are originally interested in.
Joining the two long datasets on match_year and then filtering leaves only the rows where year matches match_year and the animal name matches. Then it is trivial to count the rows that remain for each index_year.
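The expand-then-join idea can also be sketched in base R with merge(). This is only an illustration under stated assumptions: the two long tables are typed in by hand from the example data, and the names a1, a2, win, hits and counts are hypothetical. The sketch widens the window on the animals.1 side, which is how the question phrases the condition, whereas the dplyr code in this answer widens the animals.2 side. On this sample both directions give the same counts.

```r
# Long tables typed in by hand from the example data (illustrative only).
a1 <- data.frame(year = c(2001, 2001, 2001, 2005, 2005, 2010, 2018),
                 animal = c("cat", "dog", "bird", "dog", "bird", "bird", "dog"),
                 stringsAsFactors = FALSE)
a2 <- data.frame(year = c(2001, 2001, 2001, 2005, 2005, 2005, 2010),
                 animal = c("cat", "dog", "bird", "dog", "bird", "seal", "bird"),
                 stringsAsFactors = FALSE)

# Repeat every animals.1 row five times, once per year in its look-back window.
win <- a1[rep(seq_len(nrow(a1)), each = 5), ]
win$match_year <- win$year - rep(4:0, times = nrow(a1))

# Equi-join on (match_year, animal); each surviving row is one occurrence.
hits <- merge(win, a2,
              by.x = c("match_year", "animal"),
              by.y = c("year", "animal"))

# Count occurrences per index year, keeping years that have no hits.
counts <- table(factor(hits$year, levels = unique(a1$year)))
print(counts)
```

Because the counts are tabulated over all index years, a year without any match shows up as 0 instead of dropping out of the result.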
# Join the long data and the long data with the extended match index
animal_check <- animals_1_long %>%
  rename(match_year = year) %>%
  left_join(animals_2_long_extend_5yrs, by = "match_year") %>%
  # keep only the rows where the animal names agree
  filter(animals_1 == animals_2) %>%
  # group by the index year and summarize the count
  group_by(index_year) %>%
  summarise(count = n()) %>%
  rename(year = index_year)
# # A tibble: 3 x 2
# year count
# <chr> <int>
# 1 2001 3
# 2 2005 4
# 3 2010 1
At this point the calculation is done. All that is left is adding the count back to the data with the animals.
# Join the yearly result back to the original dataframe
df <- df %>%
  left_join(animal_check, by = "year")
df
#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018    NA
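The last row comes back as NA because no rows for 2018 survived the filter, while the desired output in the question shows 0. A minimal sketch of closing that gap, assuming a missing count always means "no matches" (the data frame res below is a hypothetical stand-in for the joined result):

```r
# Hypothetical stand-in for the joined result; row 4 had no matches.
res <- data.frame(year = c("2001", "2005", "2010", "2018"),
                  count = c(3L, 4L, 1L, NA),
                  stringsAsFactors = FALSE)

# Treat a missing count as zero occurrences.
res$count[is.na(res$count)] <- 0L
print(res)
```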
Update:
# Data for benchmark:
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),
stringsAsFactors = F)
df <- replicate(10000,{df}, simplify=F) %>% do.call(rbind, .)
df$year <- as.character(seq(2000,2000 + nrow(df) - 1))
# microbenchmark results
min lq mean median uq max neval
5.785196 5.950748 6.642028 6.981055 7.001854 7.491287 5
Answer 2:
A base R way with mapply():
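within(df,
  count <- mapply(function(x, y) {
    # pool every animals.2 entry whose year falls in the window x-4 .. x
    in5year <- paste(animals.2[year %in% (x - 4):x], collapse = "; ")
    # count how many pooled entries appear among this row's animals.1
    sum(strsplit(in5year, "; ")[[1]] %in% strsplit(y, "; ")[[1]])
  }, year, animals.1)
)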
#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018     0
I presume the year column is numeric. If not, please convert it to numeric first.
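Since the question's data stores year as character, here is a minimal self-contained sketch that does the conversion inside a helper. The function name count_recent and its window argument are hypothetical, introduced only for illustration. The logic follows the mapply() approach above.

```r
# Hypothetical helper: for each row, count how many animals.2 entries from the
# last `window` years (current year included) appear among that row's animals.1.
count_recent <- function(a1, a2, year, window = 5) {
  year <- as.numeric(year)  # accepts character or numeric years
  vapply(seq_along(year), function(i) {
    in_window <- year >= year[i] - (window - 1) & year <= year[i]
    pool <- unlist(strsplit(a2[in_window], "; ", fixed = TRUE))
    sum(pool %in% strsplit(a1[i], "; ", fixed = TRUE)[[1]])
  }, integer(1))
}

df <- data.frame(
  animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
  animals.2 = c("cat; dog; bird", "dog; bird; seal", "bird", ""),
  year = c("2001", "2005", "2010", "2018"),
  stringsAsFactors = FALSE
)

df$count <- count_recent(df$animals.1, df$animals.2, df$year)
print(df)
```

Keeping the window length as an argument makes it easy to try other look-back periods without touching the body of the function.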
Source: https://stackoverflow.com/questions/54767043/count-number-of-occurrences-of-a-string-from-a-column-inside-another-column-wit