Count number of occurrences of a string from a column inside another column, with conditions

Question

I would like to count the number of times the words from a string in column animals.1 occur in the column animals.2 within the past five years:

> df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
> df
       animals.1       animals.2 year
1 cat; dog; bird  cat; dog; bird 2001
2      dog; bird dog; bird; seal 2005
3           bird            bird 2010
4            dog                 2018

Desired Output

> df
       animals.1       animals.2 year count
1 cat; dog; bird  cat; dog; bird 2001     3
2      dog; bird dog; bird; seal 2005     4
3           bird            bird 2010     1
4            dog                 2018     0

Edit

In row 2, animals.1 = dog; bird. Its appearances in column animals.2 within the previous five years are dog; bird (in 2005) and dog; bird (in 2001). Total count = 4.

In row 3, animals.1 = bird. Its only appearance in column animals.2 within the previous five years is bird (in 2010); the year 2005 falls outside the five-year range. Total count = 1.

I have asked a similar question, only without the year condition, in a previous post. However, the year condition cannot be added to the solutions provided.

Any help would be appreciated :)

Answer 1:

Your data is not yet in a machine-friendly shape. Machines are much better at reading data that is "long" and at performing grouping and joining operations on it.

When you look for x %in% y you are performing lots of comparisons. The string operations also slow you down (splitting a string means finding where to split it every time). I would suggest converting all your data to long format and leaving it that way until you need a wide format for a human to look at. I'm giving you the output in your original format only because the question asks for it.
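As a rough illustration of what "long" means here, the following base R sketch turns the animals.1 column into one animal per row. It is not the dplyr pipeline used in this answer, just a minimal stand-in; the names parts and animals_1_long are chosen for illustration.

```r
# Sketch only: base R version of "one animal per row"
df <- data.frame(
  animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
  year = c("2001", "2005", "2010", "2018"),
  stringsAsFactors = FALSE
)

# Split each cell on "; " into a list of character vectors
parts <- strsplit(df$animals.1, "; ", fixed = TRUE)

# Repeat each year once per animal found in its row
animals_1_long <- data.frame(
  year = rep(df$year, lengths(parts)),
  animals_1 = unlist(parts),
  stringsAsFactors = FALSE
)
animals_1_long
```

Once the data is in this shape, matching animals becomes a join or a grouped count rather than repeated string splitting.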

Most of the code below converts your data into a long format. I've added a few extra steps to show what the data looks like going into the computation.

library(dplyr)
library(tidyr)
library(stringr)

df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
                animals.2 = c("cat; dog; bird", "dog; bird; seal", "bird", ""),
                year = c("2001", "2005", "2010", "2018"),
                stringsAsFactors = FALSE)

# Convert the animals.1 column to long data
animals_1_long <- df %>%
  rowwise() %>%
  mutate(
    animals_1 = str_split(animals.1,"; ")
  ) %>%
  select(animals_1,year) %>%
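  unnest()  # with tidyr >= 1.0.0, name the column explicitly, e.g. unnest(animals_1); the same applies to the later unnest() calls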
# # A tibble: 7 x 2
#   year  animals_1
#  <chr> <chr>    
# 1 2001  cat      
# 2 2001  dog      
# 3 2001  bird     
# 4 2005  dog      
# 5 2005  bird     
# 6 2010  bird     
# 7 2018  dog 

# Similarly convert the animals.2 column to long data
animals_2_long <- df %>%
  rowwise() %>%
  mutate(
    animals_2 = str_split(animals.2,"; ")
  ) %>%
  select(animals_2,year) %>%
  unnest()

# Since we want to match for the last 5 years, create a match index for year-4 to year.
animals_2_long_extend_5yrs <- animals_2_long %>%
  rename(index_year = year) %>%
  rowwise() %>%
  mutate(match_year = list(as.character((as.numeric(index_year)-4):as.numeric(index_year)))) %>%
  unnest()
# # A tibble: 40 x 3
# index_year animals_2 match_year
#    <chr>      <chr>     <chr>     
# 1  2001       cat       1997      
# 2  2001       cat       1998      
# 3  2001       cat       1999      
# 4  2001       cat       2000      
# 5  2001       cat       2001      
# 6  2001       dog       1997      
# 7  2001       dog       1998      
# 8  2001       dog       1999      
# 9  2001       dog       2000      
# 10 2001       dog       2001

At this point the animals_1 data is in long format with one animal/year per row. The animals_2 data is in long format with one animal/match_year/index_year per row. This lets the second dataset cover all of the last five years in a single join and still be summed up by the year we were originally interested in.

Joining the two long datasets on match_year and then filtering on the animal name leaves only the rows where both the year and the animal match. Counting the rows that remain for each index_year then gives the result.

# Join the long data and the long data with the extended match index
animal_check <- animals_1_long %>%
  rename(match_year = year) %>%
  left_join(animals_2_long_extend_5yrs, by = "match_year") %>%
  filter(animals_1 == animals_2) %>%
  # group by the index year and summarize the count
  group_by(index_year) %>%
  summarise(count = n()) %>%
  rename(year = index_year)
# # A tibble: 3 x 2
#   year  count
#   <chr> <int>
# 1 2001      3
# 2 2005      4
# 3 2010      1

At this point the calculation is done. All that is left is adding the count back to the data with the animals.

# Join the yearly result back to the original dataframe
df <- df %>%
  left_join(animal_check, by = "year")
df
#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018    NA
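
The 2018 row comes back as NA because no matches were found for that year, while the desired output in the question shows 0. A minimal sketch of filling it in with base R follows; the small data frame is a hypothetical stand-in for the joined result, not output copied from a run.

```r
# Hypothetical stand-in for the year/count columns of the joined result
df <- data.frame(
  year = c("2001", "2005", "2010", "2018"),
  count = c(3L, 4L, 1L, NA),
  stringsAsFactors = FALSE
)

# Years without any match have no row in the summary, so the join leaves NA;
# treat those as zero occurrences
df$count[is.na(df$count)] <- 0L
df
```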

Update:

# Data for benchmark:
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), 
                animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""), 
                stringsAsFactors = F)

df <- replicate(10000,{df}, simplify=F) %>% do.call(rbind, .)
df$year <- as.character(seq(2000,2000 + nrow(df) - 1))
# microbenchmark results
      min       lq     mean   median       uq      max neval
 5.785196 5.950748 6.642028 6.981055 7.001854 7.491287     5

Answer 2:

A base R approach with mapply():

within(df,
  count <- mapply(function(x, y) {
    in5year <- paste(animals.2[year %in% (x-4):x], collapse = "; ")
    sum(strsplit(in5year, "; ")[[1]] %in% strsplit(y, "; ")[[1]])
  }, year, animals.1)
)

#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018     0

This assumes the year column is numeric. In the question's data it is stored as character, so convert it first, for example with df$year <- as.numeric(df$year).
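
For completeness, here is a sketch that puts the conversion and the mapply() call together on the question's data. The variable name res is chosen for illustration; everything else follows the code in this answer.

```r
df <- data.frame(
  animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
  animals.2 = c("cat; dog; bird", "dog; bird; seal", "bird", ""),
  year = c("2001", "2005", "2010", "2018"),
  stringsAsFactors = FALSE
)

# year is character in the question's data; arithmetic on it needs numbers
df$year <- as.numeric(df$year)

res <- within(df,
  count <- mapply(function(x, y) {
    # all animals.2 entries from the five-year window ending in year x
    in5year <- paste(animals.2[year %in% (x - 4):x], collapse = "; ")
    # count how many of them are among the animals listed in animals.1
    sum(strsplit(in5year, "; ")[[1]] %in% strsplit(y, "; ")[[1]])
  }, year, animals.1)
)
res$count
```

Rows with an empty animals.2 window contribute nothing, so the last row gets 0 rather than NA.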

Source: https://stackoverflow.com/questions/54767043/count-number-of-occurrences-of-a-string-from-a-column-inside-another-column-wit

Tags

count

unique