Question
I currently face the following issue.
I want to write R code that creates a new column, e.g. reviews_last30days, in my main dataframe listings. The column should count all reviews for each unique listings$ID.
The unique reviews per ID are listed in another dataframe like this:
REVIEWS
ID review_date
1 2015-12-30
1 2015-12-31
1 2016-10-27
2 2014-05-10
2 2016-10-19
2 2016-10-22
2 2016-10-23
I also need to add a date condition, e.g. such that only reviews from the 30 days before last_scrape are considered.
Hence my result should look somewhat like the third column (UPDATE: see EDIT for a better description of the intended result):
LISTINGS
ID last_scrape reviews_last30days
1 2016-11-15 1
2 2016-11-15 3
So finally, the column reviews_last30days should count the review_date entries for each ID that fall within the 30 days before last_scrape.
I have already converted both date columns with as.Date() using the format "%Y-%m-%d".
Sorry if my problem might not be formulated clearly enough for you guys, it's quite hard to explain or visualize, but in terms of code it hopefully shouldn't be that complicated after all.
EDIT for clarification
Besides the input REVIEWS shown above, I have a second input dataframe, call it OVERVIEW, that currently looks somewhat like this in simplified form:
OVERVIEW
ID last_scrape
1 2016-11-15
2 2016-11-15
3 2016-11-15
4 2017-01-15
5 2017-01-15
6 2017-01-15
7 2017-01-15
etc
So what I actually need is code that counts all entries of review_date for which the ID from OVERVIEW matches the ID in REVIEWS and the review_date from REVIEWS is at most 30 days before the last_scrape in OVERVIEW.
The code should then ideally assign this newly calculated value as a new column in OVERVIEW like this:
OVERVIEW
ID last_scrape rev_last30days
1 2016-11-15 1
2 2016-11-15 3
3 2016-11-15 ..
4 2017-01-15 ..
5 2017-01-15 ..
6 2017-01-15 ..
7 2017-01-15 ..
etc
#2 EDIT - hopefully my last ;)
Thanks for your help so far @mfidino! Running your latest code still results in one error, namely the following:
TOTALREV$review_date <- ymd(TOTALREV$review_date)
TOTALLISTINGS$last_scraped.calc <- ymd(TOTALLISTINGS$last_scraped.calc)
gen_listings <- function(review = NULL, overview = NULL){
  # tibble to return
  to_return <- review %>%
    inner_join(., overview, by = 'listing_id') %>%
    group_by(listing_id) %>%
    summarise(last_scraped.calc = unique(last_scraped.calc),
              reviews_last30days = sum(review_date >= (last_scraped.calc-30)))
  return(to_return)
}
REVIEWCOUNT <- gen_listings(TOTALREV, TOTALLISTINGS)
Error: Column `last_scraped.calc` must be length 1 (a summary value), not 2
Do you have any idea how to fix this error?
NOTE: I used the names from my original file; the code should otherwise be the same.
If it helps, here are some properties of the vector last_scraped.calc:
$ last_scraped.calc : Date, format: "2018-08-07" "2018-08-07" ...
typeof(TOTALLISTINGS$last_scraped.calc)
[1] "double"
length(TOTALLISTINGS$last_scraped.calc)
[1] 549281
and
unique(TOTALLISTINGS$last_scraped.calc)
 [1] "2018-08-07" "2019-01-13" "2018-08-15" "2019-01-16" "2018-08-14" "2019-01-15" "2019-01-14" "2019-01-22"
 [9] "2018-08-22" "2018-08-21" "2019-01-28" "2018-08-20" "2019-01-23" "2019-01-31" "2018-08-09" "2018-08-10"
[17] "2018-08-08" "2018-08-16"
Any further help much appreciated - thanks in advance!
Answer 1:
You can do this pretty easily with dplyr. I am also using lubridate::ymd() here instead of as.Date().
library(lubridate)
library(dplyr)
REVIEWS <- data.frame(ID = c(1,1,1,2,2,2,2),
                      review_date = c("2015-12-30",
                                      "2015-12-31",
                                      "2016-10-27",
                                      "2014-05-10",
                                      "2016-10-19",
                                      "2016-10-22",
                                      "2016-10-23"))
REVIEWS$review_date <- ymd(REVIEWS$review_date)
# Here last_scrape is taken to be the latest review date per ID
LISTINGS <- REVIEWS %>% group_by(ID) %>%
  summarise(last_scrape = max(review_date),
            reviews_last30days = sum(review_date >= (max(review_date)-30)))
The resulting LISTINGS has the expected counts (last_scrape here is simply the latest review date per ID):
# A tibble: 2 x 3
ID last_scrape reviews_last30days
<dbl> <date> <int>
1 1 2016-10-27 1
2 2 2016-10-23 3
EDIT:
If, instead, you want last_scrape to be an input rather than the latest review date per group, you can modify the code as follows. This assumes that there can be multiple last_scrape dates per ID:
library(lubridate)
library(dplyr)
REVIEWS <- data.frame(ID = c(1,1,1,2,2,2,2),
                      review_date = c("2015-12-30",
                                      "2015-12-31",
                                      "2016-10-27",
                                      "2014-05-10",
                                      "2016-10-19",
                                      "2016-10-22",
                                      "2016-10-23"))
REVIEWS$review_date <- ymd(REVIEWS$review_date)
OVERVIEW <- data.frame(ID = rep(1:7, 2),
                       last_scrape = c("2016-11-15",
                                       "2016-11-15",
                                       "2016-11-15",
                                       "2017-01-15",
                                       "2017-01-15",
                                       "2017-01-15",
                                       "2017-01-15",
                                       "2016-11-20",
                                       "2016-11-20",
                                       "2016-11-20",
                                       "2017-01-20",
                                       "2017-01-20",
                                       "2017-01-20",
                                       "2017-01-20"))
OVERVIEW$last_scrape <- ymd(OVERVIEW$last_scrape)
gen_listings <- function(review = NULL, overview = NULL){
  # tibble to return
  to_return <- review %>%
    inner_join(., overview, by ='ID') %>%
    group_by(ID, last_scrape) %>%
    summarise(
      reviews_last30days = sum(review_date >= (last_scrape-30)))
  return(to_return)
}
LISTINGS <- gen_listings(REVIEWS, OVERVIEW)
The output of this LISTINGS object is:
ID last_scrape reviews_last30days
<dbl> <date> <int>
1 1 2016-11-15 1
2 1 2016-11-20 1
3 2 2016-11-15 3
4 2 2016-11-20 2
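Two things are worth noting about the result above. First, inner_join() drops every listing that has no review at all, so IDs 3 to 7 never appear. Second, grouping by both ID and last_scrape guarantees one scrape date per group, which is what the "must be length 1" error in the question's second edit was complaining about (that code grouped by listing_id only, so unique() could return two dates). A minimal sketch of a variant that keeps zero-review listings is shown below. The function name count_recent_reviews and the argument window_days are hypothetical, the upper bound review_date <= last_scrape is an added assumption, and the toy OVERVIEW has one scrape date per ID. It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Toy data copied from the question.
REVIEWS <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  review_date = as.Date(c("2015-12-30", "2015-12-31", "2016-10-27",
                          "2014-05-10", "2016-10-19", "2016-10-22",
                          "2016-10-23"))
)
# Assumption: one scrape date per ID; ID 3 has no reviews.
OVERVIEW <- data.frame(
  ID = c(1, 2, 3),
  last_scrape = as.Date(c("2016-11-15", "2016-11-15", "2016-11-15"))
)

# Hypothetical helper: start from OVERVIEW so every listing is kept.
count_recent_reviews <- function(review, overview, window_days = 30) {
  overview %>%
    left_join(review, by = "ID") %>%       # unmatched IDs get review_date = NA
    group_by(ID, last_scrape) %>%          # exactly one scrape date per group
    summarise(
      # NA comparisons are dropped by na.rm, so listings without reviews count 0
      reviews_last30days = sum(review_date >= last_scrape - window_days &
                                 review_date <= last_scrape, na.rm = TRUE),
      .groups = "drop"
    )
}

LISTINGS <- count_recent_reviews(REVIEWS, OVERVIEW)
```

With this toy data, LISTINGS has one row per ID in OVERVIEW, including ID 3 with a count of 0.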
Answer 2:
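Similar to the answer above, using the latest review date per ID as the reference date:
# REVIEWS is the reviews dataframe from the question (called REV in my session)
REVIEWS %>% group_by(ID) %>%
  mutate(rev_latest = max(review_date)) %>%
  filter(rev_latest - review_date < 30) %>%
  count(ID)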
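For comparison, the COUNTIFS-style logic from the question (ID matches, and review_date lies within 30 days before last_scrape) can also be sketched in base R without dplyr. This is only an illustration under assumptions: the toy OVERVIEW below has one row per ID, and the upper bound review_date <= last_scrape is added here and is not part of the answers above. It loops over the rows of OVERVIEW, so it may be slow for very large tables.

```r
# Toy data copied from the question.
REVIEWS <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  review_date = as.Date(c("2015-12-30", "2015-12-31", "2016-10-27",
                          "2014-05-10", "2016-10-19", "2016-10-22",
                          "2016-10-23"))
)
OVERVIEW <- data.frame(
  ID = c(1, 2, 3),
  last_scrape = as.Date(c("2016-11-15", "2016-11-15", "2016-11-15"))
)

# For each row of OVERVIEW, count the reviews that satisfy all conditions,
# which is what Excel's COUNTIFS does with several criteria ranges.
OVERVIEW$rev_last30days <- vapply(
  seq_len(nrow(OVERVIEW)),
  function(i) {
    in_window <- REVIEWS$ID == OVERVIEW$ID[i] &
      REVIEWS$review_date >= OVERVIEW$last_scrape[i] - 30 &
      REVIEWS$review_date <= OVERVIEW$last_scrape[i]
    sum(in_window)  # number of TRUE values
  },
  integer(1)
)
```

The new column is written straight into OVERVIEW, which matches the layout requested in the question's first edit.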
Source: https://stackoverflow.com/questions/56023458/is-there-an-r-function-mirroring-excel-countifs-with-date-range-as-condition