Range join data.frames - specific date column with date ranges/intervals in R

孤城傲影 2020-12-15 13:06

Although the details of this are, of course, app specific, in the SO spirit I'm trying to keep this as general as possible! The basic problem is how to merge data.frames by

2 answers
  •  无人及你
    2020-12-15 13:33

    Here's an approach using sqldf(...) from the sqldf package. This produces your result, with the following exceptions:

    1. The Member.n columns contain values in alphabetical order, rather than the order in which they appear in the History data frame. So Member.1 would contain c and Member.2 would contain f, rather than the other way around.
    2. Your result set has all the role-related columns as factors, whereas this result set has them as character. If that matters, it can easily be changed.
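
    If factors are needed, the conversion could look roughly like the sketch below. The small `result` data frame here is a hypothetical stand-in for the real output (only its column types matter), and the assumption that every column other than id, Name and Date is role-related is mine.

    ```r
    # Hypothetical stand-in for the final `result`; only the column types matter here
    result <- data.frame(
      id = c("1", "2"), Name = c("A", "B"),
      Chair = c(NA, "x"), Member.1 = c("c", "g"),
      stringsAsFactors = FALSE
    )

    # Assumption: everything except the key columns is a role-related column
    role_cols <- setdiff(names(result), c("id", "Name", "Date"))

    # Convert only those columns from character to factor
    result[role_cols] <- lapply(result[role_cols], factor)
    str(result)
    ```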

    Note that Speeches and History are used for the input data frames, and I use your Output dataframe to get the columns' order only.

    library(sqldf)    # for sqldf(...)
    library(reshape2) # for dcast(...)
    
    colnames(History)[4:5] <- c("Start","End")   # sqldf doesn't like "." in colnames
    Speeches$id <- rownames(Speeches)            # need unique id column
    result <- sqldf("select a.id, a.Name, a.Date, b.Role, b.Value 
                    from Speeches a, History b 
                    where a.Name=b.Name and a.Date between b.Start and b.End")
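    # label repeated "Member" roles as Member.1, Member.2, ... within each speech
    Roles <- aggregate(Role~Name+Date+id,result,function(x)
      ifelse(x=="Member",paste(x,seq_along(x),sep="."),as.character(x)))$Role
    # assigned by position: rows of result must follow the order of the aggregate() groups
    result$Roles <- unlist(Roles)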
    result <- dcast(result,Name+Date+id~Roles,value.var="Value")
    result <- result[order(result$id),]   # re-order the rows
    result <- result[,colnames(Output)]   # re-order the columns
    

    Explanation

    • First, we need an id column in Speeches to differentiate between the replicated columns in the result. So we use the row names for that.
    • Second, we use sqldf(...) to merge the Speeches and History tables based on your criteria. Because you want dates to match based on a range, this may be the best approach.
    • Third, we have to convert multiple instances of "Member" into "Member.1", "Member.2", etc. We do this using aggregate(...) and paste(...).
    • Fourth, we have to convert the result of the SQL query, which is in "long" format (all Values in one column, distinguished by a second column, Roles), into "wide" format, with the values for each role in separate columns. We do this using dcast(...).
    • Finally, we reorder the rows and columns to be consistent with your result.
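
    For readers without sqldf or reshape2, the same steps can be sketched in base R. This is not the code above but a swapped-in equivalent: merge(...) plus a date filter instead of the SQL between, ave(...) instead of aggregate(...), and reshape(...) instead of dcast(...). The toy Speeches and History frames are made up for illustration, since the original data is not shown here.

    ```r
    # Toy data (hypothetical; stands in for the question's Speeches and History)
    Speeches <- data.frame(
      Name = c("A", "A", "B"),
      Date = as.Date(c("2001-03-01", "2005-06-15", "2003-01-10")),
      stringsAsFactors = FALSE
    )
    History <- data.frame(
      Name  = c("A", "A", "A", "B"),
      Role  = c("Member", "Member", "Chair", "Member"),
      Value = c("c", "f", "x", "g"),
      Start = as.Date(c("2000-01-01", "2000-01-01", "2004-01-01", "2002-01-01")),
      End   = as.Date(c("2002-12-31", "2002-12-31", "2006-12-31", "2004-12-31")),
      stringsAsFactors = FALSE
    )

    Speeches$id <- seq_len(nrow(Speeches))   # unique id per speech

    # Range join: equi-join on Name, then keep rows with Start <= Date <= End
    joined <- merge(Speeches, History, by = "Name")
    joined <- joined[joined$Date >= joined$Start & joined$Date <= joined$End, ]

    # Number repeated roles within each speech (values in alphabetical order)
    joined <- joined[order(joined$id, joined$Role, joined$Value), ]
    n <- ave(seq_along(joined$id), joined$id, joined$Role, FUN = seq_along)
    joined$Roles <- ifelse(joined$Role == "Member",
                           paste(joined$Role, n, sep = "."), joined$Role)

    # Long to wide: one Value.<role> column per distinct role label
    wide <- reshape(joined[, c("id", "Name", "Date", "Roles", "Value")],
                    idvar = "id", timevar = "Roles", v.names = "Value",
                    direction = "wide")
    wide <- wide[order(wide$id), ]           # back to the order of Speeches
    print(wide)
    ```

    Note that merge(...) first builds every Name match before filtering, so on large inputs this may use far more memory than a join that applies the date condition directly.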
