Question:
Is it possible to write a C++ function that takes an R data frame as input, modifies it (in our case by taking a subset), and returns the new data frame? My code below may make the question clearer:
code:
    # Suppose I have the data frame below created in R:
    myDF = data.frame(id = rep(c(1,2), each = 5), alph = letters[1:10], mess = rnorm(10))

    # Suppose I want to write a C++ function that gets id as input and returns
    # a sub-dataframe corresponding to that id (**If it's possible to return
    # DataFrame in C++**)

    # Auxiliary function --> helps get a sub vector:
    arma::vec myVecSubset(arma::vec vecMain, arma::vec IDVec, int ID){
      arma::uvec AuxVec = find(IDVec == ID);
      arma::vec rslt = arma::vec(AuxVec.size());
      for (int i = 0; i < AuxVec.size(); i++){
        rslt[i] = vecMain[AuxVec[i]];
      }
      return rslt;
    }

    # Here is my C++ function:
    Rcpp::DataFrame myVecSubset(Rcpp::DataFrame myDF, int ID){
      arma::vec id = Rcpp::as<arma::vec>(myDF["id"]);
      arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]);
      arma::vec mess = Rcpp::as<arma::vec>(myDF["mess"]);

      // here I take a sub-vector:
      arma::vec id_sub = myVecSubset(id, id, ID);
      arma::vec alph_sub = myVecSubset(alph, id, ID);
      arma::vec mess_sub = myVecSubset(mess, id, ID);

      // here is the CHALLENGE: How to combine these vectors into a new data frame???
      ???
    }
In summary, there are two main questions:
1) Is there a better way to take the sub-dataframe above in C++? (I wish I could simply say myDF[myDF$id == ID,]!)
2) Is there any way to combine id_sub, alph_sub, and mess_sub into an R data frame and return it?
I really appreciate your help.
Answer 1:
You don't need Rcpp and RcppArmadillo for that; you can just use R's subset or perhaps dplyr::filter. This is likely to be more efficient than your code, which has to deep copy data from the data frame into armadillo vectors, create new armadillo vectors, and then copy these back into R vectors so that you can build the data frame. This produces lots of waste. Another source of waste is that you call find three times on exactly the same thing.
Anyway, to answer your question, just use DataFrame::create.
    DataFrame::create(
      _["id"]    = id_sub,
      _["alpha"] = alph_sub,
      _["mess"]  = mess_sub
    ) ;
Also, note that in your code, alpha will be a factor, so arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]); is not likely to do what you want.
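The point about repeating the same find can be illustrated without any R dependency. The following is a minimal sketch in plain C++, assuming std::vector columns as stand-ins for the data frame columns; matching_rows and gather are hypothetical helper names introduced here, not part of Rcpp or Armadillo. It computes the matching row positions once and reuses them for every column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: 0-based positions i where ids[i] == target.
std::vector<std::size_t> matching_rows(const std::vector<int>& ids, int target) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] == target) rows.push_back(i);
    }
    return rows;
}

// Hypothetical helper: copy the elements of one column at the given positions.
template <typename T>
std::vector<T> gather(const std::vector<T>& column,
                      const std::vector<std::size_t>& rows) {
    std::vector<T> out;
    out.reserve(rows.size());
    for (std::size_t r : rows) {
        assert(r < column.size());  // positions must be valid for this column
        out.push_back(column[r]);
    }
    return out;
}
```

The positions returned by a single matching_rows call can be passed to gather for the id, alph, and mess columns in turn, so the search over the id column happens only once.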
Answer 2:
To add on to Romain's answer, you can try calling the [ operator through Rcpp. If we understand how df[x, ] is evaluated (i.e., it's really a call to "[.data.frame"(df, x, R_MissingArg)), this is easy to do:
    #include <Rcpp.h>
    using namespace Rcpp;

    Function subset("[.data.frame");

    // [[Rcpp::export]]
    DataFrame subset_test(DataFrame x, IntegerVector y) {
      return subset(x, y, R_MissingArg);
    }

    /*** R
    df <- data.frame(x=1:3, y=letters[1:3])
    subset_test(df, c(1L, 2L))
    */
gives me
    > df <- data.frame(x=1:3, y=letters[1:3])
    > subset_test(df, c(1L, 2L))
      x y
    1 1 a
    2 2 b
Calling back into R from Rcpp is generally slower than staying in C++, but depending on how much of a bottleneck this is, it could still be fast enough for you.
Be careful though, as this function will use 1-based subsetting rather than 0-based subsetting for integer vectors.
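To make the indexing caveat concrete: R positions start at 1, while C++ containers start at 0, so positions have to be shifted when moving between the two conventions. A minimal sketch in plain C++, where to_zero_based is a hypothetical helper introduced only for illustration (it is not an Rcpp function) and positions below 1 are simply rejected:

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: convert R-style 1-based positions to C++ 0-based ones.
// Positions smaller than 1 are rejected in this sketch.
std::vector<int> to_zero_based(const std::vector<int>& one_based) {
    std::vector<int> zero_based;
    zero_based.reserve(one_based.size());
    for (int pos : one_based) {
        if (pos < 1) {
            throw std::invalid_argument("position must be >= 1");
        }
        zero_based.push_back(pos - 1);
    }
    return zero_based;
}
```

Under this convention, the R positions c(1L, 2L) used above correspond to the C++ positions 0 and 1.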
Answer 3:
Here is a complete test file. It does not need your extractor function and just re-assembles the subsets -- but for that it needs the very newest Rcpp as currently on GitHub where Kevin happens to have added some work on subset indexing which is just what we need here:
    #include <Rcpp.h>

    /*** R
    ## Suppose I have the data frame below created in R:
    ## NB: stringsAsFactors set to FALSE
    ## NB: setting seed as well
    set.seed(42)
    myDF <- data.frame(id = rep(c(1,2), each = 5),
                       alph = letters[1:10],
                       mess = rnorm(10),
                       stringsAsFactors = FALSE)
    */

    // [[Rcpp::export]]
    Rcpp::DataFrame extract(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

      Rcpp::IntegerVector     id = D["id"];
      Rcpp::CharacterVector alph = D["alph"];
      Rcpp::NumericVector   mess = D["mess"];

      return Rcpp::DataFrame::create(Rcpp::Named("id")    = id[idx],
                                     Rcpp::Named("alpha") = alph[idx],
                                     Rcpp::Named("mess")  = mess[idx]);
    }

    /*** R
    extract(myDF, c(2,4,6,8))
    */
With that file, we get the expected result:
    R> library(Rcpp)
    R> sourceCpp("/tmp/sepher.cpp")

    R> ## Suppose I have the data frame below created in R:
    R> ## NB: stringsAsFactors set to FALSE
    R> ## NB: setting seed as well
    R> set.seed(42)

    R> myDF <- data.frame(id = rep(c(1,2), each = 5),
    +                    alph = letters[1:10],
    +                    mess = rnorm(10),
    +  .... [TRUNCATED]

    R> extract(myDF, c(2,4,6,8))
      id alpha     mess
    1  1     c 0.363128
    2  1     e 0.404268
    3  2     g 1.511522
    4  2     i 2.018424
    R>
    R> packageDescription("Rcpp")$Version   ## unreleased version
    [1] "0.11.1.1"
    R>
I just needed something similar a few weeks ago (but not involving character vectors) and used Armadillo with its elem() functions using an unsigned int vector as index.