Slice a string at consecutive indices with R / Rcpp?

旧城冷巷雨未停 submitted on 2019-12-19 03:36:39

Question


I want to write a function that slices a 'string' into a vector, sequentially, at a given index. I have a fairly adequate R solution for it; however, I figure that writing the code in C/C++ would likely be faster. For example, I'd like to be able to write a function 'strslice' that operates as follows:

x <- "abcdef"
strslice( x, 2 ) ## should return c("ab", "cd", "ef")

However, I'm not sure how to handle treating elements of the 'CharacterVector' passed around in the Rcpp code as strings. This is what I imagine might work (given my lack of C++/Rcpp knowledge I'm sure there's a better approach):

f <- rcpp( signature(x="character", n="integer"), '
  std::string myString = Rcpp::as<std::string>(x);
  int cutpoint = Rcpp::as<int>(n);
  vector<std::string> outString;
  int len = myString.length();
  for( int i=0; i<len/n; i=i+n ) {
    outString.push_back( myString.substr(i,i+n-1 ) );
    myString = myString.substr(i+n, len-i*n);
  }
  return Rcpp::wrap<Rcpp::CharacterVector>( outString );
  ')

For the record, the corresponding R code I have is:

strslice <- function(x, n) {
  x <- as.data.frame( stringsAsFactors=FALSE, 
                      matrix( unlist( strsplit( x, "" ) ), ncol=n, byrow=T )
  )

  do.call( function(...) { paste(..., sep="") }, x )

}

...but I figure jumping around between data structures so much will slow things down with very large strings.

(Alternatively: is there a way to coerce 'strsplit' into behaving as I want?)


Answer 1:


I would use substring. Something like this:

strslice <- function( x, n ){   
    starts <- seq( 1L, nchar(x), by = n )
    substring( x, starts, starts + n-1L )
}
strslice( "abcdef", 2 )
# [1] "ab" "cd" "ef"

As for your Rcpp code, you could allocate the std::vector<std::string> at the right size up front, which avoids resizing it (and the memory reallocations that may come with that), or perhaps use an Rcpp::CharacterVector directly. Something like this:

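library(inline)  # rcpp() is assumed here to come from the inline package
strslice_rcpp <- rcpp( signature(x="character", n="integer"), '
    std::string myString = as<std::string>(x);
    int cutpoint = as<int>(n);
    int len = myString.length();
    // integer division: a trailing partial chunk is dropped
    int nout = len / cutpoint ;
    CharacterVector out( nout ) ;
    for( int i=0; i<nout; i++ ) {
      // substr takes (start, length); use cutpoint rather than a hard-coded 2
      out[i] = myString.substr( cutpoint*i, cutpoint ) ;
    }
    return out ;
')
strslice_rcpp( "abcdef", 2 )
# [1] "ab" "cd" "ef"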
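
The other option mentioned above, a std::vector<std::string> sized up front, is not shown in the answer. A minimal sketch of it in plain C++ (no Rcpp) might look like the following. The name slice_fixed_width is hypothetical, and the sketch assumes the same policy as the Rcpp version: a trailing piece shorter than the width is dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the original answer: split `s` into
// consecutive pieces of `width` bytes. A trailing piece shorter than
// `width` is dropped, mirroring the len / cutpoint loop bound above.
std::vector<std::string> slice_fixed_width(const std::string& s, std::size_t width) {
    assert(width > 0 && "width must be positive");
    const std::size_t n_out = s.size() / width;  // number of full pieces
    std::vector<std::string> out;
    out.reserve(n_out);  // reserve the vector's storage once so push_back never regrows it
    for (std::size_t i = 0; i < n_out; ++i) {
        // substr takes (start position, length), not (start, end)
        out.push_back(s.substr(i * width, width));
    }
    return out;
}
```

Calling slice_fixed_width("abcdef", 2) should give the pieces "ab", "cd", "ef". Note that std::string counts bytes, so multi-byte encoded text would need different handling.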



Answer 2:


This one-liner using strapplyc from the gsubfn package is fast enough that Rcpp may not be needed. Here we apply it to the entire text of James Joyce's Ulysses, which takes only a few seconds:

library(gsubfn)
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt") 
joycec <- paste(joyce, collapse = " ") # all in one string 
n <- 2
system.time(s <- strapplyc(joycec, paste(rep(".", n), collapse = ""))[[1]])
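
The pattern built above is just n dots (".." for n = 2), so each match consumes the next n characters and a shorter leftover at the end is never matched. As a rough illustration of the same idea outside R, here is a sketch using std::regex. The name slice_by_regex is hypothetical, and this is not how gsubfn is implemented. It also assumes the input contains no line terminators, since "." does not match them in the default ECMAScript grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect consecutive non-overlapping matches of a
// pattern made of `width` dots, e.g. ".." for width 2.
std::vector<std::string> slice_by_regex(const std::string& s, std::size_t width) {
    assert(width > 0 && "width must be positive");
    const std::regex pattern(std::string(width, '.'));
    std::vector<std::string> out;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(s.begin(), s.end(), pattern), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For plain fixed-width slicing, direct substr calls are likely to be faster than a regex engine; the sketch is only meant to show why the dot pattern yields the desired pieces.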


Source: https://stackoverflow.com/questions/13319858/slice-a-string-at-consecutive-indices-with-r-rcpp
