Faster way to split a string and count characters using R?

太阳男子 2021-02-01 08:51

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the let

6 answers
  •  忘掉有多难
    2021-02-01 09:17

    Thanks to all for this post.

    To optimize a script that calculates the GC content of 100M sequences of 200 bp each, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count reduced that to 1.5 hours.

    In the end I coded it in C++ and called it using Rcpp, which cuts the computation time down to 10 minutes!

    Here is the C++ code:

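    #include <Rcpp.h>
    using namespace Rcpp;

    // Percentage of G and C characters in s (upper-case only).
    // [[Rcpp::export]]
    double pGC_cpp(std::string s) {
      int count = 0;

      // size_t index avoids a signed/unsigned comparison with s.size()
      for (std::size_t i = 0; i < s.size(); i++) {
        if (s[i] == 'G' || s[i] == 'C') count++;
      }

      // An empty string gives 0/0, i.e. NaN
      return 100.0 * count / s.size();
    }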
    

    I call it from R with:

    library(Rcpp)
    sourceCpp("pGC_cpp.cpp")
    pGC_cpp("ATGCCC")
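
    With 100M sequences, calling a function once per string from R adds overhead of its own, so it may help to pass the whole batch in one call. The following is only a sketch under assumptions, not something I have benchmarked: the names gc_percent and gc_percent_all are made up for illustration, lower-case g/c are counted as well, and an empty sequence returns 0 by arbitrary choice. It is plain C++ with no Rcpp dependency; to call it from R you would presumably add #include <Rcpp.h> and a // [[Rcpp::export]] line above gc_percent_all, relying on Rcpp's conversion between R character vectors and std::vector<std::string>.

    ```cpp
    #include <cstddef>
    #include <string>
    #include <vector>

    // GC percentage of one sequence; counts upper- and lower-case G/C.
    // Returns 0 for an empty string (an arbitrary choice for this sketch).
    double gc_percent(const std::string& s) {
      if (s.empty()) return 0.0;
      std::size_t count = 0;
      for (char c : s) {
        if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++count;
      }
      return 100.0 * static_cast<double>(count) / static_cast<double>(s.size());
    }

    // Hypothetical batch wrapper: one GC percentage per input sequence.
    std::vector<double> gc_percent_all(const std::vector<std::string>& seqs) {
      std::vector<double> out;
      out.reserve(seqs.size());
      for (const std::string& s : seqs) out.push_back(gc_percent(s));
      return out;
    }
    ```

    Passing the string by const reference avoids copying each sequence, which the by-value signature above does on every call.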
    
