Faster way to split a string and count characters using R?

太阳男子 2021-02-01 08:51

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the let

6 answers

忘掉有多难 (original poster)

2021-02-01 09:17
Thanks to everyone for this post.

To optimize a script that calculates the GC content of 100M sequences of 200 bp each, I ended up testing the different methods proposed here. Ken Williams' method performed best of the original suggestions (2.5 hours), beating seqinr (3.6 hours). Using stringr's str_count reduced the run time to 1.5 hours.

In the end I coded it in C++ and called it through Rcpp, which cut the computation time down to 10 minutes!

Here is the C++ code:
```
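#include <Rcpp.h>
#include <string>
using namespace Rcpp;

// Returns the percentage of G and C characters in s (upper case only).
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
  int count = 0;

  // Count every G or C in a single pass over the string.
  for (std::size_t i = 0; i < s.size(); i++) {
    if (s[i] == 'G' || s[i] == 'C') count++;
  }

  // Convert the count to a percentage of the sequence length.
  float pGC = (float)count / s.size();
  pGC = pGC * 100;
  return pGC;
}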
```
I call it from R with:
```
library(Rcpp)  # provides sourceCpp()
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")
```