Generating DNA codon combinations in R

好久不见. Submitted on 2020-05-16 04:27:09

Question


I am generating random DNA sequences in R where each sequence is of a set length and contains a user-specified distribution of nucleotides.

What I want to be able to do is ensure certain runs of nucleotides are NOT generated in a given sequence. The runs that are disallowed are: "aga", "agg", "taa", "tag" and "tga".

Here is my code that simply generates sequences where the above runs MAY occur. I am unsure how best to modify the code to account for the "tabu" runs specified above.

library(ape)

length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies

# DNA alphabet
nucl <- as.DNAbin(c('a', 'c', 'g', 't')) # A, C, G, T

# Randomly sample nucleotides
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs) 

I am thinking of simply listing all the allowed runs, which would be used in place of 'nucl', and specifying 'size' = length.seqs / 3 within the sample() function, but this seems cumbersome, even with shortcuts like 'expand.grid()'.
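The idea described above can be sketched roughly as follows. This is only a minimal illustration under several assumptions that are not stated in the question: the disallowed runs matter only in-frame (as codons starting at positions 1, 4, 7, ...), each allowed codon is weighted by the product of its nucleotide frequencies, and the last codon is truncated when length.seqs is not a multiple of 3. The names 'grid', 'codons', 'codon.probs', 'keep' and 'seq.codons' are made up for the example.

```r
# Sketch: sample whole codons from the allowed set (in-frame only).
length.seqs <- 100                      # length of DNA sequence
nucl <- c("a", "c", "g", "t")           # A, C, G, T
nucl.freqs <- rep(1/4, 4)               # nucleotide frequencies
names(nucl.freqs) <- nucl               # allow lookup by nucleotide letter

bad_codons <- c("aga", "agg", "taa", "tag", "tga")

# All 4^3 = 64 possible codons
grid <- expand.grid(p1 = nucl, p2 = nucl, p3 = nucl, stringsAsFactors = FALSE)
codons <- paste0(grid$p1, grid$p2, grid$p3)

# Weight of each codon if its three nucleotides were drawn independently
codon.probs <- unname(nucl.freqs[grid$p1] * nucl.freqs[grid$p2] * nucl.freqs[grid$p3])

# Drop the disallowed codons; sample() rescales the remaining weights itself
keep <- !(codons %in% bad_codons)

n.codons <- ceiling(length.seqs / 3)
sampled <- sample(codons[keep], size = n.codons, replace = TRUE,
                  prob = codon.probs[keep])

# Join the codons and trim to the requested length
seq.codons <- substr(paste(sampled, collapse = ""), 1, length.seqs)
```

Two caveats with this sketch: disallowed runs can still appear out of frame (across codon boundaries), and removing codons shifts the realised nucleotide frequencies slightly away from nucl.freqs.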


Answer 1:


You could regex your way to it like this:

length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies
nucl <- c('a', 'c', 'g', 't') # A, C, G, T

seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs)

bad_codons <- c("aga", "agg", "taa", "tag", "tga")

regx <- paste0("(", paste(bad_codons, collapse = ")|("), ")")

s <- paste(seqs, collapse = "")

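# Repeat until no disallowed run remains. Note that each pass draws one
# random triplet and gsub() substitutes it for every match found in that pass.
while( grepl(regx, s) ) {
  s <- gsub(regx,
            paste(sample(nucl, size = 3, replace = TRUE, prob = nucl.freqs), collapse = ""),
            s)
}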

s
grepl(regx, s) # FALSE once the loop has finished

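The idea is to replace the bad codons with freshly sampled triplets until no bad codons remain. Note that the regex matches the runs at any position in the sequence, not only in-frame. If you need performance over long sequences or many sequences, this might not be a good route.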
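One detail of the loop above is that gsub() inserts the same random triplet at every match within a pass. A possible variation, given here only as a sketch, draws an independent triplet for each match using gregexpr() and `regmatches<-`, and caps the number of passes. The function name 'replace_bad_runs' and the 'max.iter' argument are hypothetical and not part of the original answer.

```r
# Sketch: replace each disallowed run with its own freshly sampled triplet.
replace_bad_runs <- function(s, bad, nucl, nucl.freqs, max.iter = 1000L) {
  regx <- paste(bad, collapse = "|")
  for (i in seq_len(max.iter)) {
    m <- gregexpr(regx, s)             # non-overlapping matches in s
    n.hits <- sum(m[[1]] > 0)          # gregexpr() returns -1 when nothing matches
    if (n.hits == 0) return(s)
    # One independent random triplet per match
    fresh <- replicate(n.hits,
                       paste(sample(nucl, size = 3, replace = TRUE, prob = nucl.freqs),
                             collapse = ""))
    regmatches(s, m) <- list(fresh)    # substitute the matches in place
  }
  stop("disallowed runs still present after max.iter passes")
}

length.seqs <- 100
nucl <- c("a", "c", "g", "t")
nucl.freqs <- rep(1/4, 4)
bad_codons <- c("aga", "agg", "taa", "tag", "tga")

s0 <- paste(sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs),
            collapse = "")
s1 <- replace_bad_runs(s0, bad_codons, nucl, nucl.freqs)
grepl(paste(bad_codons, collapse = "|"), s1)
```

As with the original loop, a replacement can create a new disallowed run at its boundaries, which is why the check is repeated after every pass. The sequence length is unchanged because each three-letter match is swapped for another three letters.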



Source: https://stackoverflow.com/questions/61396134/generating-dna-codon-combinations-in-r
