问题
I am generating random DNA sequences in R where each sequence is of a set length and contains a user-specified distribution of nucleotides.
What I want to be able to do is ensure certain runs of nucleotides are NOT generated in a given sequence. The runs that are disallowed are: "aga", "agg", "taa", "tag" and "tga".
Here is my code that simply generates sequences where the above runs MAY occur. I am unsure how best to modify the code to account for the "tabu" runs specified above.
library(ape)
length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies
# DNA alphabet
nucl <- as.DNAbin(c('a', 'c', 'g', 't')) # A, C, G, T
# Randomly sample nucleotides
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs)
I am thinking to simply list all the allowed runs which would be used in place of 'nucl' and specify 'size' = length.seqs / 3 within the sample() function, but this seems cumbersome, even with shortcuts like 'expand.grid()'.
回答1:
You could regex your way to it like this:
length.seqs <- 100 # length of DNA sequence
nucl.freqs <- rep(1/4, 4) # nucleotide frequencies
nucl <- c('a', 'c', 'g', 't') # A, C, G, T
seqs <- sample(nucl, size = length.seqs, replace = TRUE, prob = nucl.freqs)
bad_codons <- c("aga", "agg", "taa", "tag", "tga")
regx <- paste0("(", paste(bad_codons, collapse = ")|("), ")")
s <- paste(seqs, collapse = "")
while( grepl(regx, s) ) {
s <- gsub(regx,
paste(sample(nucl, size = 3, replace = TRUE, prob = nucl.freqs), collapse = ""),
s)
}
s
grepl(regex, s)
The idea is to replace the bad codons with fresh simulations until no more bad codons exist. If you need performance over long or lots of sequences this might not be a good route.
来源:https://stackoverflow.com/questions/61396134/generating-dna-codon-combinations-in-r