Alignment of multiple (non-biological, discrete state) sequences

Question

I have some data that describes an ordered set of discrete events (or states). There are 34 possible states, which may occur in any order and may repeat. Each sequence of events can contain any number of events, and crucially there are more than 2 sequences of events. My eventual aim is to cluster these sequences into similar subsets, but my hunch is that this cannot be meaningful unless these sequences are aligned such that equivalent events occupy the same position in all sequences.

I'm very familiar with multiple alignment of biological sequences, but all the software I've come across for this (MUSCLE, MAFFT, T-COFFEE, Clustal*, etc.) requires DNA, RNA or amino acid sequences. I have more states than any of these alphabets, so I can't get them to work.

I've found various implementations of the pairwise alignment algorithms such as Needleman-Wunsch in R, but so far haven't come across any generic (non-biological) implementations of any multiple sequence alignment algorithms.

For example, say my data looks like this:

1: ABCDEFG
2: ACDGH
3: BDEFEGI
4: AH
5: DEGHI

My aim is to have it look like this:

1: ABCDEF-G--
2: A-CD---GH-
3: -B-DEFE--I
4: A-------H-
5: ---DE--GHI

Here the - symbol denotes the absence of an event in that sequence. This is a simplified example; in reality I'm looking for something that penalises the opening of gaps (-) in the same way that biological MSA algorithms do.

The only piece of software I've found that seems to possibly do this is Alphamalig (http://alggen.lsi.upc.es/recerca/align/alphamalig/intro-alphamalig.html) but it's old and I can't get it working on my machine. Ideally I'd like something that can be implemented in R.

Answer 1:

I would advise using MAFFT. It is normally used to align biological sequences, but its --anysymbol option lets it align arbitrary text. Note that MAFFT is run from the shell: it reads the sequences from an input file and writes the alignment to standard output, which you redirect to an output file.

input file (mafft_anysymbol_input.txt):

>Seq1
ABCDEFG
>Seq2
ACDGH
>Seq3
BDEFEGI
>Seq4
AH
>Seq5
DEGHI

R code to run the MAFFT command:

# Make sure the input file is in R's working directory (the output file is written there too); otherwise give full paths in the mafft call.
x <- 'mafft --anysymbol mafft_anysymbol_input.txt > mafft_anysymbol_output.txt'
system(x)

Contents of output file (mafft_anysymbol_output.txt):

>Seq1
ABCDEFG--
>Seq2
-ACDGH---
>Seq3
--BDEFEGI
>Seq4
----AH---
>Seq5
---DEGHI-
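If the sequences already live in an R character vector, the input file can be generated and the aligned output parsed without leaving R. The following is a minimal base-R sketch, not part of the original answer: the helper names write_fasta and read_fasta are hypothetical, the reader assumes every record has at least one sequence line, and the MAFFT call is skipped when no mafft executable is found on the PATH.

```r
# Write a character vector of sequences to a FASTA file.
# (write_fasta / read_fasta are hypothetical helper names.)
write_fasta <- function(seqs, path, ids = paste0("Seq", seq_along(seqs))) {
  # rbind + as.vector interleaves header and sequence lines
  writeLines(as.vector(rbind(paste0(">", ids), seqs)), path)
  invisible(path)
}

# Read a FASTA file into a named character vector, joining
# sequences that are wrapped over several lines.
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)  # record number of every line
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  setNames(as.vector(seqs), sub("^>", "", lines[is_header]))
}

v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")
infile <- write_fasta(v1, tempfile(fileext = ".txt"))
outfile <- tempfile(fileext = ".txt")

# Only call MAFFT if it is installed and on the PATH
if (nzchar(Sys.which("mafft"))) {
  system2("mafft", c("--anysymbol", shQuote(infile)), stdout = outfile)
  aligned <- read_fasta(outfile)
  print(aligned)
}
```

Using system2 with stdout set to a file name plays the same role as the shell redirection in the command string above, and temporary files avoid depending on the working directory.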

Edit: I see now that you are familiar with biological alignment tools. If you want a customized scoring matrix for your text alignments, look at the MAFFT options --text and --textmatrix. They require ASCII-code input (so some extra data type conversion), but they let you score chosen pairs of symbols as similar, however you choose to define similarity. For example, you could associate upper- and lower-case letters, or letters with and without accent marks.
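Since the question has 34 states, each state first needs its own single-character symbol, and the ASCII code of each symbol is what a custom matrix would refer to. The sketch below shows one way to do that conversion in base R. The state labels and the name encode_events are hypothetical, and the choice of upper-case letters plus digits is only an assumption: whether a given MAFFT mode treats all of these characters as distinct symbols, and the exact layout of the --textmatrix file, should be checked against the MAFFT documentation.

```r
# Hypothetical state labels; replace with the real 34 event names.
states <- sprintf("state%02d", 1:34)

# One printable ASCII character per state: 26 upper-case letters,
# then digits (an arbitrary choice made for this sketch).
symbols <- c(LETTERS, as.character(0:9))[seq_along(states)]
codebook <- setNames(symbols, states)

# Encode one ordered sequence of events as a single string.
encode_events <- function(events) paste(codebook[events], collapse = "")

# Hexadecimal ASCII code of every symbol, e.g. "A" -> "0x41".
hex_codes <- sprintf("0x%02X", vapply(symbols, utf8ToInt, integer(1), USE.NAMES = FALSE))

encode_events(c("state01", "state03", "state27"))
# [1] "AC0"
hex_codes[1:3]
# [1] "0x41" "0x42" "0x43"
```

Keeping the codebook as a named vector makes it easy to translate aligned strings back into event names afterwards.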

Answer 2:

Assuming the events are drawn from LETTERS, one option is to use str_match to look up each letter in each sequence, replace the resulting NA values with -, and paste the pieces back together:

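library(stringr)
library(tidyr)  # replace_na() comes from tidyr, not dplyr
# For each sequence, look up every letter A-Z; a letter that is absent gives NA
f1 <- Vectorize(function(x) str_match(x, LETTERS))
out1 <- f1(v1)
# Drop letters that occur in no sequence, replace NA with '-', paste per sequence
do.call(paste0, as.data.frame(t(replace_na(out1[!!rowSums(!is.na(out1)),], '-'))))
#[1] "ABCDEFG--" "A-CD--GH-" "-B-DEFG-I" "A------H-" "---DE-GHI"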

It can also be done with match after splitting each string into characters:

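# Split each sequence into single characters
lst <- strsplit(v1, "")
# Index in LETTERS of the highest final letter, used as the alignment width
mx <- match(max(sapply(lst, tail, 1)), LETTERS)
# Put each letter in its alphabet slot; absent letters become '-'
sapply(lst, function(x) paste(tidyr::replace_na(x[match(LETTERS[seq_len(mx)], x)], '-'), collapse=""))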

data

v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")
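The same presence/absence layout can be sketched in base R with no extra packages, taking the alphabet from the data rather than from LETTERS so that it is not limited to 26 symbols. This is a variant written for illustration, not the answerer's code. Like the output shown above, it gives each symbol one fixed column in sorted order, so a repeated event (the second E in BDEFEGI) is collapsed and the original event order is not used; it is not a gap-penalised alignment.

```r
v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")

# Split each sequence into single characters
lst <- strsplit(v1, "")

# Alphabet = every symbol that occurs in at least one sequence, sorted
alphabet <- sort(unique(unlist(lst)))

# One column per symbol: the symbol itself if present, '-' if absent
aligned <- vapply(
  lst,
  function(x) paste(ifelse(alphabet %in% x, alphabet, "-"), collapse = ""),
  character(1)
)
aligned
# [1] "ABCDEFG--" "A-CD--GH-" "-B-DEFG-I" "A------H-" "---DE-GHI"
```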

Source: https://stackoverflow.com/questions/55776078/alignment-of-multiple-non-biological-discrete-state-sequences

Tags

sequence-alignment