How to get pair-wise “sequence similarity score” for ~1000 proteins?

问题

I have a large number of protein sequences in fasta format.

I want to get the pair-wise sequence similarity score for each pairs of the proteins.

Any package in R could be used to get the blast similarity score for protein sequences?

回答1:

As per Chase's suggestion, bioconductor is indeed the way to go and in particular the Biostrings package. To install the latter I would suggest installing the core bioconductor library as such:

source("http://bioconductor.org/biocLite.R")
biocLite()

This way you will cover all dependencies. Now, to align 2 protein sequences or any two sequences for that matter you will need to use pairwiseAlignment from Biostrings. Given a fasta protseq.fasta file of 2 sequences that looks like this:

>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

If you want to globally align these 2 sequences using lets say BLOSUM100 as your substitution matrix, 0 penalty for opening a gap and -5 for extending one then:

require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("~/Desktop/protseq.fasta", strip.descs=TRUE)
alm <- pairwiseAlignment(seqs[[1]]$seq, seqs[[2]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)

The result of this is (removed some of the alignment to save space):

> alm
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] MYRALRLLARSRPLVRA-PAAALAS....
subject: [1] M-R-------SRP---AGPALLLLL.... 
score: -91

To only extract the score for each alignment:

> score(alm)
[1] -91

Given this you can easily now do all pairwise alignments with some very simple looping logic. To get a better hang of pairwise alignment using bioconductor I suggest you read this.

An alternative approach would be to do a multiple sequence alignment instead of pairwise. You could use bio3d and from there the seqaln function to align all sequences in your fasta file.

回答2:

6 years later, but:

The protr package just came out, which has a parallelized pairwise similarity scoring function, parGOsim(). It can take lists of protein sequences, so a loop wouldn't be necessary to write.

来源：https://stackoverflow.com/questions/6529675/how-to-get-pair-wise-sequence-similarity-score-for-1000-proteins

标签

Sequence

similarity