I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (toxoplasma gondii parasi
This hints of the longest common subsequence problem. The problem with string similarity here is that you need to test against a continuous string of 230000 sequences; so if you are comparing one of your 25 sequences to the continuous string you'll get a very low similarity.
If you compute the longest common subsequence between your 25 sequences and the continuous string, you'll know if it is in the string if the lengths are the same.