How would I go about parsing a text file of thousands of DNA bases?

Question

Here is what I have: a massive text file of DNA bases (A, T, C, G). I would like to take every 60 characters (an arbitrary number) and put them on a new line, so that the bases are separated into chunks. I would also like each chunk to overlap the previous one by a certain number of bases. For example, given the 10-letter sequence ATGGCTGCTA, a chunk size of 4, and an overlap of 2, the first chunk would be ATGG, the next GGCT, then CTGC, and so on. I know I'll probably have to look into opening, reading, and writing text files with Python. If anyone has resources they could point me towards, or any tips and instructions, that would be great.

Example of the formatting of the text I would be working with:

https://www.ncbi.nlm.nih.gov/nuccore/NC_000017.11?report=fasta&from=7661779&to=7687550

Answer 1:

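data = 'GAGACAGAGTCTCACTCTGTTGCACAGGCTGGAGTGCAGTGGCACAATCTCTGCTCACTGCAACCTCCTC'
chunk_size = 5  # number of bases per output line
overlap = 2     # bases shared between consecutive chunks

# Advance by (chunk_size - overlap) so each chunk starts inside the previous one.
for pos in range(0, len(data), chunk_size - overlap):
    # Slicing past the end is safe; the last chunks may simply be shorter.
    print(data[pos:pos+chunk_size])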

The results:

GAGAC
ACAGA
GAGTC
TCTCA
CACTC
TCTGT
...
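
To apply the same idea to a file like the one linked in the question, here is a minimal sketch. It assumes the input is a single-record FASTA file (one header line starting with ">" followed by lines of bases). The function names `read_fasta_sequence`, `overlapping_chunks`, and `write_chunks`, and the default `overlap=10`, are made up for illustration and are not part of the original answer.

```python
def read_fasta_sequence(path):
    """Return the bases of a single-record FASTA file as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the ">" header line.
            if not line or line.startswith(">"):
                continue
            parts.append(line)
    return "".join(parts)


def overlapping_chunks(sequence, chunk_size, overlap):
    """Yield chunks of up to chunk_size bases, each sharing `overlap` bases with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for pos in range(0, len(sequence), step):
        yield sequence[pos:pos + chunk_size]
        # Stop once a chunk has reached the end of the sequence,
        # so no extra trailing fragments are emitted.
        if pos + chunk_size >= len(sequence):
            break


def write_chunks(in_path, out_path, chunk_size=60, overlap=10):
    """Read a FASTA file and write one overlapping chunk per line."""
    sequence = read_fasta_sequence(in_path)
    with open(out_path, "w") as out:
        for chunk in overlapping_chunks(sequence, chunk_size, overlap):
            out.write(chunk + "\n")
```

Unlike the plain loop above, this sketch stops as soon as a chunk reaches the end of the sequence, so the final line may be shorter than `chunk_size` but is not followed by further fragments. For the example in the question, `list(overlapping_chunks("ATGGCTGCTA", 4, 2))` gives `['ATGG', 'GGCT', 'CTGC', 'GCTA']`.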

Source: https://stackoverflow.com/questions/50845282/how-would-i-go-about-parsing-a-text-file-of-thousands-of-dna-bases

Tags

python

parsing

formatting
