How to download complete genome sequences in Biopython with Entrez.esearch

左心房为你撑大大i, posted on 2019-12-05 20:23:03

Your question is clear, but the full answer is long. The code below writes a .fasta file for each E. coli genome you want, restricted to the entries NCBI lists as complete genomes.

NCBI lists only six complete E. coli reference genomes (http://www.ncbi.nlm.nih.gov/genome/167).

To help you, here are the GenBank/RefSeq links to those genomes:

  1. http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3

  2. http://www.ncbi.nlm.nih.gov/nuccore/NC_002695.1

  3. http://www.ncbi.nlm.nih.gov/nuccore/NC_011750.1

  4. http://www.ncbi.nlm.nih.gov/nuccore/NC_011751.1

  5. http://www.ncbi.nlm.nih.gov/nuccore/NC_017634.1

  6. http://www.ncbi.nlm.nih.gov/nuccore/NC_018658.1

Here is my code for parsing the complete genome sequences into .fasta files:

# Imports
from Bio import Entrez
from Bio import SeqIO

#############################
# Retrieve NCBI Data Online #
#############################

Entrez.email     = "asiak@wp.pl"             # Always tell NCBI who you are
genomeAccessions = ['NC_000913', 'NC_002695', 'NC_011750', 'NC_011751', 'NC_017634', 'NC_018658']
search           = " OR ".join(genomeAccessions)  # OR the accessions so the search matches any of them
handle           = Entrez.read(Entrez.esearch(db="nucleotide", term=search, retmode="xml"))
genomeIds        = handle['IdList']
records          = Entrez.efetch(db="nucleotide", id=genomeIds, rettype="gb", retmode="text")

###############################
# Generate Genome Fasta files #
###############################

sequences   = []  # store your sequences in a list
headers     = []  # store genome names in a list (db_xref ids)

# records is a single handle holding all GenBank entries, so let SeqIO split it into records
for i, genome in enumerate(SeqIO.parse(records, "genbank")):

    SeqIO.write(genome, "genBankRecord_"+str(i)+".gb", "genbank")  # store each genome's .gb in a separate file

    header   = genome.features[0].qualifiers['db_xref'][0]  # name the genome using its db_xref ID
    sequence = str(genome.seq)                              # obtain genome sequence as a plain string

    headers.append('>'+header)  # store genome name in list
    sequences.append(sequence)  # store sequence in list

    with open("genome"+str(i)+".fasta", "w") as fasta_out:  # store each genome's .fasta in a separate file
        fasta_out.write('>'+header+'\n')  # >header line, followed by:
        fasta_out.write(sequence+'\n')    # the sequence
records.close()
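
The code above writes each genome sequence as one very long line. Some tools prefer FASTA sequences wrapped to a fixed width, so here is a minimal sketch of a wrapping helper. The function name `format_fasta`, the 60-character default, and the example header are illustrative assumptions, not part of the original answer.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record with the sequence wrapped at `width` characters.

    `format_fasta` is a hypothetical helper written for this example.
    """
    lines = [">" + header]
    # Slice the sequence into consecutive chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Example with a made-up header and a short dummy sequence.
print(format_fasta("example_header", "ACGT" * 20, width=60), end="")
```

In the loop above, the two `fasta_out.write(...)` calls could then be replaced by a single `fasta_out.write(format_fasta(header, sequence))`.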

Let me know how it goes! Andy

peterjc

You've done the hard part and worked out the query,

escherichia[orgn] AND complete genome[title]

So use that as the search query via Biopython as well!

from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
search_term = "escherichia[orgn] AND complete genome[title]"
handle = Entrez.esearch(db='nucleotide', term=search_term)
result = Entrez.read(handle)
handle.close()
print(result['Count'])  # total number of matching records

Currently that gives me 140 results, starting with 545778205, which is the same as the website: http://www.ncbi.nlm.nih.gov/nuccore/?term=escherichia%5Borgn%5D+AND+complete+genome%5Btitle%5D
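
Note that `Count` is the total number of hits, while `IdList` only holds the first page of IDs (20 by default). The sketch below shows one way to page through all results with the `retstart` and `retmax` parameters of `Entrez.esearch`. The helper names `page_starts` and `fetch_all_ids`, the page size of 100, and the placeholder e-mail address are assumptions made for illustration.

```python
def page_starts(total_count, page_size):
    """Return the retstart offsets needed to cover `total_count` results."""
    return list(range(0, total_count, page_size))


def fetch_all_ids(search_term, page_size=100):
    """Collect every ID matching `search_term` by paging through esearch.

    Hypothetical helper; it needs Biopython and network access when called.
    """
    from Bio import Entrez  # imported here so page_starts works without Biopython

    Entrez.email = "you@example.org"  # placeholder, replace with your own address

    # The first request is only used to learn the total number of hits.
    handle = Entrez.esearch(db="nucleotide", term=search_term)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_starts(total, page_size):
        handle = Entrez.esearch(db="nucleotide", term=search_term,
                                retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids


# Offsets for the 140 results mentioned above, fetched 50 at a time.
print(page_starts(140, 50))  # [0, 50, 100]
```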

This works for me...

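import os
from Bio import Entrez

Entrez.email = "asiakXX@wp.pl"  # always tell NCBI who you are
search_term = 'escherichia coli[orgn] AND complete genome[title]'
handle = Entrez.esearch(db='nucleotide', term=search_term)  # returns only the first 20 IDs unless retmax is set
genome_ids = Entrez.read(handle)['IdList']
handle.close()

os.makedirs('generated', exist_ok=True)  # the output directory must exist before writing

for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")

    filename = 'generated/genBankRecord_{}.gb'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, 'w') as f:
        f.write(record.read())
    record.close()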
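
The loop above sends one `efetch` request per genome. Because `efetch` also accepts several IDs in one comma-separated string, the requests could be grouped. The following is only a sketch: the helper name `batched_ids` and the batch size are assumptions, and the IDs in the example are dummy values.

```python
def batched_ids(ids, batch_size):
    """Yield comma-separated ID strings, each holding at most `batch_size` IDs.

    `batched_ids` is a hypothetical helper written for this example.
    """
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# Dummy IDs, grouped two at a time.
for id_string in batched_ids(["111", "222", "333"], 2):
    # Each string could be passed as Entrez.efetch(db="nucleotide", id=id_string, ...)
    print(id_string)
```

Each response then contains several GenBank records, so it would need to be split with `SeqIO.parse(handle, "genbank")` rather than written out as a single file.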