I need to download only complete genome sequences from NCBI, in GenBank (full) format. I am interested in 'complete genome' records, not 'whole genome' ones.
My script:
from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
gatunek='Escherichia[ORGN]'
handle = Entrez.esearch(db='nucleotide',
term=gatunek, property='complete genome' )#title='complete genome[title]')
result = Entrez.read(handle)
As a result I get only small genome fragments, about 484 bp in size:
LOCUS NZ_KE350773 484 bp DNA linear CON 23-AUG-2013
DEFINITION Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome
shotgun sequence.
I know how to do it manually on the NCBI website, but that is very time-consuming. The query I use there is:
escherichia[orgn] AND complete genome[title]
As a result I get multiple genomes of about 5,154,862 bp each, and this is what I need to get via Entrez.esearch.
Your question is clear, but the full answer is long. The code I provide generates a .fasta file for each of the E. coli genome sequences you want, and only the "Complete Genomes" in NCBI.
You will see there are only six complete E. coli reference genomes in NCBI (http://www.ncbi.nlm.nih.gov/genome/167):

To help you, here are the GenBank/RefSeq links to their genomes:
Here is my code for Complete Genome Sequence Parsing into .FASTA files...
# Imports
from Bio import Entrez
from Bio import SeqIO
#############################
# Retrieve NCBI Data Online #
#############################
Entrez.email = "asiak@wp.pl" # Always tell NCBI who you are
genomeAccessions = ['NC_000913', 'NC_002695', 'NC_011750', 'NC_011751', 'NC_017634', 'NC_018658']
# OR the accessions together so the search matches any of them
search = " OR ".join(acc + "[ACCN]" for acc in genomeAccessions)
result = Entrez.read(Entrez.esearch(db="nucleotide", term=search, retmode="xml"))
genomeIds = result['IdList']
records = Entrez.efetch(db="nucleotide", id=genomeIds, rettype="gb", retmode="text")
###############################
# Generate Genome Fasta files #
###############################
sequences = [] # store your sequences in a list
headers = [] # store genome names in a list (db_xref ids)
# SeqIO.parse splits the downloaded text into one record per genome
for i, genome in enumerate(SeqIO.parse(records, "genbank")):
    SeqIO.write(genome, "genBankRecord_"+str(i)+".gb", "genbank") # store each genome's .gb in a separate file
    header = genome.features[0].qualifiers['db_xref'][0] # name the genome using its db_xref ID
    sequence = str(genome.seq) # obtain genome sequence
    headers.append('>'+header) # store genome name in list
    sequences.append(sequence) # store sequence in list
    fasta_out = open("genome"+str(i)+".fasta", "w") # store each genome's .fasta in a separate file
    fasta_out.write('>'+header+'\n') # >header ... followed by:
    fasta_out.write(sequence+'\n') # sequence ...
    fasta_out.close() # close that .fasta file and move on to next genome
records.close()
Let me know how it goes! Andy
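The FASTA-writing step in the code above puts each genome on a single, very long line. If line-wrapped output is preferred, a minimal sketch in plain Python could look like the following; the function name `format_fasta` and the 60-character default width are assumptions made for illustration, not part of the answer above.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA entry with the sequence wrapped at `width` characters."""
    # Slice the sequence into fixed-width chunks; the last chunk may be shorter.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Tiny illustrative input, not real genome data.
    print(format_fasta("example_header", "ACGTACGTAC", width=4), end="")
```

Inside the loop, `fasta_out.write(format_fasta(header, sequence))` would then replace the two separate `write` calls.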
You've done the hard part and worked out the query,
escherichia[orgn] AND complete genome[title]
So use that as the search query via Biopython as well!
from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
search_term = "escherichia[orgn] AND complete genome[title]"
handle = Entrez.esearch(db='nucleotide', term=search_term)
result = Entrez.read(handle)
handle.close()
print(result['Count']) # total number of matching records
Currently that gives me 140 results, starting with 545778205, which is the same as the website: http://www.ncbi.nlm.nih.gov/nuccore/?term=escherichia%5Borgn%5D+AND+complete+genome%5Btitle%5D
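Note that `esearch` returns only the first 20 IDs by default, so `result['IdList']` is shorter than `result['Count']` here. A minimal sketch of paging through all hits with the `retstart` and `retmax` parameters follows; the helper names `page_starts` and `fetch_all_ids` and the page size of 50 are assumptions chosen for illustration, and the second function needs network access.

```python
def page_starts(count, page_size):
    """Return the retstart offsets needed to cover `count` hits."""
    return list(range(0, count, page_size))


def fetch_all_ids(search_term, email, page_size=50):
    """Collect every ID matching search_term (hypothetical helper, needs network)."""
    # Imported here so page_starts() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    # First request: only used to learn the total number of hits.
    handle = Entrez.esearch(db="nucleotide", term=search_term)
    count = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_starts(count, page_size):
        handle = Entrez.esearch(db="nucleotide", term=search_term,
                                retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

It could be called as `fetch_all_ids("escherichia[orgn] AND complete genome[title]", "asiakXX@wp.pl")`.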
This works for me...
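import os
from Bio import Entrez
Entrez.email = "asiakXX@wp.pl" # always tell NCBI who you are
search_term = 'escherichia coli[orgn] AND complete genome[title]'
handle = Entrez.esearch(db='nucleotide', term=search_term)
genome_ids = Entrez.read(handle)['IdList']
handle.close()
os.makedirs('generated', exist_ok=True) # the output directory must exist
for genome_id in genome_ids:
    # download one GenBank record per ID and save it to its own file
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = 'generated/genBankRecord_{}.gb'.format(genome_id)
    print('Writing:{}'.format(filename))
    with open(filename, 'w') as f:
        f.write(record.read())
    record.close()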
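The loop above makes one `efetch` request per ID. As a possible variation, several IDs can be sent in one request as a comma-separated string, which reduces the number of calls to NCBI. The sketch below assumes `Entrez.email` has already been set; the names `chunked` and `download_in_batches` and the batch size of 10 are hypothetical choices, and all records end up in a single file rather than one file per genome.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_in_batches(genome_ids, out_path, batch_size=10):
    """Write GenBank text for all IDs to one file (hypothetical helper, needs network)."""
    # Imported here so chunked() stays usable without Biopython installed.
    from Bio import Entrez

    with open(out_path, "w") as out:
        for batch in chunked(genome_ids, batch_size):
            handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                                   rettype="gb", retmode="text")
            out.write(handle.read())
            handle.close()
```

The resulting multi-record file can be read back with `SeqIO.parse(out_path, "genbank")`. If a record comes back without its sequence, `rettype="gbwithparts"` may be worth trying, since it appears to correspond to the website's GenBank (full) view.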
Source: https://stackoverflow.com/questions/18461629/how-to-download-complete-genome-sequence-in-biopython-entrez-esearch