Question
I have a use case where I need to build a list from the lines in a file. This operation will be performed potentially 100s of times on a distributed network. I've been using the obvious solution of:
    with open("file.txt") as f:
        ds = f.readlines()
I just had the thought that perhaps I would be better off creating this list once, pickling it into a file and then using that file to unpickle the data on each node.
Would there be any performance increase if I did this?
Answer 1:
Would there be any performance increase if I did this?
Test it and see!
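    try:
        import cPickle as pickle  # Python 2: C-accelerated pickle
    except ImportError:
        import pickle  # Python 3: the C accelerator is used automatically
    import timeit

    def lines():
        # Build the list by reading the text file directly.
        with open('lotsalines.txt') as f:
            return f.readlines()

    def pickles():
        # Build the list by unpickling a previously dumped copy.
        with open('lotsalines.pickle', 'rb') as f:
            return pickle.load(f)

    # Create the pickle file once, timing the dump (protocol -1 = highest available).
    ds = lines()
    with open('lotsalines.pickle', 'wb') as f:
        t = timeit.timeit(lambda: pickle.dump(ds, f, -1), number=1)
    print('pickle.dump: {}'.format(t))

    # Time 10 runs of each loading strategy.
    print('readlines: {}'.format(timeit.timeit(lines, number=10)))
    print('pickle.load: {}'.format(timeit.timeit(pickles, number=10)))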
My 'lotsalines.txt' file is just that source duplicated until it's 655360 lines long, or 15532032 bytes.
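A test file like this could be produced along the following lines. This is only a sketch of one possible setup: the helper name build_test_file and the file paths are hypothetical and are not part of the benchmark above.

```python
import os
import tempfile

def build_test_file(source_path, dest_path, target_lines):
    """Repeat the lines of source_path into dest_path until it has target_lines lines.

    Hypothetical helper for illustration; not part of the original benchmark.
    """
    with open(source_path) as src:
        seed = src.readlines()
    if not seed:
        raise ValueError('source file is empty')
    # Ensure the last line ends with a newline so repeated copies don't run together.
    if not seed[-1].endswith('\n'):
        seed[-1] += '\n'
    written = 0
    with open(dest_path, 'w') as dst:
        while written < target_lines:
            # Take only as many lines as are still needed to hit the target.
            chunk = seed[:target_lines - written]
            dst.writelines(chunk)
            written += len(chunk)
    return written

# Small demonstration in a temporary directory (paths are placeholders).
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, 'seed.txt')
demo_dst = os.path.join(demo_dir, 'lotsalines.txt')
with open(demo_src, 'w') as f:
    f.write('import timeit\nprint("hello")\n')
print(build_test_file(demo_src, demo_dst, 7))
```

For the numbers in this answer, the target would be 655360 lines, with the benchmark script itself as the seed.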
Apple Python 2.7.2:
readlines: 0.640027999878
pickle.load: 2.67698192596
And the pickle file is 19464748 bytes.
Python.org 3.3.0:
readlines: 1.5357899703085423
pickle.load: 1.5975534357130527
And it's 20906546 bytes.
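So, Python 3 has sped up pickle quite a bit relative to Python 2, at least if you use pickle protocol 3, but unpickling still doesn't beat a simple readlines: it's about 4x slower in 2.7 and roughly on par in 3.3. (Note that readlines itself has gotten a lot slower in 3.x, and the documentation points out that you can iterate over the file without calling it.)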
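To check how much the protocol matters on a particular interpreter, something like the rough sketch below can time pickle.loads for every protocol that interpreter supports. The function name time_protocols and the sample data are assumptions made for illustration, and the round trip happens in memory, so disk I/O is deliberately left out of the comparison.

```python
import pickle
import timeit

def time_protocols(data, number=10):
    """Return {protocol: (pickled size in bytes, seconds for `number` loads)}.

    Hypothetical helper; it measures in-memory loads only, not file reads.
    """
    results = {}
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
        blob = pickle.dumps(data, protocol)
        # timeit runs immediately, so the lambda sees the current blob.
        seconds = timeit.timeit(lambda: pickle.loads(blob), number=number)
        results[protocol] = (len(blob), seconds)
    return results

# Sample data standing in for the list of lines (an assumption, not the real file).
sample = ['line {}\n'.format(i) for i in range(10000)]
for protocol, (size, seconds) in sorted(time_protocols(sample).items()):
    print('protocol {}: {} bytes, {:.4f}s'.format(protocol, size, seconds))
```

The absolute timings will differ from the file-based numbers above, so only the relative ordering between protocols is meaningful here.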
But really, if you've got performance concerns, you should consider whether you need the list in the first place. A quick test (timing list(range(655360)) in 3.x, list(xrange(655360)) in 2.x) shows that just building a list of this size costs almost half as much as the readlines. The list also uses a ton of memory, which is probably why it's slow. If you don't actually need the list, and usually you don't, just iterate over the file, getting lines as you need them.
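As a minimal sketch of that last suggestion, the file object can be consumed one line at a time, so the full list is never built. The function name count_nonblank and the per-line work it does are placeholders, not something taken from the question.

```python
import os
import tempfile

def count_nonblank(path):
    """Count non-blank lines without ever building a list of all lines."""
    count = 0
    with open(path) as f:
        for line in f:  # the file object yields one line at a time
            if line.strip():
                count += 1
    return count

# Small demonstration on a temporary file (the path is a placeholder).
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('first\n\nsecond\nthird\n')
print(count_nonblank(demo_path))  # prints 3
os.remove(demo_path)
```

Whether this helps depends on what each node does with the lines: if every line is processed once and discarded, iteration avoids both the list-building cost and the memory it holds.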
Source: https://stackoverflow.com/questions/14900232/unpickle-a-data-structure-vs-build-by-calling-readlines