If I have a list (or array, dictionary, ...) in Python that could exceed the available memory address space (32-bit Python), what are the options and their relative speeds?
What about a document-oriented database?
There are several alternatives; I think the best-known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last one has the advantage of Python bindings provided directly by the main project, without requiring any additional module.
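For instance, with MongoDB a minimal sketch using the pymongo driver might look like the following (the database, collection, and field names are made up, and a local mongod is assumed to be running):

from pymongo import MongoClient   # third-party: pip install pymongo

client = MongoClient()                                # connects to a local mongod on the default port
items = client.mydb.items                             # hypothetical database and collection names
items.insert_many([{'n': i} for i in range(1000)])    # documents are stored by the server, not in your process
for doc in items.find({'n': {'$gt': 995}}):
    print(doc['n'])

The data stays on the database server, so your 32-bit Python process only ever holds the documents it is currently iterating over.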
The answer is very much "it depends".
What are you storing in the lists? Strings? Integers? Objects?
How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?
If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.
If you are storing objects of variable size such as strings, then maybe keep an in-memory index of the starting offset of each string, so you can seek to and read any one of them quickly.
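A minimal sketch of that idea, assuming newline-free strings and a hypothetical file name:

import os

offsets = []                              # in-memory index: byte offset of each stored string

def append_string(f, s):
    f.seek(0, os.SEEK_END)                # always append at the end of the file
    offsets.append(f.tell())
    f.write(s.encode('utf-8') + b'\n')

def read_string(f, i):
    f.seek(offsets[i])                    # jump straight to the i-th string
    return f.readline().rstrip(b'\n').decode('utf-8')

f = open('strings.dat', 'w+b')
append_string(f, 'hello')
append_string(f, 'world')
print(read_string(f, 1))                  # 'world'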
If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.
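For example, a minimal sketch with the standard library dbm module (anydbm on Python 2); the file name is made up:

import dbm

db = dbm.open('cache.db', 'c')    # 'c' creates the file if it does not exist
db['item:42'] = '12345'           # keys and values end up as bytes on disk
print(db['item:42'])              # b'12345'
db.close()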
If you want random-access writing then an SQL database may be better.
Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.
edit: From your updated requirements I would go with a flat file and keep an in-memory buffer of the last N elements.
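A minimal sketch of that approach (the file name and buffer size are made up):

from collections import deque

N = 1000                                # how many recent elements to keep in memory
recent = deque(maxlen=N)                # old entries fall off automatically
log = open('elements.txt', 'a')         # every element also goes to the flat file

def add(value):
    recent.append(value)
    log.write('%r\n' % (value,))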
If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few millions of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array
has very fast fromfile
and tofile
methods to facilitate the moving of data back and forth.
I.e., basically, assuming for example unsigned-long numbers, something like:
import array
import os

# no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
    def __init__(self):
        # the bulk of the data lives in this binary file; the in-memory
        # array holds only the "tip" currently being worked on
        self.f = open('afile.dat', 'w+b')
        self.a = array.array('L')
    def append(self, n):
        self.a.append(n)
        if len(self.a) > MAXINMEM:
            # spill the full in-memory chunk to disk and start a fresh one
            self.a.tofile(self.f)
            del self.a[:]
    def pop(self):
        if not len(self.a):
            # refill the in-memory tip with the last chunk saved on disk
            try: self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            except IOError: return self.a.pop() # ensure normal IndexError &c
            try: self.a.fromfile(self.f, MAXINMEM)
            except EOFError: pass
            # chop the chunk just read back off the end of the file
            self.f.seek(-self.a.itemsize * MAXINMEM, os.SEEK_END)
            self.f.truncate()
        return self.a.pop()
Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need, this should serve.
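For illustration, a hypothetical use of the class above (with a much smaller MAXINMEM the spill-to-disk path gets exercised too):

big = bigarray()
for i in range(1000):
    big.append(i)
print(big.pop())    # 999, served straight from the in-memory tip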
You can try blist: https://pypi.python.org/pypi/blist/
The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists.
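A small sketch of what using it looks like (blist is a third-party package, and the sizes and indices here are arbitrary):

from blist import blist    # third-party: pip install blist

x = blist([0])
x *= 2 ** 20               # building a large blist is cheap (copy-on-write)
x[500000] = 42             # modifications on large lists stay roughly O(log n)
print(len(x), x[500000])

Note that blist still keeps everything in memory, so it helps with the cost of modifying huge lists rather than with the 32-bit address-space limit itself.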
You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
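For example, if the end goal is to compute something over the values rather than to keep them all around, a generator holds only one element at a time (the function and numbers here are purely illustrative):

def squares(n):
    for i in range(n):          # use xrange on Python 2
        yield i * i

total = sum(squares(10 ** 7))   # never materializes the full list in memory
print(total)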
There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.
One possibility is to structure your data relationally, and store it in a SQLite database.
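A minimal sketch with the standard library sqlite3 module (the file, table, and column names are made up):

import sqlite3

conn = sqlite3.connect('data.db')     # hypothetical database file on disk
conn.execute('CREATE TABLE IF NOT EXISTS items (value INTEGER)')
conn.executemany('INSERT INTO items (value) VALUES (?)',
                 ((i,) for i in range(100000)))    # rows live on disk, not in a Python list
conn.commit()
for (value,) in conn.execute('SELECT value FROM items WHERE value > 99995'):
    print(value)
conn.close()

You trade some speed for it, but SQLite gives you indexing and random-access queries without ever holding the whole dataset in memory.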