I have a file on disk that's only 168 MB. It's just a comma-separated list of word,id pairs. Each word can be 1-5 characters long. There are 6.5 million lines.
I created a dict from it, but the memory usage exploded.
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
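For example, here is a minimal sketch of that conversion, assuming Python 2 and a source file named words.csv with one word,id pair per line (the file names are just placeholders):

import anydbm

db = anydbm.open('words.db', 'c')       # 'c' creates the database file if it does not exist
with open('words.csv') as f:
    for line in f:
        word, word_id = line.rstrip('\n').split(',')
        db[word] = word_id              # dbm keys and values are plain strings
db.close()

# later: reopen read-only and look up an id without holding 6.5M objects in RAM
db = anydbm.open('words.db', 'r')
print db['hello']                       # prints the id stored for 'hello', if that word exists
db.close()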
The reason the memory explodes is that Python keeps extra metadata for every object, and the dict also has to build a hash table on top of that (which requires even more memory). You created so many objects (6.5M) that this per-object overhead becomes huge.
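To see roughly where the memory goes, you can measure it yourself with sys.getsizeof; the numbers in the comments are approximate and depend on your CPython build:

import sys

print sys.getsizeof('word')     # ~40 bytes for a 4-character str, not 4
print sys.getsizeof('123456')   # an id stored as a string carries the same overhead
d = dict(('word%d' % i, '%d' % i) for i in xrange(1000000))
print sys.getsizeof(d)          # the hash table alone is tens of megabytes, before counting keys and values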
import bsddb

a = bsddb.btopen('a.bdb')   # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
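As a quick check, you can reopen the file created above and read a value back:

a = bsddb.btopen('a.bdb')
print a['word42']   # prints '42'
a.close()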
This code takes only 1 second to run, so I think the speed is OK (since you said 10500 lines per second). btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.
With the xrange input set to 6.5M and using btopen, I got an output file of 417,080 KB, and the insertion took around 1 or 2 minutes to complete. So I think it's totally suitable for you.