Huge memory usage of loading large dictionaries in memory

被撕碎了的回忆 2020-12-04 08:54

I have a file on disk that's only 168MB. It's just a comma-separated list of word,id. The word can be 1-5 characters long. There are 6.5 million lines.

I created a

4 Answers
  •  忘掉有多难
    2020-12-04 09:24

    Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb, ...), and then use the dbm API to access it.
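
    For a word,id file like yours this is a one-pass, line-by-line insert. A minimal sketch, assuming the input file is named words.csv and the output file words.bdb (both names are placeholders, not from the question):

    import bsddb

    db = bsddb.hashopen('words.bdb', 'c')       # 'c' creates the file if it does not exist
    with open('words.csv') as f:
        for line in f:
            word, word_id = line.rstrip('\n').split(',')
            db[word] = word_id                  # bsddb keys and values must be strings
    db.close()

    # Later: reopen read-only and look words up without loading anything into RAM.
    db = bsddb.hashopen('words.bdb', 'r')
    if db.has_key('hello'):                     # 'hello' is just an example key
        print db['hello']
    db.close()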

    The reason the memory usage explodes is that Python stores extra metadata for every object, and the dict also has to build a hash table (which needs yet more memory). You created so many objects (6.5M keys plus 6.5M values) that the per-object overhead alone becomes huge.
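
    A rough illustration of that overhead (the numbers below are typical of a 64-bit CPython 2 build, not measurements from this post):

    import sys

    print sys.getsizeof('word1')   # a 5-character str costs ~40 bytes, not 5
    print sys.getsizeof(12345)     # a plain int object costs ~24 bytes, not 4
    print sys.getsizeof({})        # even an empty dict starts at a few hundred bytes
    # 6.5M key objects + 6.5M value objects + the dict's table of pointers
    # add up to far more than the 168MB the raw text file occupies.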

    import bsddb
    a = bsddb.btopen('a.bdb')        # on-disk B-tree file; you can also try bsddb.hashopen
    for x in xrange(10500):
        a['word%d' % x] = '%d' % x   # dict-style insert, written to disk
    a.close()

    The bsddb snippet above takes only about 1 second to run, so the speed should be OK (you said roughly 10,500 lines per second). btopen produces a db file of 499,712 bytes, and hashopen produces one of 319,488 bytes.

    With the xrange count raised to 6.5M and using btopen, I got an output file of 417,080 KB and the insertion took around 1 or 2 minutes to complete, so I think it's entirely suitable for your case.
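
    As a quick follow-up check (hypothetical, not something measured in this answer), you can reopen the 6.5M-entry file read-only and do random lookups; resident memory stays flat because only the pages actually touched are read from disk:

    import bsddb, random

    db = bsddb.btopen('a.bdb', 'r')                        # read-only; same file as above
    for _ in xrange(1000):
        value = db['word%d' % random.randrange(6500000)]   # each lookup walks the B-tree on disk
    db.close()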
