I have a file on disk that's only 168 MB. It's just a comma-separated list of word,id pairs. Each word can be 1-5 characters long. There are 6.5 million lines.
I created a dict from it, but the memory usage exploded.
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.
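For example, here is a minimal sketch of that conversion, assuming Python 2 and a source file named words.csv with one word,id pair per line (the file names are just placeholders):

import anydbm

db = anydbm.open('words.db', 'c')       # 'c' creates the database file if it does not exist
with open('words.csv') as f:
    for line in f:
        word, word_id = line.rstrip('\n').split(',')
        db[word] = word_id              # dbm keys and values are plain strings
db.close()

# later: reopen read-only and look up an id without holding 6.5M objects in RAM
db = anydbm.open('words.db', 'r')
print db['hello']                       # prints the id stored for 'hello', if that word exists
db.close()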
The reason the memory explodes is that Python keeps extra metadata for every object, and the dict also has to build a hash table on top of that (which requires even more memory). You created so many objects (6.5M) that this per-object overhead becomes huge.
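To see roughly where the memory goes, you can measure it yourself with sys.getsizeof; the numbers in the comments are approximate and depend on your CPython build:

import sys

print sys.getsizeof('word')     # ~40 bytes for a 4-character str, not 4
print sys.getsizeof('123456')   # an id stored as a string carries the same overhead
d = dict(('word%d' % i, '%d' % i) for i in xrange(1000000))
print sys.getsizeof(d)          # the hash table alone is tens of megabytes, before counting keys and values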
import bsddb

a = bsddb.btopen('a.bdb')   # you can also try bsddb.hashopen
for x in xrange(10500):
    a['word%d' % x] = '%d' % x
a.close()
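As a quick check, you can reopen the file created above and read a value back:

a = bsddb.btopen('a.bdb')
print a['word42']   # prints '42'
a.close()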
This code takes only 1 second to run, so I think the speed is OK (since you said 10500 lines per second). btopen creates a db file 499,712 bytes long, and hashopen creates one of 319,488 bytes.
With the xrange input set to 6.5M and using btopen, I got an output file of 417,080 KB, and the insertion took around 1 or 2 minutes to complete. So I think it's totally suitable for you.