How to store this collection of documents?

主宰稳场 Submitted on 2019-12-24 17:43:25

Question


The dataset is like this:

39861    // number of documents
28102    // number of words of the vocabulary (another file)
3710420  // number of nonzero counts in the bag-of-words
1 118 1  // document_id index_in_vocabulary count
1 285 3
...
2 46 1
...
39861 27196 5

We are advised not to store it in a matrix (of size 39861 x 28102, documents x vocabulary words), since it won't fit in memory*, and from here I can assume that every integer needs 24 bytes to be stored, thus roughly 27 GB (= 39861 * 28102 * 24 bytes) for a dense matrix. So, which data structure should I use to store the dataset?


An array of lists?

  • If so (every list would hold nodes with two data members, the index_in_vocabulary and the count), just post a positive answer; a sketch of this layout follows this list. If I assume that every document has on average 200 words, then the space would be:

no_of_documents * words_per_doc * no_of_datamembers * 24 bytes = 39861 * 200 * 2 * 24 ≈ 0.4 GB

  • If not, which structure would you propose (one that requires less space)?
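
Here is a minimal sketch of that array-of-lists layout in C++, reading the file format shown above (a three-line header followed by one document_id index_in_vocabulary count triple per line). The file name and the Entry struct are my own placeholders, not something fixed by the assignment:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

struct Entry {
    int32_t word;   // index_in_vocabulary
    int32_t count;  // occurrences of that word in the document
};

int main() {
    std::ifstream in("docword.txt");              // placeholder file name
    std::size_t n_docs, n_vocab, n_nonzero;
    in >> n_docs >> n_vocab >> n_nonzero;

    // "Array of lists": one vector of (word, count) entries per document.
    std::vector<std::vector<Entry>> docs(n_docs);

    for (std::size_t i = 0; i < n_nonzero; ++i) {
        int32_t doc, word, count;
        in >> doc >> word >> count;
        docs[doc - 1].push_back({word, count});   // ids in the file are 1-based
    }

    // Two 4-byte ints per nonzero entry: about 3710420 * 8 bytes ~= 30 MB,
    // far below the ~27 GB dense-matrix estimate.
    std::cout << "loaded " << docs.size() << " documents\n";
}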

After storing the dataset, we are required to find the k nearest neighbors of a document (its k most similar documents), first with brute force and then with LSH (locality-sensitive hashing).


*I have 3.8 GiB of RAM on my personal laptop, but I have access to a desktop with ~8 GB.
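
For the brute-force k-nearest-neighbours pass, this is a rough sketch of what I have in mind over the same per-document entry lists. Cosine similarity is my assumption (the assignment does not fix a measure), all names are mine, and each document's entries are assumed to be sorted by vocabulary index, as they appear to be in the file:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Entry { int32_t word; int32_t count; };   // same layout as in the loader sketch

// Sparse dot product of two documents whose entries are sorted by word index.
static double dot(const std::vector<Entry>& a, const std::vector<Entry>& b) {
    double s = 0.0;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].word < b[j].word)       ++i;
        else if (a[i].word > b[j].word)  ++j;
        else s += double(a[i++].count) * double(b[j++].count);
    }
    return s;
}

static double norm(const std::vector<Entry>& a) {
    double s = 0.0;
    for (const Entry& e : a) s += double(e.count) * double(e.count);
    return std::sqrt(s);
}

// Brute force: cosine similarity of docs[query] against every other document,
// returning the indices of the k most similar ones.
std::vector<std::size_t> knn_brute_force(const std::vector<std::vector<Entry>>& docs,
                                         std::size_t query, std::size_t k) {
    const double qnorm = norm(docs[query]);
    std::vector<std::pair<double, std::size_t>> sims;   // (similarity, doc index)
    for (std::size_t d = 0; d < docs.size(); ++d) {
        if (d == query) continue;
        const double denom = qnorm * norm(docs[d]);
        sims.push_back({denom > 0.0 ? dot(docs[query], docs[d]) / denom : 0.0, d});
    }
    k = std::min(k, sims.size());
    std::partial_sort(sims.begin(), sims.begin() + k, sims.end(),
                      [](const auto& x, const auto& y) { return x.first > y.first; });
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < k; ++i) out.push_back(sims[i].second);
    return out;
}

For the LSH part I would keep the same storage and only replace the full scan with a lookup among documents that land in the same hash buckets, but that is beyond this sketch.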

Source: https://stackoverflow.com/questions/36943793/how-to-store-this-collection-of-documents
