How to handle huge sparse matrices construction using Scipy?

♀尐吖头ヾ 提交于 2019-12-07 15:29:34

问题


So, I am working on a Wikipedia dump to compute the pageranks of around 5,700,000 pages give or take. The files are preprocessed and hence are not in XML.
They are taken from http://haselgrove.id.au/wikipedia.htm and the format is:

from_page(1): to(12) to(13) to(14)..
from_page(2): to(21) to(22)..
.
.
.
from_page(5,700,000): to(xy) to(xz)

so on. So. basically it's a construction of a [5,700,000*5,700,000] matrix, which would just break my 4 gigs of RAM. Since, it is very-very Sparse, that makes it easier to store using scipy.lil.sparse or scipy.dok.sparse, now my issue is:

How on earth do I go about converting the .txt file with the link information to a sparse matrix? Read it and compute it as a normal N*N matrix then convert it or what? I have no idea.

Also, the links sometimes span across lines so what would be the correct way to handle that?
eg: a random line is like..

[
1: 2 3 5 64636 867
2:355 776 2342 676 232
3: 545 64646 234242 55455 141414 454545 43
4234 5545345 2423424545
4:454 6776
]

exactly like this: no commas & no delimiters.

Any information on sparse matrix construction and data handling across lines would be helpful.


回答1:


Scipy offers several implementations of sparse matrices. Each of them has its own advantages and disadvantages. You can find information about the matrix formats here:

There are several ways to get to your desired sparse matrix. Computing the full NxN matrix and then converting is probably not possible, due high memory requirements (about 10^12 entries!).

In your case I would prepare your data to construct a coo_matrix.

coo_matrix((data, (i, j)), [shape=(M, N)])

data[:] the entries of the matrix, in any order
i[:] the row indices of the matrix entries
j[:] the column indices of the matrix entries

You might also want to have a look at lil_matrix, which can be used to incrementally build your matrix.

Once you created the matrix you can then convert it to a better suited format for calculation, depending on your use case.

I do not recognize the data format, there might be parsers for it, there might not. Writing your own parser should not be very difficult, though. Each line containing a colon starts a new row, all indices after the colon and in consecutive lines without colons are the column entries for said row.



来源:https://stackoverflow.com/questions/26563211/how-to-handle-huge-sparse-matrices-construction-using-scipy

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!