A very large square matrix powered by 2

Question

I have a very large square matrix, about 570,000 x 570,000, and I want to square it (raise it to the power of 2).

The data is in JSON format and is loaded as a nested associative array (a dict inside a dict in Python).

Let's say I want to represent this matrix:

[ [0, 0, 0],
  [1, 0, 5],
  [2, 0, 0] ]

In json it's stored like:

{"3": {"1": 2}, "2": {"1": 1, "3": 5}}

For example, "3": {"1": 2} means that the number in the 3rd row and 1st column is 2. Rows and entries that are absent are zero.

I want the output in the same JSON format, but for the squared matrix (matrix multiplication, not elementwise squaring).

The programming language isn't important. I want to calculate it as fast as possible (in less than 2 days, if possible).

So I tried to use NumPy in Python (numpy.linalg.matrix_power), but it doesn't seem to work with my nested, unsorted dict format.

I wrote some simple Python code to do that, but I estimate that it would take 18 days to finish:

jsonFileName = "file.json"
def matrix_power(arr):
    result = {}
    for x1,subarray in arr.items():
        print("doing item:",x1)
        for y1,value1 in subarray.items():
            for x2,subarray2 in arr.items():
                if(y1 != x2):
                    continue
                for y2,value2 in subarray2.items():
                    partSum = value1 * value2
                    result[x1][y2] = result.setdefault(x1,{}).setdefault(y2,0) + partSum

    return result

import json
with open(jsonFileName, 'r') as reader: 
    jsonFile = reader.read()
    print("reading is successful")
    jsonArr = json.loads(jsonFile)
    print("matrix is in array form")
    matrix = matrix_power(jsonArr)
    print("Well Done! matrix is powered by 2 now")
    output = json.dumps(matrix)
    print("result is in json format")
    writer = open("output.json", 'w+')
    writer.write(output)
    writer.close()
print("Task is done! you can close this window now")

Here, x1, y1 are the row and column of an element of the first matrix, which is multiplied by each element x2, y2 of the second matrix whose row x2 equals y1.
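
As a side note on the loop above: scanning every row to find x2 == y1 can be replaced by a direct dictionary lookup, which removes one full pass over the matrix per element. A minimal sketch, where square_sparse is a hypothetical name and absent rows are assumed to be all zeros:

```python
def square_sparse(arr):
    """Square a matrix stored as {row: {col: value}} using direct row lookups."""
    result = {}
    for x1, row in arr.items():
        out_row = {}
        for y1, value1 in row.items():
            # Row y1 of the second factor; a missing row is treated as all zeros.
            for y2, value2 in arr.get(y1, {}).items():
                out_row[y2] = out_row.get(y2, 0) + value1 * value2
        if out_row:
            result[x1] = out_row
    return result


# The 3 x 3 example matrix from the question.
example = {"3": {"1": 2}, "2": {"1": 1, "3": 5}}
squared = square_sparse(example)
```

For the example matrix the only nonzero product is 5 * 2, which lands in row 2, column 1.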

Answer 1:

NumPy is not the problem; you need to supply the data in a format that NumPy can understand. Since your matrix is really big, it won't fit in memory as a dense array (570,000 x 570,000 values at 8 bytes each is about 2.6 TB), so it's a good idea to use a sparse matrix (scipy.sparse.csr_matrix):

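import scipy.sparse

# data is the dict loaded from the JSON file; its keys are 1-based, so shift them to 0-based
values = [v for row in data.values() for v in row.values()]
rows = [int(row_n) - 1 for row_n, row in data.items() for _ in row]
cols = [int(column) - 1 for row in data.values() for column in row]
n = max(max(rows), max(cols)) + 1   # or the known order of the matrix; forces a square shape
m = scipy.sparse.csr_matrix((values, (rows, cols)), shape=(n, n))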

Squaring it is then a single expression (for a csr_matrix, ** is the matrix power, not an elementwise power):

m**2
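
To check the approach on something small, the 3 x 3 example from the question can be run end to end. This is a sketch under two assumptions that are not stated in the answer: the 1-based JSON keys are shifted to 0-based indices, and the shape is passed explicitly so the matrix is square. The names raw, m2 and dense are illustrative.

```python
import json

import scipy.sparse

raw = '{"3": {"1": 2}, "2": {"1": 1, "3": 5}}'
data = json.loads(raw)

# Collect (row, col, value) triples, shifting the 1-based keys to 0-based indices.
values, rows, cols = [], [], []
for r, row in data.items():
    for c, v in row.items():
        rows.append(int(r) - 1)
        cols.append(int(c) - 1)
        values.append(v)

n = 3  # order of the example matrix (assumed known here)
m = scipy.sparse.csr_matrix((values, (rows, cols)), shape=(n, n))

m2 = m @ m                    # matrix product; equivalent to m**2 for a csr_matrix
dense = m2.toarray().tolist() # only sensible for a tiny example like this one
```

Here dense comes out as [[0, 0, 0], [10, 0, 0], [0, 0, 0]]: the single nonzero entry is 5 * 2 in row 2, column 1.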

Answer 2:

"Now I have to somehow translate the csr_matrix back to something JSON serializable."

Here's one way to do that, using the attributes data, indices and indptr (m is the csr_matrix):

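import json

d = {}
end = m.indptr[0]
for row in range(m.shape[0]):
  start = end
  end = m.indptr[row+1]   # indptr[row]..indptr[row+1] delimits this row's entries
  if end > start:       # if row not empty
    # keys go back to 1-based strings, matching the input format
    d[str(1+row)] = dict(zip([str(1+i) for i in m.indices[start:end]], m.data[start:end]))
output = json.dumps(d, default=int)   # default=int converts NumPy integers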
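
The same conversion can be wrapped in a function and checked on a tiny matrix. This is a variant rather than the answer's exact code: csr_to_nested is a hypothetical name, and it uses .item() to turn NumPy scalars into plain Python numbers, so float values would be kept as floats instead of being passed through default=int.

```python
import json

import scipy.sparse


def csr_to_nested(m):
    """Convert a CSR matrix to {"row": {"col": value}} with 1-based string keys."""
    d = {}
    for row in range(m.shape[0]):
        start, end = m.indptr[row], m.indptr[row + 1]
        if end > start:  # skip empty rows, as in the input format
            cols = m.indices[start:end]
            vals = m.data[start:end]
            d[str(row + 1)] = {str(int(c) + 1): v.item() for c, v in zip(cols, vals)}
    return d


# Small test matrix with 0-based storage: entries at (row 2, col 1) and (row 3, col 2).
m = scipy.sparse.csr_matrix([[0, 0, 0], [10, 0, 0], [0, 3, 0]])
text = json.dumps(csr_to_nested(m))
```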

Answer 3:

"I don't know how it can hold the csr_matrix format but not the dictionary. d.update gives a MemoryError after some time."

Here's a variant which doesn't construct the whole output dictionary and JSON string in memory, but writes each row directly to the output file; this should need considerably less memory.

#!/usr/bin/env python3
…
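import json

# m is the squared csr_matrix; only one row's dict exists in memory at a time
with open("output.json", 'w') as out:
  out.write('{')          # written up front so an all-zero matrix still yields valid JSON
  delim = ''
  end = m.indptr[0]
  for row in range(m.shape[0]):
    start = end
    end = m.indptr[row+1]
    if end > start:       # if row not empty
      row_json = json.dumps(dict(zip([str(1+i) for i in m.indices[start:end]], m.data[start:end])), default=int)
      out.write(delim + '"' + str(1+row) + '": ' + row_json)
      delim = ',\n'
  out.write('}\n')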
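
A quick way to gain confidence in a row-by-row writer is to run it on a tiny matrix and parse the file back. The sketch below assumes 0-based storage in the matrix and 1-based keys in the file; write_csr_as_json and the temporary path are illustrative names, not part of the answer.

```python
import json
import os
import tempfile

import scipy.sparse


def write_csr_as_json(m, path):
    """Stream a CSR matrix to `path` as {"row": {"col": value}}, one row at a time."""
    with open(path, "w") as out:
        out.write("{")
        first = True
        for row in range(m.shape[0]):
            start, end = m.indptr[row], m.indptr[row + 1]
            if end == start:
                continue  # empty rows are omitted
            entries = {
                str(int(c) + 1): v.item()
                for c, v in zip(m.indices[start:end], m.data[start:end])
            }
            if not first:
                out.write(",\n")
            out.write(json.dumps(str(row + 1)) + ": " + json.dumps(entries))
            first = False
        out.write("}\n")


m = scipy.sparse.csr_matrix([[0, 0, 0], [10, 0, 0], [0, 3, 0]])
path = os.path.join(tempfile.mkdtemp(), "output.json")
write_csr_as_json(m, path)

# Parse the file back to confirm it is valid JSON with the expected content.
with open(path) as f:
    round_trip = json.load(f)
```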

Source: https://stackoverflow.com/questions/57948945/a-very-large-square-matrix-powered-by-2

Tags: python, numpy, matrix, optimization, memory-management