Pseudoinverse calculation in Python

Problem

I was working on the problem described here. I have two goals.

  1. For any given system of linear equations, figure out which variables have unique solutions.
  2. For those variables with unique solutions, return the minimal list of equations such that knowing those equations determines the value of that variable.

For example, in the following set of equations

X = a + b
Y = a + b + c
Z = a + b + c + d

The appropriate output should be c and d: X and Y together determine c (c = Y - X), and Y and Z together determine d (d = Z - Y).

Parameters

I'm provided a two-column pandas DataFrame called InputDataSet, where the two columns are Equation and Variable. Each row represents a variable's membership in a given equation. For example, the above set of equations would be represented as

InputDataSet = pd.DataFrame([['X','a'],['X','b'],['Y','a'],['Y','b'],['Y','c'], 
['Z','a'],['Z','b'],['Z','c'],['Z','d']],columns=['Equation','Variable'])

The output will be stored in a two-column DataFrame named OutputDataSet as well, where the first column contains the variables that have a unique solution and the second is a comma-delimited string of the minimal set of equations needed to solve that variable. For example, the correct OutputDataSet would look like

OutputDataSet = pd.DataFrame([['c','X,Y'],['d','Y,Z']],columns=['Variable','EquationList'])

Current Solution

My current solution takes the InputDataSet and converts it into a NetworkX graph. After splitting the graph into connected subgraphs, it converts each subgraph into a biadjacency matrix (the graph is bipartite by nature). From this matrix the SVD is computed, and both the nullspace and the pseudoinverse are calculated from that SVD (to see how they are calculated, look at the source code for numpy.linalg.pinv and the cookbook function for the nullspace; I fused the two functions since they both use the SVD).

After calculating the nullspace and pseudoinverse, and rounding to a given tolerance, I find all rows of the nullspace whose coefficients are all 0 and report those variables as having a unique solution; for each such variable I return the equations that have non-zero coefficients in the corresponding row of the pseudoinverse.
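
For the toy system above, the nullspace and pseudoinverse can be checked directly with plain numpy; this is only a sanity check of the idea, not the production code:

import numpy as np

# Biadjacency matrix of the toy system: rows X, Y, Z; columns a, b, c, d
A = np.array([[1., 1., 0., 0.],
              [1., 1., 1., 0.],
              [1., 1., 1., 1.]])

u, s, vt = np.linalg.svd(A)
tol = 1e-10
nullspace = vt[(s >= tol).sum():].conj().T   # columns span the nullspace of A
pinv = np.linalg.pinv(A)

print(np.round(nullspace, 2))  # rows for c and d are all zero
print(np.round(pinv, 2))       # row for c ~ [-1, 1, 0]; row for d ~ [0, -1, 1]

The nullspace rows for a and b are non-zero (those two are only determined up to a + b = X), while the rows for c and d are zero; the pseudoinverse rows for c and d point at exactly the equation pairs (X, Y) and (Y, Z). That is the test the code below applies to each connected component.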

Here is the code:

import networkx as nx
import pandas as pd
import numpy as np
import numpy.core as cr

def svd_lite(a, tol=1e-2):
    # Fusion of numpy.linalg.pinv and the cookbook nullspace function:
    # both quantities are derived from a single SVD of the input matrix.
    wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
    rcond = cr.asarray(tol)
    a = a.conjugate()
    u, s, vt = np.linalg.svd(a)
    # Rows of vt beyond the numerical rank span the nullspace of a
    nnz = (s >= tol).sum()
    ns = vt[nnz:].conj().T
    # Trim u / vt to economy size so the pseudoinverse product is conformable
    shape = a.shape
    if shape[0] > shape[1]:
        u = u[:, :shape[1]]
    elif shape[1] > shape[0]:
        vt = vt[:shape[0]]
    # Invert the singular values above the cutoff, zero out the rest
    cutoff = rcond[..., cr.newaxis] * cr.amax(s, axis=-1, keepdims=True)
    large = s > cutoff
    s = cr.divide(1, s, where=large, out=s)
    s[~large] = 0
    # Pseudoinverse: V * diag(1/s) * U^T
    res = cr.matmul(cr.swapaxes(vt, -1, -2),
                    cr.multiply(s[..., cr.newaxis], cr.swapaxes(u, -1, -2)))
    return (wrap(res), ns)

cols = InputDataSet.columns
tolexp = 2
# Build the bipartite equation/variable graph and split it into connected components
graphs = nx.connected_component_subgraphs(
    nx.from_pandas_dataframe(InputDataSet, cols[0], cols[1]))
OutputDataSet = []
Eqs = InputDataSet[cols[0]].unique()
Vars = InputDataSet[cols[1]].unique()
for i in graphs:
    # Separate each component's nodes back into equations and variables
    EqList = np.array([val for val in np.array(i.nodes) if val in Eqs])
    VarList = [val for val in np.array(i.nodes) if val in Vars]
    # Pseudoinverse and nullspace of the component's biadjacency matrix
    pinv, nulls = svd_lite(
        nx.bipartite.biadjacency_matrix(i, EqList, VarList, format='csc')
        .astype(float).todense(), tol=10**-tolexp)
    # Variables whose nullspace row is (numerically) all zero have a unique solution
    df2 = np.where(~np.round(nulls, tolexp).any(axis=1))[0]
    df3 = np.round(np.array(pinv), tolexp)
    # For each such variable, list the equations with non-zero pseudoinverse coefficients
    OutputDataSet.extend([[VarList[i], ",".join(EqList[np.nonzero(df3[i])])]
                          for i in df2])
OutputDataSet = pd.DataFrame(OutputDataSet)
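
(A side note in case anyone tries to run this: nx.from_pandas_dataframe and nx.connected_component_subgraphs come from older networkx APIs. On recent releases I believe the rough equivalent is the following, though the rest of the question assumes the code above as written.)

# Rough equivalents on newer networkx (an assumption, not what my script uses):
G = nx.from_pandas_edgelist(InputDataSet, source=cols[0], target=cols[1])
graphs = (G.subgraph(c).copy() for c in nx.connected_components(G))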

Issues

On the data that I've tested this algorithm on, it performs pretty well with decent execution time. However, the main issue is that it suggests far too many equations as required to determine a given variable.

Often, with datasets of 10,000 equations, the algorithm will claim that 8,000 of those 10,000 are required to determine a given variable, which most definitely is not the case.

I tried raising the tolerance (the value to which I round the coefficients of the pseudoinverse) to 0.1, but even then nearly 5,000 equations had non-zero coefficients.
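
To illustrate the kind of spreading I mean, here is a tiny made-up system (the names E1..E5 and F are invented for illustration): five identical copies of a + b plus a single a + b + c. Any one pair (Ei, F) determines c, yet the pseudoinverse distributes the weight over every redundant equation:

import numpy as np

# Five redundant copies of "a + b" (E1..E5) plus one "a + b + c" (F); columns a, b, c
A = np.vstack([np.tile([1., 1., 0.], (5, 1)),
               [1., 1., 1.]])

print(np.round(np.linalg.pinv(A)[2], 2))   # row of the pseudoinverse for c
# -> [-0.2 -0.2 -0.2 -0.2 -0.2  1. ]

Every one of the redundant equations gets a small non-zero coefficient for c, even though a single (Ei, F) pair would be a minimal set. Whether that is exactly what is happening in my data I can't say, but the symptom looks similar.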

I had conjectured that perhaps the pseudo-inverse is collapsing upon a non-optimal set of coefficients, but the Moore-Penrose pseudoinverse is unique, so that isn't a possibility.

Am I doing something wrong here? Or is the approach I'm taking not going to give me what I desire?

Further Notes

  1. All of the coefficients of all of the variables are 1
  2. The results the current algorithm is producing are reliable ... When I multiply any vector of equation totals by the pseudoinverse generated by the algorithm, the variables claimed to have a unique solution come out with essentially the correct values, which is promising.
  3. What I want to know here is either whether I'm doing something wrong in how I'm extrapolating information from the pseudo-inverse, or whether my approach is completely wrong.
  4. I apologize for not posting any actual results, but not only are they quite large, they are also somewhat unintuitive, since they are reformatted into an XML that would probably take another question to explain anyway.

Thank you for your time!

Source: https://stackoverflow.com/questions/51139856/pseudoinverse-calculation-in-python
