Perfect (or near) multicollinearity in Julia

Question


Running a simple regression model in Julia in the presence of perfect multicollinearity produces an error. In R, the same model runs and produces NA estimates for the affected covariates, which R reports as "not defined because of singularities". We can identify those variables using the alias() function in R.

Is there any way I can check for perfect multicollinearity in Julia prior to modeling in order to drop the collinear variables?


Answer 1:


Detecting Perfect Collinearity

Suppose that X is your design matrix. You can check for perfect multicollinearity by running:

rank(X) == size(X,2)

This will yield false if you have perfect multicollinearity.
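For example (a minimal sketch; in Julia 1.x, rank requires the LinearAlgebra standard library):

using LinearAlgebra

# Column 3 is an exact copy of column 1, so X is rank-deficient.
X = [1.0 2.0 1.0;
     3.0 4.0 3.0;
     5.0 6.0 5.0;
     7.0 8.0 7.0]

rank(X) == size(X, 2)   # false -> perfect multicollinearity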

Identifying Near Collinearity and Finding Which Columns Are Collinear or Near-Collinear

I don't know of any specific built-in function for this, but an application of some basic principles of linear algebra can determine it pretty easily. Below is a function I wrote that does this, followed by a more detailed explanation for those interested. The gist is that we want to find the eigenvalues of X'X that are zero (for perfect collinearity) or close to zero (for near collinearity). We then find the eigenvectors associated with those eigenvalues. The components of those eigenvectors that are non-zero (for perfect collinearity) or moderately large (an inherently ambiguous term, since "near collinearity" is itself ambiguous) mark the columns with collinearity problems.

using LinearAlgebra

function LinDep(A::AbstractMatrix, threshold1::Float64 = 1e-6, threshold2::Float64 = 1e-1; eigvec_output::Bool = false)
    (L, Q) = eigen(A' * A)            # eigenvalues and eigenvectors of A'A
    max_L = maximum(abs.(L))
    conditions = max_L ./ abs.(L)     # ratio of the largest eigenvalue to each eigenvalue
    max_C = maximum(conditions)
    println("Max Condition = $max_C")
    Collinear_Groups = []
    Tricky_EigVecs = []
    for (idx, lambda) in enumerate(L)
        if abs(lambda) < threshold1   # (near-)zero eigenvalue => (near-)collinear columns
            push!(Collinear_Groups, findall(abs.(Q[:, idx]) .> threshold2))
            push!(Tricky_EigVecs, Q[:, idx])
        end
    end
    if eigvec_output
        return (Collinear_Groups, Tricky_EigVecs)
    else
        return Collinear_Groups
    end
end

A simple example to start with; it's easy to see that this matrix has collinearity problems:

A1 = [1 3 1 2 ; 0 0 0 0 ; 1 0 0 0 ; 1 3 1 2]

4×4 Matrix{Int64}:
 1  3  1  2
 0  0  0  0
 1  0  0  0
 1  3  1  2

Collinear_Groups1 = LinDep(A1)

Max Condition = 5.9245306995900904e16
2-element Vector{Any}:
 [2, 3]
 [2, 3, 4]

There are two (numerically) zero eigenvalues here, so the function gives us two sets of "problem" columns. We want to remove one or more of those columns to address the collinearity. As is the nature of collinearity, there is no single "right" answer: e.g., Col3 is just 1/2 of Col4, so we could remove either one to address that particular collinearity issue.
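For instance (a minimal sketch specific to A1; this column choice is one of several valid options):

using LinearAlgebra

# Columns 2, 3, and 4 of A1 are all scalar multiples of one another,
# so keeping just one of them (here column 2) restores full column rank.
A1_reduced = A1[:, [1, 2]]
rank(A1_reduced) == size(A1_reduced, 2)   # true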

Note: the max condition here is the largest ratio of the maximum eigenvalue to each of the other eigenvalues. A general guideline is that a max condition > 100 suggests moderate collinearity, and > 1000 severe collinearity (see, e.g., Wikipedia). But a LOT depends on the specifics of your situation, so relying on simplistic rules like this is not particularly advisable. It's much better to treat this as one factor amongst many, alongside things like analysis of the eigenvectors and your knowledge of the underlying data and where you suspect collinearity might or might not be present. In any case, we see that it is huge here, which is to be expected.
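For a quick severity check, you can also use Julia's built-in cond (in LinearAlgebra), which returns the 2-norm condition number of the matrix itself. Note the scales differ: the eigenvalue ratio of X'X is the square of the singular-value ratio of X.

using LinearAlgebra

cond(A1)      # 2-norm condition number of the matrix itself
cond(A1)^2    # comparable in scale to the max condition of A1'A1 reported above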

Now, let's consider a harder situation where there isn't perfect collinearity, but just near collinearity. We can use the function as is, but I think it is helpful to switch on the eigvec_output option to see the eigenvectors that correspond to the problematic eigenvalues. You may also want to tinker with the specified thresholds to adjust the sensitivity for picking up near collinearity, or just set them both pretty big (particularly the second one) and spend most of your time examining the eigenvector outputs.

using Random
Random.seed!(42)  # set random seed for reproducibility
N = 10
A2 = rand(N, N)
A2[:, 2] = 2 .* A2[:, 3] .+ 0.8 .* A2[:, 4] .+ rand(N) ./ 100   # near collinearity
(Collinear_Groups2, Tricky_EigVecs2) = LinDep(A2, eigvec_output = true)

Max Condition = 4.6675275950744677e8

Our max condition is notably smaller now, which is nice, but still clearly quite severe.

Collinear_Groups2

1-element Vector{Any}:
 [2, 3, 4]


julia> Tricky_EigVecs2[1]
10-element Vector{Float64}:
  0.00537466
  0.414383  
 -0.844293  
 -0.339419  
  0.00320918
  0.0107623 
  0.00599574
 -0.00733916
 -0.00128179
 -0.00214224

Here we see that columns 2, 3, and 4 have relatively large components in the eigenvector associated with the problematic eigenvalue. This tells us that these are the columns with the near-collinearity problem, which of course is what we expected given how we created our matrix!
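For instance, applying the same 0.1 threshold the function uses internally to this eigenvector should recover the group:

findall(abs.(Tricky_EigVecs2[1]) .> 0.1)   # expected: [2, 3, 4]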

Why does this work?

From basic linear algebra, any symmetric matrix can be diagonalized as:

A = Q * L * Q'

where L is a diagonal matrix containing its eigenvalues and Q is a matrix whose columns are the corresponding eigenvectors.
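You can verify this numerically (a small sketch; eigen, Diagonal, and Symmetric are from the LinearAlgebra standard library):

using LinearAlgebra

M = [2.0 1.0; 1.0 2.0]          # a symmetric matrix
E = eigen(Symmetric(M))
Q, L = E.vectors, Diagonal(E.values)
Q * L * Q' ≈ M                  # true: M = Q * L * Q'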

Thus, suppose that we have a design matrix X in a regression analysis. The matrix X'X will always be symmetric and thus diagonalizable as described above.

Similarly, we always have rank(X) = rank(X'X): if X'Xv = 0, then v'X'Xv = ||Xv||^2 = 0, so Xv = 0, and the converse is immediate. This means that if X contains linearly dependent columns and is less than full rank, so is X'X.

Now, recall that by the definition of an eigenvalue L[i] and its eigenvector Q[:,i], we have:

A * Q[:,i] = L[i] * Q[:,i]

In the case that L[i] = 0 then this becomes:

A * Q[:,i] = 0

for some non-zero Q[:,i]. This is precisely the statement that the columns of A are linearly dependent.

Furthermore, A * Q[:,i] = 0 can be rewritten as a weighted sum of the columns of A, with weights given by the components of Q[:,i], equaling zero. So if we split the indices of the non-zero components into two mutually exclusive sets S1 and S2, we have

sum (j in S1) A[:,j]*Q[:,i][j] = - sum (j in S2) A[:,j]*Q[:,i][j]

I.e., some combination of columns of A can be written as a weighted combination of the other columns.

Thus, if we know that L[i] = 0 for some i, and we look at the corresponding Q[:,i] and see Q[:,i] = [0 0 1 0 2 0], then we know that column 3 = -2 times column 5, and we want to remove one or the other.
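As a quick numerical check of that last claim (a hypothetical example; the matrix and names here are made up for illustration):

using LinearAlgebra

X = rand(6, 5)
X[:, 3] = -2 .* X[:, 5]         # force column 3 = -2 * column 5
(L, Q) = eigen(Symmetric(X' * X))
i = argmin(abs.(L))             # index of the (near-)zero eigenvalue
v = Q[:, i] ./ Q[:, i][3]       # rescale so the column-3 component is 1
# v is approximately [0, 0, 1, 0, 2]: i.e., 1*X[:,3] + 2*X[:,5] = 0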



Source: https://stackoverflow.com/questions/39082724/perfect-or-near-multicollinearity-in-julia
