Correlation coefficients for sparse matrix in python?

[愿得一人] 2021-02-07 11:31

Does anyone know how to compute a correlation matrix from a very large sparse matrix in Python? Basically, I am looking for something like numpy.corrcoef that will work on a scipy sparse matrix.
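
For reference, one way to get a numpy.corrcoef-style matrix without ever densifying the input is to build it from a single sparse product (A.T @ A), then normalize by the column means and standard deviations. Below is a minimal sketch of that idea, assuming a scipy.sparse input; the sparse_corrcoef name is illustrative, not a library function, and note that the n_cols x n_cols result itself is dense:

    import numpy as np
    from scipy import sparse

    def sparse_corrcoef(A):
        # Column-wise Pearson correlations of a scipy sparse matrix.
        # A is never densified; only the (n_cols x n_cols) result is dense.
        A = sparse.csc_matrix(A, dtype=np.float64)
        n = A.shape[0]
        mean = np.asarray(A.mean(axis=0)).ravel()   # column means
        exy = (A.T @ A).toarray() / n               # E[x_i * x_j] from one sparse product
        cov = exy - np.outer(mean, mean)            # population covariance; the 1/n factor cancels below
        std = np.sqrt(np.diag(cov))                 # column standard deviations
        return cov / np.outer(std, std)             # correlation matrix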

4 Answers
  •  眼角桃花
    2021-02-07 12:15

    I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.

    I also try to minimize calculations using the following strategy:

    • Process each column
    • Start at the current column + 1 and calculate correlations moving to the right.
    • For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
    • Perform these steps for each column in the dataset except the last.

    This might be sped up further by keeping a global list of columns already marked for removal and skipping further correlation calculations for those columns, since the columns are processed out of order. However, I do not know enough about race conditions in Python to implement this tonight (a rough sketch of one way to do it appears after the usage example below).

    Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix. For example, with 100,000 features a dense float64 correlation matrix needs roughly 80 GB, while the mask is just one boolean per column.

    Check each column using this function:

    from scipy.stats import pearsonr

    def get_corr_row(idx_num, sp_mat, thresh):
        # slice the column at idx_num
        cols = sp_mat.shape[1]
        x = sp_mat[:,idx_num].toarray().ravel()
        start = idx_num + 1

        # Now slice each column to the right of idx_num
        for i in range(start, cols):
            y = sp_mat[:,i].toarray().ravel()
            # Check the Pearson correlation
            corr, pVal = pearsonr(x, y)
            # Pearson ranges from -1 to 1.
            # We check both positive and negative correlations >= thresh using abs(corr)
            if abs(corr) >= thresh:
                # Mark the column at idx_num for removal in the mask by returning False;
                # stop checking after finding the 1st correlation >= thresh
                return False
        return True
        
    

    Run the column level correlation checks in parallel:

    from joblib import Parallel, delayed  
    import multiprocessing
    
    
    def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
        
        # we must make sure the matrix is in csc format 
        # before we start doing all these column slices!  
        sp_mat = sp_mat.tocsc()
        cols = sp_mat.shape[1]
        
        if n_jobs == -1:
            # Process the work on all available CPU cores
            num_cores = multiprocessing.cpu_count()
        else:
            # Process the work on the specified number of CPU cores
            num_cores = n_jobs
    
        # Return a mask of all columns to keep by calling get_corr_row() 
        # once for each column in the matrix     
        return Parallel(n_jobs=num_cores, verbose=5)(
            delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))
    

    General Usage:

    # Get the mask using your sparse matrix and threshold.
    corr_mask = Get_Corr_Mask(X_t_fpr, 0.95)

    # Drop features that are >= 95% correlated (in absolute value) with a later column
    X_t_fpr_corr = X_t_fpr[:,corr_mask]
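
    One way to implement the skip list mentioned above without worrying much about race conditions is to share a multiprocessing.Manager() dict between the joblib workers: a stale read only costs an extra pearsonr call, it can never produce a wrong mask. This is an untested sketch of that idea, and the get_corr_row_skipping / Get_Corr_Mask_Skipping names are only illustrative:

    from multiprocessing import Manager
    from joblib import Parallel, delayed
    from scipy.stats import pearsonr

    def get_corr_row_skipping(idx_num, sp_mat, thresh, removed):
        cols = sp_mat.shape[1]
        x = sp_mat[:,idx_num].toarray().ravel()
        for i in range(idx_num + 1, cols):
            # Skip columns that another worker has already marked for removal
            if i in removed:
                continue
            y = sp_mat[:,i].toarray().ravel()
            corr, pVal = pearsonr(x, y)
            if abs(corr) >= thresh:
                # Record the removal so other workers can skip this column
                removed[idx_num] = True
                return False
        return True

    def Get_Corr_Mask_Skipping(sp_mat, thresh, n_jobs=-1):
        sp_mat = sp_mat.tocsc()
        mgr = Manager()          # keep a reference so the manager process stays alive
        removed = mgr.dict()     # the proxy is picklable, so every worker sees the same mapping
        return Parallel(n_jobs=n_jobs, verbose=5)(
            delayed(get_corr_row_skipping)(i, sp_mat, thresh, removed)
            for i in range(sp_mat.shape[1]))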
    
