Some of my columns go missing when I use df.corr in Pandas

Question

Here is my code:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('death_regression2.csv')
data3 = data.replace(r'\s+', np.nan, regex=True)


plt.figure(figsize=(90,90)) 
corr = data3.corr()

print(np.shape(list(corr)))
print(np.shape(data3))

(135,)
(4909, 204)

So before I use the correlation function, the total number of parameters was 204 (the number of columns), but after calling data3.corr() some parameters go missing and only 135 remain.

How do I check the correlation between all columns in the data?

Answer 1:

Without seeing the data itself, we cannot tell exactly why you are missing columns, so we will have to inspect what pd.DataFrame.corr does.

As the documentation outlines, it computes the pairwise correlations of columns. Because you specified no arguments, it uses the default method and calculates Pearson's r, which measures the linear correlation between two variables (X, Y). It takes values between -1 and 1, ranging from an exact negative linear correlation to an exact positive linear correlation, with 0 meaning no linear correlation (i.e., the plot of X against Y looks random and a linear regression would fit a flat slope).
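To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand with NumPy and compares it with what pandas returns. The helper name pearson_r and the sample data are made up for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd


def pearson_r(a, b):
    """Pearson's r: covariance of a and b divided by the product of their spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # centre both variables on their means
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())


# Illustrative data only: two independent columns plus an exact linear function of x
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.random(50), "y": rng.random(50)})
demo["x_neg"] = -2 * demo["x"] + 1  # exact negative linear relationship with x

r_manual = pearson_r(demo["x"], demo["y"])
r_pandas = demo.corr().loc["x", "y"]  # default method is 'pearson'
r_exact = pearson_r(demo["x"], demo["x_neg"])

print(r_manual, r_pandas, r_exact)
```

The hand-computed value should agree with the pandas value up to floating-point error, and the exact linear relationship should give a value at the negative end of the range.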

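For non-numerical variables, there is no concept of correlation (at least within the context of Pearson's r and this answer), so pd.DataFrame.corr ignores non-numerical columns (i.e., columns that are not float or integer) and drops them, which explains why you have fewer columns. Note that this silent dropping is the behaviour of pandas versions before 2.0; from 2.0 onward the default is numeric_only=False, so you have to pass numeric_only=True to get the same result, otherwise non-numerical columns raise an error.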
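To find out which columns are affected, you can compare the numerical and non-numerical columns by dtype. This is a small sketch on a made-up frame (the column names are hypothetical stand-ins for the real CSV); on your data you would run the same two select_dtypes calls on data3.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV
sample = pd.DataFrame({
    "age": [30, 41, 52],
    "rate": ["1.5", "2.0", "3.5"],          # numbers stored as strings
    "region": ["north", "south", "east"],   # genuinely non-numerical
})

# Columns that take part in the correlation matrix
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Columns that are left out (or that raise on pandas 2.0+ without numeric_only=True)
skipped_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(skipped_cols)
print(sample.dtypes)
```

Any column in the second list that you expected to be numerical is a candidate for conversion.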

If your dropped values are in fact numerical but stored (for example) as strings, you probably need to convert them before calling .corr().
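One possible way to do that conversion is pd.to_numeric with errors="coerce", which turns anything unparseable (including blank strings) into NaN. The sketch below is an assumption about what your data might look like, not a description of it: the frame, the helper name coerce_numeric and the min_valid_fraction threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical data: 'rate' holds numbers as strings with one blank entry
raw = pd.DataFrame({
    "age": [30, 41, 52, 63],
    "rate": ["1.5", "2.0", " ", "3.5"],
    "region": ["north", "south", "east", "west"],
})


def coerce_numeric(frame, min_valid_fraction=0.5):
    """Convert non-numerical columns that mostly parse as numbers; leave the rest alone."""
    out = frame.copy()
    for col in out.select_dtypes(exclude="number").columns:
        converted = pd.to_numeric(out[col], errors="coerce")  # unparseable -> NaN
        # Keep the conversion only if enough values survived it
        if converted.notna().mean() >= min_valid_fraction:
            out[col] = converted
    return out


cleaned = coerce_numeric(raw)
# Restrict to numerical columns explicitly so this works on old and new pandas
corr = cleaned.select_dtypes(include="number").corr()
print(corr)
```

Here 'rate' becomes a float column and joins the correlation matrix, while 'region' stays as text and is left out. The threshold is a judgment call; inspect the columns that fail it before discarding them.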

As an example:

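import numpy as np
import pandas as pd

x = np.random.rand(10)
y = np.random.rand(10)
x_scaled = x * 6  # exact linear function of x
cat = ['one', 'two', 'three', 'four', 'five',
       'six', 'seven', 'eight', 'nine', 'ten']  # non-numerical column

df = pd.DataFrame({'x': x, 'y': y, 'x_s': x_scaled, 'cat': cat})

# On pandas 2.0+, call df.corr(numeric_only=True); plain df.corr() raises on 'cat'
df.corr()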

returns:

            x         y       x_s
x    1.000000 -0.470699  1.000000
y   -0.470699  1.000000 -0.470699
x_s  1.000000 -0.470699  1.000000

which is our correlation matrix but our non-numerical column (cat) has been dropped.

If you plot the different numerical variables against each other (the plot from the original answer is not reproduced here), the different correlations become easier to see: x and x_s are perfectly correlated, and by chance there is a negative linear correlation between x and y.
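A plot of this kind can be produced with pandas' scatter_matrix helper. This is only a sketch of one way to do it, not necessarily how the original figure was made; the data is regenerated with a fixed seed, so the values differ from the table above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Illustrative data in the same shape as the example above
rng = np.random.default_rng(1)
x = rng.random(10)
plot_df = pd.DataFrame({
    "x": x,
    "y": rng.random(10),
    "x_s": x * 6,
    "cat": list("abcdefghij"),  # non-numerical, excluded from the plot
})

# Plot every numerical column against every other one
numeric_part = plot_df.select_dtypes(include="number")
axes = scatter_matrix(numeric_part, figsize=(6, 6), diagonal="hist")
```

In an interactive session you would drop the matplotlib.use("Agg") line and call matplotlib.pyplot.show() to display the grid.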

Source: https://stackoverflow.com/questions/54980417/some-of-my-columns-get-missing-when-i-use-df-corr-in-pandas

Tags

python

pandas

correlation