how to interpret numpy.correlate and numpy.corrcoef values?

问题

I have two 1D arrays and I want to see their inter-relationships. What procedure should I use in numpy? I am using numpy.corrcoef(arrayA, arrayB) and numpy.correlate(arrayA, arrayB) and both are giving some results that I am not able to comprehend or understand. Can somebody please shed light on how to understand and interpret those numerical results (preferably using an example)? Thanks.

回答1:

numpy.correlate simply returns the cross-correlation of two vectors.

if you need to understand cross-correlation, then start with http://en.wikipedia.org/wiki/Cross-correlation.

A good example might be seen by looking at the autocorrelation function (a vector cross-correlated with itself):

import numpy as np

# create a vector
vector = np.random.normal(0,1,size=1000) 

# insert a signal into vector
vector[::50]+=10

# perform cross-correlation for all data points
output = np.correlate(vector,vector,mode='full')

This will return a comb/shah function with a maximum when both data sets are overlapping. As this is an autocorrelation there will be no "lag" between the two input signals. The maximum of the correlation is therefore vector.size-1.

if you only want the value of the correlation for overlapping data, you can use mode='valid'.

回答2:

I can only comment on numpy.correlate at the moment. It's a powerful tool. I have used it for two purposes. The first is to find a pattern inside another pattern:

import numpy as np
import matplotlib.pyplot as plt

some_data = np.random.uniform(0,1,size=100)
subset = some_data[42:50]

mean = np.mean(some_data)
some_data_normalised = some_data - mean
subset_normalised = subset - mean

correlated = np.correlate(some_data_normalised, subset_normalised)
max_index = np.argmax(correlated)  # 42 !

The second use I have used it for (and how to interpret the result) is for frequency detection:

hz_a = np.cos(np.linspace(0,np.pi*6,100))
hz_b = np.cos(np.linspace(0,np.pi*4,100))

f, axarr = plt.subplots(2, sharex=True)

axarr[0].plot(hz_a)
axarr[0].plot(hz_b)
axarr[0].grid(True)

hz_a_autocorrelation = np.correlate(hz_a,hz_a,'same')[round(len(hz_a)/2):]
hz_b_autocorrelation = np.correlate(hz_b,hz_b,'same')[round(len(hz_b)/2):]

axarr[1].plot(hz_a_autocorrelation)
axarr[1].plot(hz_b_autocorrelation)
axarr[1].grid(True)

plt.show()

Find the index of the second peaks. From this you can work back to find the frequency.

first_min_index = np.argmin(hz_a_autocorrelation)
second_max_index = np.argmax(hz_a_autocorrelation[first_min_index:])
frequency = 1/second_max_index

回答3:

If you are perplexed about the outcome of np.correlate of int vectors, it may be due to overflow:

>>> a = np.array([4,3,2,1,0,0,0,0,10000,0,0,0], dtype='int16')
>>> np.correlate(a,a[:4])
array([    30,     20,     11,      4,      0,  10000,  20000,  30000,
   -25536], dtype=int16)

This example also explaines how correlate works:

30 = 4*4 + 3*3 + 2*2 + 1*1
20 = 4*3 + 3*2 + 2*1 + 1*0
11 = 4*2 + 3*1 + 2*0 + 1*0
...
40000 = 4*10000 + 3*0 + 2*0 + 1*0
shows up as 40000 - 2**16 = -25536

来源：https://stackoverflow.com/questions/13439718/how-to-interpret-numpy-correlate-and-numpy-corrcoef-values

标签

python

numpy

scipy

correlation