correlation

Cross-correlation (time-lag-correlation) with pandas?

感情迁移 posted on 2019-12-03 02:44:40
Question: I have various time series that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is greatest. I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos. Edit: The issue I am having with all the numpy/scipy methods is that they

How to correlate an Ordinal Categorical column in pandas?

独自空忆成欢 posted on 2019-12-03 02:07:33
I have a DataFrame df with a non-numerical column CatColumn.
           A         B    CatColumn
    0  381.1396  7.343921  Medium
    1  481.3268  6.786945  Medium
    2  263.3766  7.628746  High
    3  177.2400  5.225647  Medium-High
I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis. Answer: I am going to strongly disagree with the other comments. They miss the main point of correlation: how much does variable 1 increase or decrease as variable 2 increases or decreases? So in the very first place, order
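The point about order can be illustrated with a minimal sketch: encode the categories as ordered integer codes and use a rank correlation. The ordering Medium < Medium-High < High and the column name `CatCode` are assumptions made here for illustration; the question does not state the intended order.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# ASSUMPTION: this ordering is a guess; adjust it to the real scale.
levels = ["Medium", "Medium-High", "High"]

# Ordered categorical -> integer codes that respect the ordering.
df["CatCode"] = pd.Categorical(
    df["CatColumn"], categories=levels, ordered=True
).codes

# Spearman works on ranks, so only the order of the codes matters,
# not the (arbitrary) spacing between them.
rank_corr = df[["A", "B", "CatCode"]].corr(method="spearman")
print(rank_corr)
```

A rank-based method is used because the integer codes carry order but no meaningful distances; Pearson correlation on the codes would implicitly treat the categories as equally spaced.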

Correlated features and classification accuracy

喜夏-厌秋 posted on 2019-12-03 01:21:43
Question: I'd like to ask everyone a question about how correlated features (variables) affect the classification accuracy of machine learning algorithms. By correlated features I mean features that are correlated with each other, not with the target class (e.g. the perimeter and the area of a geometric figure, or the level of education and the average income). In my opinion correlated features negatively affect the accuracy of a classification algorithm, I'd say because the correlation makes one of them useless. Is

What is a fast way to compute column by column correlation in matlab

本小妞迷上赌 posted on 2019-12-02 23:59:10
I have two very large matrices (60x25000) and I'd like to compute the correlation only between corresponding columns of the two matrices. For example:
    corrVal(1) = corr(mat1(:,1), mat2(:,1));
    corrVal(2) = corr(mat1(:,2), mat2(:,2));
    ...
    corrVal(i) = corr(mat1(:,i), mat2(:,i));
For smaller matrices I can simply use:
    colCorr = diag( corr( mat1, mat2 ) );
but this doesn't work for very large matrices as I run out of memory. I've considered slicing up the matrices to compute the correlations and then combining the results, but it seems like a waste to compute correlation between column combinations that I
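The memory problem comes from building the full 25000x25000 correlation matrix just to read its diagonal. A minimal sketch of the column-by-column computation, written in NumPy rather than MATLAB, is shown here; the function name `columnwise_corr` and the random test data are assumptions for illustration only.

```python
import numpy as np

def columnwise_corr(mat1, mat2):
    """Pearson correlation between matching columns of two equally shaped arrays."""
    # Center each column.
    a = mat1 - mat1.mean(axis=0)
    b = mat2 - mat2.mean(axis=0)
    # Covariance numerator and the product of column norms, one value per column.
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return num / den

# Hypothetical data with the shape mentioned in the question.
rng = np.random.default_rng(0)
mat1 = rng.normal(size=(60, 25000))
mat2 = rng.normal(size=(60, 25000))

corr_val = columnwise_corr(mat1, mat2)  # shape (25000,)
```

Only arrays of the same size as the inputs are created, so memory grows with the number of columns rather than with its square.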

how to interpret numpy.correlate and numpy.corrcoef values?

对着背影说爱祢 posted on 2019-12-02 20:22:04
I have two 1D arrays and I want to see their inter-relationships. What procedure should I use in numpy? I am using numpy.corrcoef(arrayA, arrayB) and numpy.correlate(arrayA, arrayB), and both give results that I am not able to understand. Can somebody please shed light on how to interpret those numerical results (preferably using an example)? Thanks. Answer (ebarr): numpy.correlate simply returns the cross-correlation of two vectors. If you need to understand cross-correlation, then start with http://en.wikipedia.org/wiki/Cross-correlation . A good example might be
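To make the difference concrete, here is a small sketch with made-up arrays (the values of `a` and `b` are assumptions for illustration). numpy.corrcoef returns a normalized correlation matrix, while numpy.correlate returns raw sliding dot products, one per lag.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # exactly 2 * a

# corrcoef: 2x2 matrix of Pearson coefficients in [-1, 1].
# r[0, 1] is the correlation between a and b.
r = np.corrcoef(a, b)

# correlate: unnormalized cross-correlation. With mode="full" the
# output has len(a) + len(b) - 1 entries, one per lag.
xc = np.correlate(a, b, mode="full")
lags = np.arange(-(len(b) - 1), len(a))

# The zero-lag entry is the plain dot product of the two arrays.
zero_lag = xc[len(b) - 1]

# The default mode is "valid"; for equal-length inputs it returns
# only that single zero-lag value.
default = np.correlate(a, b)
```

Because `numpy.correlate` does not subtract means or divide by standard deviations, its values depend on the scale of the inputs and are not bounded by 1, which is often the source of the confusion.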

How to get the correlation between two timeseries using Pandas

好久不见. posted on 2019-12-02 19:27:10
I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data. I've been playing with Pandas to try to do this. I've created two time series, and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they're generally off by seconds), I get Null as an answer. I could get a decent answer if I could: a) interpolate/fill missing times in each time series (I know this is possible in Pandas, I just don't know how to do it) b) strip the seconds
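Both ideas from the question can be sketched in pandas. The data below is synthetic and the names `ts_a`, `ts_b`, and `floor_to_minute` are assumptions for illustration; the sketch assumes one reading per minute in each series, with timestamps offset by a few seconds.

```python
import numpy as np
import pandas as pd

# Synthetic series: same minute grid, but offset by different seconds.
idx_a = pd.date_range("2019-01-01 00:00:03", periods=60, freq="min")
idx_b = pd.date_range("2019-01-01 00:00:41", periods=60, freq="min")
signal = np.sin(np.linspace(0, 6, 60))
ts_a = pd.Series(signal, index=idx_a)
ts_b = pd.Series(2 * signal + 1, index=idx_b)

# Series.corr aligns on the index first; with no shared timestamps
# nothing is left to correlate and the result is NaN.
raw = ts_a.corr(ts_b)

# Option (b): strip the seconds by flooring each timestamp to the minute.
def floor_to_minute(ts):
    out = ts.copy()
    out.index = out.index.floor("min")
    return out

aligned = floor_to_minute(ts_a).corr(floor_to_minute(ts_b))

# Option (a): interpolate ts_b onto the timestamps of ts_a.
union = ts_a.index.union(ts_b.index)
b_on_a = ts_b.reindex(union).interpolate(method="time").reindex(ts_a.index)
interp = ts_a.corr(b_on_a)
```

Flooring is simpler but treats readings up to a minute apart as simultaneous; time-based interpolation keeps the original timing at the cost of estimating values that were never measured.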

Jmeter: Excel Upload, hard coded parameters passing in next request

柔情痞子 posted on 2019-12-02 18:11:15
Question: I have recorded a JMeter script in which an Excel file with 4 records is uploaded, and in the next request the 4 values from the Excel file are passed as separate parameters. But when I change the Excel file so that the number of values becomes 100, how will the request pick up the new values? Since there will be more than 100 records and the record count is not known in advance, parameterization and correlation do not seem possible. Please help. Answer 1: If you have an Excel (xlsx) file named test.xlsx in the "bin" folder of

Remove outliers from correlation coefficient calculation

安稳与你 posted on 2019-12-02 17:45:34
Assume we have two numeric vectors x and y. The Pearson correlation coefficient between x and y is given by cor(x, y). How can I automatically consider only a subset of x and y in the calculation (say 90%) so as to maximize the correlation coefficient? Answer: If you really want to do this (remove the largest (absolute) residuals), then we can employ the linear model to estimate the least squares solution and associated residuals and then select the middle n% of the data. Here is an example. First, generate some dummy data:
    require(MASS) ## for mvrnorm()
    set.seed(1)
    dat <- mvrnorm(1000, mu = c(4,5),
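The residual-trimming idea described in the answer can be sketched in Python with NumPy instead of R. The simulated data, the 90% fraction, and all variable names below are assumptions for illustration and are not taken from the truncated R example. This is a heuristic and is not guaranteed to find the subset with the maximum possible correlation.

```python
import numpy as np

# Hypothetical data: a noisy linear relationship.
rng = np.random.default_rng(1)
x = rng.normal(4, 1, size=1000)
y = 5 + 0.8 * (x - 4) + rng.normal(0, 0.6, size=1000)

# Least-squares line and its residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Keep the points whose absolute residual is within the 90th percentile.
keep_frac = 0.90
cutoff = np.quantile(np.abs(resid), keep_frac)
keep = np.abs(resid) <= cutoff

r_all = np.corrcoef(x, y)[0, 1]
r_trim = np.corrcoef(x[keep], y[keep])[0, 1]
print(f"all points: {r_all:.3f}, trimmed: {r_trim:.3f}")
```

Dropping points because they fit poorly inflates the coefficient by construction, so the trimmed value should be reported together with the trimming rule.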

Bootstrapped correlation in R

◇◆丶佛笑我妖孽 posted on 2019-12-02 15:05:28
Question: I am trying to do a bootstrapped correlation in R. I have two variables Var1 and Var2 and I want to get the bootstrapped p-value of the Pearson correlation. My variables look like this:
        x         y
    1   .6080522  1.707642
    2   1.4307273 1.772616
    3   0.8226198 1.768537
    4   1.7714221 1.265276
    5   1.5986213 1.855719
    6   1.0000000 1.606106
    7   1.1678940 1.671457
    8   0.6630012 1.608428
    9   1.0842423 1.670619
    10  0.5592512 1.107783
    11  1.6442616 1.492832
    12  0.8326965 1.643923
    13  1.1696954 1.763181
    14  0.7484543 1.762921
    15  1
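As a rough illustration of the resampling idea, here is a sketch in Python with NumPy rather than R. It uses only the 14 complete rows visible in the question; the number of resamples, the seed, and the two-sided p-value convention are assumptions made here, and other conventions (for example a permutation test) exist.

```python
import numpy as np

# The 14 complete rows shown in the question (row 15 is truncated).
x = np.array([0.6080522, 1.4307273, 0.8226198, 1.7714221, 1.5986213,
              1.0000000, 1.1678940, 0.6630012, 1.0842423, 0.5592512,
              1.6442616, 0.8326965, 1.1696954, 0.7484543])
y = np.array([1.707642, 1.772616, 1.768537, 1.265276, 1.855719,
              1.606106, 1.671457, 1.608428, 1.670619, 1.107783,
              1.492832, 1.643923, 1.763181, 1.762921])

r_obs = np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(42)  # ASSUMPTION: arbitrary seed
n_boot = 5000                    # ASSUMPTION: arbitrary resample count
boot_r = np.empty(n_boot)
for i in range(n_boot):
    # Resample (x, y) pairs with replacement, keeping pairs together.
    idx = rng.integers(0, len(x), size=len(x))
    boot_r[i] = np.corrcoef(x[idx], y[idx])[0, 1]

# Percentile 95% confidence interval for r.
ci_low, ci_high = np.percentile(boot_r, [2.5, 97.5])

# One common two-sided convention: twice the smaller tail of the
# bootstrap distribution on either side of zero, capped at 1.
p_boot = min(1.0, 2 * min((boot_r <= 0).mean(), (boot_r >= 0).mean()))
```

Resampling whole pairs preserves the dependence between x and y, which is what the bootstrap distribution of r needs to reflect.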

Cross-correlation (time-lag-correlation) with pandas?

╄→尐↘猪︶ㄣ posted on 2019-12-02 14:53:51
I have various time series that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is greatest. I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos. Edit: The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time series nature of my data. When I correlate a time series that starts in
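One way to stay inside pandas is to shift one series and let `Series.corr` align on the index, so series covering different periods are handled by the alignment. The sketch below uses synthetic daily data; the helper name `lagged_corr` and its `max_lag` parameter are hypothetical, and shifting by row positions assumes both series are regularly sampled without gaps.

```python
import numpy as np
import pandas as pd

def lagged_corr(a, b, max_lag):
    """Pearson correlation of a with b shifted by each lag in [-max_lag, max_lag]."""
    return pd.Series(
        {lag: a.corr(b.shift(lag)) for lag in range(-max_lag, max_lag + 1)}
    )

# Synthetic data: b is a copy of a that runs 3 days ahead,
# and it covers a shorter period than a.
idx = pd.date_range("2019-01-01", periods=200, freq="D")
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(size=200), index=idx)
b = a.shift(-3).iloc[20:180]

xcorr = lagged_corr(a, b, max_lag=10)
best_lag = xcorr.idxmax()  # lag (in rows, here days) with the highest correlation
```

Because the correlation is computed only on timestamps present in both series after shifting, the non-overlapping parts are simply ignored instead of being compared against unrelated positions in a bare array.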