pearson-correlation

Bert fine-tuned for semantic similarity

夙愿已清 submitted on 2020-06-08 12:31:33
Question: I would like to fine-tune BERT to calculate semantic similarity between sentences. I searched a lot of websites, but found almost nothing about this downstream task. I only found the STS benchmark. I wonder if I can use the STS benchmark dataset to train a fine-tuned BERT model and then apply it to my task. Is that reasonable? As far as I know, there are many ways to calculate similarity, including cosine similarity, Pearson correlation, Manhattan distance, etc. Which should I choose for semantic similarity? Answer 1: As a…
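
Since Answer 1 is cut off above, here is a minimal sketch (not taken from that answer) of the usual workflow: encode each sentence with a BERT model fine-tuned on STS-style data and score pairs with cosine similarity. The sentence-transformers package and the all-MiniLM-L6-v2 model name are assumptions chosen for illustration, not part of the original post.

from sentence_transformers import SentenceTransformer, util

# Example STS-style sentence encoder; any BERT model fine-tuned for
# sentence similarity could be substituted here.
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ["A man is playing a guitar.",
             "Someone is playing an instrument."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity in [-1, 1]; higher means more semantically similar.
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))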

Pearson correlation and nan values

最后都变了- submitted on 2019-12-18 16:33:13
Question: I have two CSV files with hundreds of columns, and I want to calculate the Pearson correlation coefficient and p-value for every pair of same-named columns across the two files. The problem is that when there is missing data (NaN) in a column, I get an error. When .dropna removes NaN values from the columns, the shapes of X and Y sometimes end up unequal (depending on how many NaNs were removed) and I receive this error: "ValueError: operands could not be broadcast together with shapes (1020,) (1016,)". Question: If row…
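
The listed answer is not shown here. One common fix, sketched below under the assumption of two pandas DataFrames df1 and df2 with matching column names, is to drop rows where either value is NaN so both arrays keep the same length, and only then call scipy.stats.pearsonr.

import pandas as pd
from scipy import stats

def columnwise_pearson(df1, df2):
    results = {}
    for col in df1.columns.intersection(df2.columns):
        # Align the two columns and drop rows where either side is NaN,
        # so x and y always have the same shape.
        pair = pd.concat([df1[col], df2[col]], axis=1, keys=['x', 'y']).dropna()
        if len(pair) > 1:                     # pearsonr needs at least two points
            results[col] = stats.pearsonr(pair['x'], pair['y'])   # (r, p-value)
    return results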

Why does this example result in NaN?

心已入冬 submitted on 2019-12-12 06:46:38
Question: I'm looking at the documentation for Statistics.corr in PySpark: https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.stat.Statistics-class.html#corr. Why does the correlation here result in NaN?

>>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
...                       Vectors.dense([6, 7, 0, 8]), Vectors.dense([9, 0, 0, 1])])
>>> pearsonCorr = Statistics.corr(rdd)
>>> print str(pearsonCorr).replace('nan', 'NaN')
[[ 1.          0.05564149         NaN  0.40047142]
 [ 0.05564149  1.                 NaN  0…
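
Although the answer itself is not reproduced here, the NaN has a simple cause: the third column of every vector in the example is 0, so that column has zero variance and the Pearson formula divides by zero. A small sketch of the same effect in plain NumPy (not Spark code):

import numpy as np

x = np.array([1.0, 4.0, 6.0, 9.0])   # first column of the example
z = np.zeros(4)                       # third column: constant, zero variance
print(np.corrcoef(x, z))              # off-diagonal entries come out as nan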

How to find the correlation between continuous and categorical variables in R

烈酒焚心 submitted on 2019-12-11 07:04:35
Question: Sorry, I edited my question. In R, you can use the cor() function to find the correlation between continuous variables, but only with the Pearson and Spearman methods. Which function should I use to get the correlation between one categorical variable and another categorical variable? And which function should I use to get the correlation between a categorical variable and a continuous variable? Thank you in advance. Source: https://stackoverflow.com/questions/41053431/how-to-find-the-correlation-between-continuous-and…
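
The answer is not included on this page. In R one would typically reach for chisq.test() (categorical vs categorical) or an ANOVA-style test (categorical vs continuous); the sketch below shows the corresponding statistics in Python, used here only for consistency with the rest of this page: Cramér's V for two categorical variables, and the point-biserial correlation for a binary categorical variable against a continuous one. All names and data are illustrative.

import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(a, b):
    # Categorical vs categorical: chi-squared on the contingency table,
    # rescaled into [0, 1].
    table = pd.crosstab(pd.Series(a), pd.Series(b))
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Binary categorical vs continuous: point-biserial correlation.
binary = np.array([0, 1, 0, 1, 1, 0])
continuous = np.array([2.3, 3.1, 1.9, 3.5, 2.8, 2.0])
r_pb, p = stats.pointbiserialr(binary, continuous)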

efficient number generator for correlation studies

痴心易碎 submitted on 2019-12-08 12:52:32
Question: My goal is to generate 7 numbers within a min and max range that correspond to a Pearson correlation coefficient greater than 0.95. I have been successful with 3 numbers (obviously, because this isn't very computationally demanding); however, for 4 numbers the computation required seems very large (on the order of 10k iterations), and 7 numbers would be almost impossible with the current code. Current code:

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x =…
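
The pearson_def code above is cut off. A much cheaper strategy than brute-force search, sketched below under the assumption that the 7 numbers must correlate with a given target sequence, is to build the candidate as the target plus a little noise and then rescale it into the requested range; an affine rescale does not change Pearson's r, so the correlation stays high by construction. The names target, lo and hi are illustrative, not from the original post.

import numpy as np

rng = np.random.default_rng(0)

def correlated_candidate(target, lo, hi, noise=0.05):
    t = np.asarray(target, dtype=float)
    # Mostly the target's shape plus a small perturbation -> high correlation.
    y = t + noise * (t.max() - t.min()) * rng.standard_normal(t.size)
    # Affine map into [lo, hi]; this preserves Pearson's r.
    return lo + (y - y.min()) * (hi - lo) / (y.max() - y.min())

target = np.array([3.0, 9.0, 1.0, 12.0, 5.0, 7.0, 2.0])
candidate = correlated_candidate(target, lo=0.0, hi=10.0)
r = np.corrcoef(target, candidate)[0, 1]   # typically well above 0.95 for small noise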

Efficient columnwise correlation coefficient calculation

人走茶凉 submitted on 2019-12-07 04:14:08
Question: Original question: I am correlating a row P of size n against every column of a matrix O of size n×m. I crafted the following code:

import numpy as np

def ColumnWiseCorrcoef(O, P):
    n = P.size
    DO = O - (np.sum(O, 0) / np.double(n))
    DP = P - (np.sum(P) / np.double(n))
    return np.dot(DP, DO) / np.sqrt(np.sum(DO ** 2, 0) * np.sum(DP ** 2))

It is more efficient than the naive approach:

def ColumnWiseCorrcoefNaive(O, P):
    return np.corrcoef(P, O.T)[0, 1:O[0].size + 1]

Here are the timings I get with numpy-1.7.1…
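
A quick sanity check, not part of the original post: assuming the ColumnWiseCorrcoef function from the question above is defined, the vectorized result should match np.corrcoef applied one column at a time.

import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 50
P = rng.standard_normal(n)          # row of size n
O = rng.standard_normal((n, m))     # matrix of size n x m

# ColumnWiseCorrcoef is the function defined in the question above.
fast = ColumnWiseCorrcoef(O, P)
naive = np.array([np.corrcoef(P, O[:, j])[0, 1] for j in range(m)])
print(np.allclose(fast, naive))     # True if both implementations agree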

Remove strongly correlated columns from DataFrame

耗尽温柔 submitted on 2019-12-06 12:20:37
I have a DataFrame like this:

dict_ = {'Date': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05'],
         'Col1': [1, 2, 3, 4, 5],
         'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
         'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]}
df = pd.DataFrame(dict_, columns=dict_.keys())

I then calculate the Pearson correlation between the columns and filter out columns that are correlated above my threshold of 0.95:

def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    df_not_correlated = ~(df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = df_not_correlated.loc[df…
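
The trimm_correlated function above is cut off. The sketch below is one complete way to finish the idea (not necessarily the asker's exact code): flag any column whose absolute correlation with another column exceeds the threshold, then keep only the unflagged columns. The Date column is omitted in the demo because .corr() only applies to numeric columns.

import numpy as np
import pandas as pd

def trimm_correlated(df_in, threshold):
    df_corr = df_in.corr(method='pearson', min_periods=1)
    # Ignore the diagonal (self-correlation is always 1); flag columns whose
    # absolute correlation with any other column exceeds the threshold.
    flagged = (df_corr.mask(np.eye(len(df_corr), dtype=bool)).abs() > threshold).any()
    un_corr_idx = flagged[~flagged].index
    return df_in[un_corr_idx]

dict_ = {'Col1': [1, 2, 3, 4, 5],
         'Col2': [1.1, 1.2, 1.3, 1.4, 1.5],
         'Col3': [0.33, 0.98, 1.54, 0.01, 0.99]}
df = pd.DataFrame(dict_)
print(trimm_correlated(df, 0.95).columns.tolist())   # ['Col3'] (Col1 and Col2 are perfectly correlated)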