statistics

How to calculate the numbers of the observations in quantiles?

夙愿已清, submitted on 2021-01-27 18:15:08
Question: Suppose I have a million observations drawn from a Gamma distribution with parameters (3, 5). I can find the quantiles using summary(), but how do I find how many observations fall between each pair of red lines, which divide the data into 10 pieces?

    a = rgamma(1e6, shape = 3, rate = 5)
    summary(a)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     0.0053  0.3455  0.5351  0.6002  0.7845  4.4458

Answer 1: We may use cut with table:

    table(cut(a, quantile(a, 0:10 / 10)))
    # (0.00202,0.22] (0.22,0.307] (0.307,0.382] (0
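The same cut-and-count idea can be sketched in Python with only the standard library. This is an illustration, not the answer's method: the helper name `decile_counts`, the seed, and the reduced sample size of 100,000 are assumptions made here to keep the example fast.

```python
import random
import statistics
from bisect import bisect_left
from collections import Counter


def decile_counts(values):
    """Count how many values fall into each of the 10 decile bins."""
    # Nine interior cut points at the 10%, 20%, ..., 90% quantiles.
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    # bisect_left returns the number of cut points strictly below v, so a
    # value equal to a cut point lands in the lower bin (right-closed bins).
    bins = Counter(bisect_left(cuts, v) for v in values)
    return [bins.get(i, 0) for i in range(10)]


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
# random.gammavariate takes (shape, scale); scale = 1 / rate = 1 / 5.
sample = [rng.gammavariate(3, 1 / 5) for _ in range(100_000)]
counts = decile_counts(sample)
print(counts)
```

Because the bins are built from the sample's own deciles, each one holds roughly a tenth of the observations. One detail worth knowing on the R side: cut() uses right-closed intervals by default, so the sample minimum is left out of the first bin unless include.lowest = TRUE is passed.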

How can I generate data which will show inverted bell curve for normal distribution

瘦欲@, submitted on 2021-01-27 11:22:41
Question: I have generated random data that follows a normal distribution using the code below:

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    rng = np.random.default_rng()
    number_of_rows = 10000
    mu = 0
    sigma = 1
    data = rng.normal(loc=mu, scale=sigma, size=number_of_rows)
    dist_plot_data = sns.distplot(data, hist=False)
    plt.show()

The code above generates the distribution plot below, as expected: If I want to create a distribution plot that is exactly an inverse curve like below
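The question is cut off before the target plot, so the following is only one possible reading of "inverted bell curve": a density proportional to the normal peak minus the normal density, restricted to a bounded interval. The sketch draws from it by rejection sampling with the standard library; the function name `sample_inverted_normal` and the `width` parameter (half-width of the interval in standard deviations) are hypothetical.

```python
import random
from statistics import NormalDist


def sample_inverted_normal(n, mu=0.0, sigma=1.0, width=3.0, seed=0):
    """Draw n values whose density is proportional to peak - pdf(x)."""
    rng = random.Random(seed)
    dist = NormalDist(mu, sigma)
    peak = dist.pdf(mu)  # highest point of the ordinary bell curve
    lo, hi = mu - width * sigma, mu + width * sigma
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi)  # propose uniformly on the interval
        # Accept with probability (peak - pdf(x)) / peak: zero at the
        # centre, close to one near the edges.
        if rng.random() < (peak - dist.pdf(x)) / peak:
            out.append(x)
    return out


data = sample_inverted_normal(10_000)
near_centre = sum(abs(x) < 0.5 for x in data)
near_edges = sum(abs(x) > 2.5 for x in data)
print(near_centre, near_edges)
```

A density plot of `data` should dip at the mean and rise toward both ends of the interval. The interval has to be bounded, because an inverted bell extended over the whole real line would not integrate to one.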

Calculate correlation coefficient between words?

孤街浪徒, submitted on 2021-01-27 06:32:09
Question: For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that the words "Barack" and "Obama" appear together more often (i.e. have a positive correlation) than other pairs. This does not seem that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, not between two words in a text. How can I best approach this problem? How can I calculate the correlation between
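One common way to turn words into numbers is to record, for each sentence or document, whether a word is present (1) or absent (0), and then correlate the two indicator columns. For binary indicators the Pearson correlation reduces to the phi coefficient of the 2x2 co-occurrence table. The sketch below assumes whitespace tokenization and lower-cased query words; the function name `phi_coefficient` and the toy documents are made up for illustration.

```python
import math


def phi_coefficient(docs, word_a, word_b):
    """Phi coefficient (Pearson correlation of presence indicators)."""
    n11 = n10 = n01 = n00 = 0
    for doc in docs:
        tokens = set(doc.lower().split())  # naive whitespace tokenizer
        has_a, has_b = word_a in tokens, word_b in tokens
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a word that is always or never present has no variance
    return (n11 * n00 - n10 * n01) / denom


docs = [
    "barack obama spoke",
    "barack obama visited paris",
    "the weather in paris",
    "the senate voted",
]
print(phi_coefficient(docs, "barack", "obama"))
print(phi_coefficient(docs, "barack", "paris"))
print(phi_coefficient(docs, "barack", "senate"))
```

Values near 1 mean the two words tend to appear in the same units of text, values near 0 mean no association, and negative values mean they tend to avoid each other. The choice of unit (sentence, paragraph, sliding window) changes the result and is left open here.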

scipy p-value returns 0.0

﹥>﹥吖頭↗, submitted on 2021-01-21 18:27:41
Question: Using a two-sample Kolmogorov-Smirnov test, I am getting a p-value of 0.0.

    >>> scipy.stats.ks_2samp(dataset1, dataset2)
    (0.65296076312083573, 0.0)

Looking at the histograms of the two datasets, I am quite confident they represent two different datasets. But, really, p = 0.0? That doesn't seem to make sense. Shouldn't it be a very small but positive number? I know the return value is of type numpy.float64. Does that have something to do with it? EDIT: data here: https://www.dropbox.com/s
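A p-value of exactly 0.0 is consistent with floating-point underflow: a 64-bit float cannot represent positive numbers much below about 5e-324, so anything smaller rounds to zero. The sketch below illustrates this with the leading term of the asymptotic Kolmogorov tail, p ≈ 2·exp(−2λ²) with λ = D·sqrt(nm/(n+m)). The sample sizes are hypothetical (the snippet does not give them), and this is not a reproduction of how scipy computes its result.

```python
import math


def ks_log_pvalue_leading_term(d, n, m):
    """Log of the leading asymptotic term 2 * exp(-2 * lambda**2)."""
    effective_n = n * m / (n + m)
    lam = math.sqrt(effective_n) * d
    return math.log(2) - 2 * lam ** 2


d = 0.65296076312083573  # KS statistic quoted in the question
n = m = 2000             # hypothetical sample sizes, not from the source

log_p = ks_log_pvalue_leading_term(d, n, m)
p = math.exp(log_p)      # too small for a 64-bit float, so it becomes 0.0

print("log p-value:", log_p)
print("p-value as a float:", p)
```

Working on the log scale keeps the magnitude visible: the true value is positive but astronomically small, and 0.0 is simply the nearest representable float.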

When a function is equal to a certain value

|▌冷眼眸甩不掉的悲伤, submitted on 2021-01-21 09:55:49
Question: I am extremely new to R, so the solution to this is probably relatively simple. I have the following function to calculate the stopping distance for an average car:

    distance <- function(mph){(2.0*(mph/60))+(0.062673*(mph^1.9862))}

And I'm plotting all stopping distances from 1 mph to 60 mph:

    range = distance(1:60)

But I need to mark where the stopping distance is equal to 120 ft. I don't have any idea how this is done in R, but I'd like to write a function where, for a stoppingdistance(x), I get
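Finding the speed at which the stopping distance equals a given value is a root-finding problem: solve distance(mph) − target = 0. Since the function is increasing in mph, bisection is enough. The sketch below ports the question's formula to Python; the helper name `speed_for_distance` and the tolerance are assumptions. In R, the built-in uniroot() serves the same purpose.

```python
def distance(mph):
    """Stopping distance in feet, same formula as in the question."""
    return 2.0 * (mph / 60) + 0.062673 * mph ** 1.9862


def speed_for_distance(target_ft, lo=1.0, hi=60.0, tol=1e-9):
    """Bisection search for the speed whose stopping distance is target_ft."""
    if not distance(lo) <= target_ft <= distance(hi):
        raise ValueError("target is outside the range covered by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # distance() is increasing, so the root lies above mid when the
        # distance at mid is still too short.
        if distance(mid) < target_ft:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


speed = speed_for_distance(120.0)
print(speed, distance(speed))
```

The returned speed can then be used to draw a marker on the plot at the point (speed, 120).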

How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

三世轮回, submitted on 2021-01-20 16:50:57
Question: I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and compute a p-value to assess statistical significance. Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores. Here are my specific questions: How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g
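For the confidence-interval part of the question, one widely used option is a percentile bootstrap over the test set: resample the (label, score) pairs with replacement, recompute the AUC each time, and take the 2.5th and 97.5th percentiles. The sketch below uses only the standard library and synthetic data; the function names, the number of resamples, and the generated scores are all assumptions, and it does not address the p-value for comparing two classifiers.

```python
import random


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ROC AUC."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample has a single class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test-set labels and classifier scores, for illustration only.
rng = random.Random(1)
labels = [i % 2 for i in range(100)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

auc = roc_auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
print(auc, lower, upper)
```

With scikit-learn available, `roc_auc` could be swapped for `sklearn.metrics.roc_auc_score` and the resampling loop kept as is.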
