statistics

How to use the spark stats?

我们两清 submitted on 2020-05-17 06:54:10
Question: I'm using spark-sql 2.4.1 and I'm trying to find quantiles (percentile 0, percentile 25, etc.) on each column of my data. Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results? Here is an example, with data as shown below:

+----+---------+-------------+----------+-----------+
|  id|     date|total_revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018| 0
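One way to keep track of several percentiles is to pair each result with the probability that produced it. In Spark, `DataFrame.approxQuantile` returns its values in the same order as the probabilities passed in; the sketch below imitates that shape in plain Python rather than calling Spark. The helper names `nearest_rank` and `percentiles_by_column` and the sample rows are hypothetical, and nearest-rank is just one of several quantile definitions.

```python
import math

def nearest_rank(sorted_values, p):
    """Nearest-rank quantile of already-sorted values for probability p in [0, 1]."""
    if not sorted_values:
        raise ValueError("no data")
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]

def percentiles_by_column(rows, columns, probabilities):
    """Return {column: {probability: value}} so each percentile can be looked up by name."""
    result = {}
    for col in columns:
        values = sorted(row[col] for row in rows)
        quantiles = [nearest_rank(values, p) for p in probabilities]
        # zip keeps the i-th result aligned with the i-th requested probability
        result[col] = dict(zip(probabilities, quantiles))
    return result

# Hypothetical rows; the original table is truncated in the question.
rows = [
    {"total_revenue": 0.1, "con_dist_1": 6},
    {"total_revenue": 0.2, "con_dist_1": 4},
    {"total_revenue": 0.3, "con_dist_1": 8},
    {"total_revenue": 0.4, "con_dist_1": 2},
]
stats = percentiles_by_column(rows, ["total_revenue", "con_dist_1"], [0.0, 0.25, 0.5, 1.0])
median_dist = stats["con_dist_1"][0.5]
```

Keying the results by probability avoids relying on list positions when more percentiles are added later.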

Calculating scale/dispersion of Gamma GLM using statsmodels

余生长醉 submitted on 2020-05-17 06:07:29
Question: I'm having trouble obtaining the dispersion parameter of simulated data using statsmodels' GLM function.

    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import numpy as np

    np.random.seed(1)

    # Generate data
    x = np.random.uniform(0, 100, 50000)
    x2 = sm.add_constant(x)
    a = 0.5
    b = 0.2
    y_true = 1/(a+(b*x))

    # Add error
    scale = 2  # the scale parameter I'm trying to obtain
    shape = y_true/scale  # given that, for Gamma, mu = scale*shape
    y = np.random.gamma(shape
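For a Gamma family the variance is `phi * mu**2`, where the dispersion `phi` equals `1 / shape`. A common estimate of `phi` is the Pearson chi-square divided by the residual degrees of freedom. The sketch below is a stdlib-only illustration of that estimate and is not statsmodels code: it plugs in the true means instead of fitted ones, the function name `pearson_dispersion` is hypothetical, and it simulates with a constant shape, which differs from the question's setup where the shape is tied to `y_true`.

```python
import random

def pearson_dispersion(y, mu, n_params=0):
    """Pearson estimate for a Gamma family: sum(((y - mu) / mu)**2) / (n - p)."""
    chi2 = sum(((yi - mi) / mi) ** 2 for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

rng = random.Random(1)
shape = 4.0  # assumed constant shape, so the dispersion should be near 1 / 4
# Means follow the inverse link from the question: mu = 1 / (a + b * x)
mu = [1.0 / (0.5 + 0.2 * rng.uniform(0, 100)) for _ in range(20000)]
# gammavariate(alpha, beta) has mean alpha * beta, so beta = mu / shape
y = [rng.gammavariate(shape, m / shape) for m in mu]

phi = pearson_dispersion(y, mu)  # n_params=0 because the true means are used
```

With fitted means from a real model, `n_params` would be the number of estimated coefficients.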

How to search for the best-fit distribution common to two goodness-of-fit tests?

别等时光非礼了梦想. submitted on 2020-05-17 05:54:13
Question: I looked into the question "Best fit Distribution plots" and found that the answers used the Kolmogorov-Smirnov test to find the best-fit distribution. I also found that the Anderson-Darling test is used to get the best-fit distribution for given data. So, I have a few questions: Question 1: If I want to combine both tests, how can I search for the maximum p-value across both tests (find the highest p-value that is common to both tests, then I extract the
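One possible way to combine the two tests is to rank every candidate distribution under each test and pick the one with the best combined rank. The sketch below is a stdlib-only illustration under stated assumptions: it compares the raw Kolmogorov-Smirnov and Anderson-Darling statistics (lower is better) instead of p-values, the candidate set and the simulated data are hypothetical, and the parameters are estimated from the same sample, so the statistics serve only for ranking.

```python
import math
import random
from statistics import NormalDist, mean, stdev

EPS = 1e-12  # keeps log() finite when a CDF returns exactly 0 or 1

def ks_statistic(sorted_x, cdf):
    """Largest gap between the empirical CDF and the candidate CDF."""
    n = len(sorted_x)
    d = 0.0
    for i, x in enumerate(sorted_x):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ad_statistic(sorted_x, cdf):
    """Anderson-Darling A^2 statistic for a fully specified candidate CDF."""
    n = len(sorted_x)
    f = [min(max(cdf(x), EPS), 1 - EPS) for x in sorted_x]
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1 - f[n - 1 - i]))
            for i in range(n))
    return -n - s / n

def best_common_fit(data, candidates):
    """Rank candidates under both statistics and return the best combined rank."""
    xs = sorted(data)
    ks = {name: ks_statistic(xs, cdf) for name, cdf in candidates.items()}
    ad = {name: ad_statistic(xs, cdf) for name, cdf in candidates.items()}
    ks_rank = {name: r for r, name in enumerate(sorted(ks, key=ks.get))}
    ad_rank = {name: r for r, name in enumerate(sorted(ad, key=ad.get))}
    combined = {name: ks_rank[name] + ad_rank[name] for name in candidates}
    return min(combined, key=combined.get), ks, ad

rng = random.Random(0)
data = [rng.gauss(10, 2) for _ in range(500)]  # hypothetical sample
m, s = mean(data), stdev(data)
candidates = {
    "normal": NormalDist(m, s).cdf,
    "exponential": lambda x: 1 - math.exp(-x / m) if x > 0 else 0.0,
}
best, ks, ad = best_common_fit(data, candidates)
```

Summing ranks is only one design choice; requiring the same winner under both tests, or comparing p-values where a library provides them, are alternatives.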

Construction of a confusion matrix

喜夏-厌秋 submitted on 2020-05-15 21:23:40
Question: I have a question concerning the construction of a confusion matrix, based on the linked question "Ranger Predicted Class Probability of each row in a data frame". If I have the following code, for example (as explained by the answer in the link):

    library(ranger)
    library(caret)
    idx = sample(nrow(iris),100)
    data = iris
    data$Species = factor(ifelse(data$Species=="versicolor",1,0))
    Train_Set = data[idx,]
    Test_Set = data[-idx,]
    mdl <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE
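Once a model returns class probabilities, a confusion matrix comes from thresholding them into predicted labels and counting (predicted, actual) pairs. The sketch below shows that counting step in plain Python instead of R; the probabilities, the 0.5 threshold, and the function name `confusion_matrix` are assumptions for illustration. Rows are predictions and columns are the reference labels, the layout caret prints.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Rows are predicted labels, columns are actual (reference) labels."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]

# Hypothetical probabilities of class 1 for six test rows
probs_class1 = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8]
actual = [1, 0, 0, 1, 0, 1]

# Assumed decision rule: predict class 1 when its probability exceeds 0.5
predicted = [1 if p > 0.5 else 0 for p in probs_class1]

matrix = confusion_matrix(actual, predicted)
accuracy = sum(matrix[i][i] for i in range(2)) / len(actual)
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of rows.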

Construction of a confusion matrix

此生再无相见时 submitted on 2020-05-15 21:20:06
Question: I have a question concerning the construction of a confusion matrix, based on the linked question "Ranger Predicted Class Probability of each row in a data frame". If I have the following code, for example (as explained by the answer in the link):

    library(ranger)
    library(caret)
    idx = sample(nrow(iris),100)
    data = iris
    data$Species = factor(ifelse(data$Species=="versicolor",1,0))
    Train_Set = data[idx,]
    Test_Set = data[-idx,]
    mdl <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE

How does the Binance API calculate priceChangePercent over 24h?

痞子三分冷 submitted on 2020-05-15 10:27:06
Question: I am developing my own app in which I want to retrieve price data over a 24h period. I have read the docs provided by Binance at https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md. Then I tried fetching the 24hr ticker price change statistics using the link https://api.binance.com/api/v1/ticker/24hr?symbol=BTCUSDT. The response is:

    {
      "symbol": "BTCUSDT",
      "priceChange": "111.60000000",
      "priceChangePercent": "1.314",
      "weightedAvgPrice": "8563.97044287",
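A common convention for a rolling-window ticker is to express the price change relative to the window's opening price. The sketch below assumes that convention, which the truncated response above does not confirm; the function name `price_change_percent`, the three-decimal rounding, and the input prices are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def price_change_percent(open_price: str, last_price: str) -> Decimal:
    """Percent change of last_price relative to open_price, to 3 decimals."""
    open_p = Decimal(open_price)
    last_p = Decimal(last_price)
    if open_p == 0:
        raise ValueError("open price must be non-zero")
    change = last_p - open_p  # assumed meaning of priceChange
    percent = change / open_p * 100
    return percent.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

# Hypothetical prices, not taken from the API response
pct = price_change_percent("8000.00", "8100.00")
```

Prices are parsed as `Decimal` because the API returns them as strings and binary floats would add rounding noise.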

Fitting Discrete Data: Negative Binomial, Poisson, Geometric Distribution

两盒软妹~` submitted on 2020-05-14 12:30:25
Question: scipy has no support for fitting discrete distributions to data. I know there are a lot of questions about this. For example, if I have an array like the one below: x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0] I couldn't apply this to the array: from scipy.stats import nbinom param = nbinom.fit(x) But I would like to ask: as of today, is there any way to fit these three discrete distributions and then choose the best fit for the discrete dataset? Answer 1: You can use Method of Moments to fit
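The method of moments mentioned in the answer matches the sample mean and variance to each distribution's theoretical moments. A minimal stdlib-only sketch follows, using the array from the question. The function names and the AIC comparison are assumptions added for illustration, the geometric distribution is taken on support 0, 1, 2, ... because the data contains zeros, and the negative binomial uses the `(r, p)` parameterization with mean `r * (1 - p) / p`.

```python
import math
from statistics import mean, variance

def poisson_loglik(data, lam):
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in data)

def geometric_loglik(data, p):
    # Support 0, 1, 2, ...: P(K = k) = (1 - p)**k * p
    return sum(k * math.log(1 - p) + math.log(p) for k in data)

def nbinom_loglik(data, r, p):
    return sum(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p) for k in data)

def fit_by_moments(data):
    """Moment estimates plus log-likelihood and AIC for each candidate."""
    m, v = mean(data), variance(data)
    fits = {
        "poisson": {"params": {"lam": m}, "loglik": poisson_loglik(data, m), "k": 1},
        "geometric": {"params": {"p": 1 / (1 + m)},
                      "loglik": geometric_loglik(data, 1 / (1 + m)), "k": 1},
    }
    if v > m:  # negative binomial moments only solve when data is overdispersed
        p, r = m / v, m * m / (v - m)
        fits["nbinom"] = {"params": {"r": r, "p": p},
                          "loglik": nbinom_loglik(data, r, p), "k": 2}
    for fit in fits.values():
        fit["aic"] = 2 * fit["k"] - 2 * fit["loglik"]
    return fits

x = [2, 3, 4, 5, 6, 7, 0, 1, 1, 0, 1, 8, 10, 9, 1, 1, 1, 0, 0]
fits = fit_by_moments(x)
best = min(fits, key=lambda name: fits[name]["aic"])
```

AIC is one possible selection rule; a chi-square goodness-of-fit test on binned counts is another.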

Is pandas showing the wrong percentile?

核能气质少年 submitted on 2020-05-13 04:51:37
Question: I'm working with this WNBA dataset here. I'm analyzing the Height variable, and below is a table showing frequency, cumulative percentage, and cumulative frequency for each height value recorded. From the table I can easily conclude that the first quartile (the 25th percentile) cannot be larger than 175. However, when I use Series.describe(), I'm told that the 25th percentile is 176.5. Why is that so?

    wnba.Height.describe()
    count    143.000000
    mean     184.566434
    std        8.685068
    min      165.000000
    25%      176
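pandas computes quantiles with linear interpolation by default, so the reported value can fall between two recorded heights. With a count of 143, the 25% position is (143 - 1) * 0.25 = 35.5, halfway between the 36th and 37th sorted values. The sketch below reproduces that rule in plain Python; the seven heights are hypothetical, chosen so the position also lands halfway between two values.

```python
def percentile_linear(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q          # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    frac = pos - lower
    return xs[lower] + frac * (xs[upper] - xs[lower])

# Hypothetical heights, not the WNBA data
heights = [165, 175, 178, 180, 183, 185, 190]
q25 = percentile_linear(heights, 0.25)   # position 1.5, between 175 and 178
```

A definition that always returns an observed value corresponds to the `lower` or `nearest` interpolation options of `Series.quantile`.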

How to write a loop to run the t-test of a data frame?

牧云@^-^@ submitted on 2020-05-11 07:17:30
Question: I have a problem running a t-test on data stored in a data frame. I know how to do it one column at a time, but that is not efficient at all. How can I write a loop to do it? For example, I have the following data in testData:

    testData <- dput(testData)
    structure(list(Label = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("Bad", "Good"), class = "factor"), F1 = c(0.647789237, 0.546087915, 0.461342005, 0.794212207, 0.569199511, 0
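The loop amounts to splitting each feature column by the label and applying the same two-sample test to every split. The sketch below shows that pattern in plain Python instead of R, computing Welch's t statistic and degrees of freedom (the unequal-variance form that R's `t.test` uses by default). The data is hypothetical because the `dput` output above is truncated, and p-values are left out since the standard library has no t-distribution CDF; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one library route for them.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical stand-in for testData: one label column and two feature columns
labels = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]
features = {
    "F1": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "F2": [0.5, 0.7, 0.9, 0.4, 0.6, 0.8],
}

results = {}
for name, column in features.items():
    good = [v for v, lab in zip(column, labels) if lab == "Good"]
    bad = [v for v, lab in zip(column, labels) if lab == "Bad"]
    results[name] = welch_t(good, bad)  # one (t, df) pair per feature
```

Storing the results in a dictionary keyed by column name keeps each test's output attached to its feature.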

How to write a loop to run the t-test of a data frame?

谁说我不能喝 submitted on 2020-05-11 07:16:46
Question: I have a problem running a t-test on data stored in a data frame. I know how to do it one column at a time, but that is not efficient at all. How can I write a loop to do it? For example, I have the following data in testData:

    testData <- dput(testData)
    structure(list(Label = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("Bad", "Good"), class = "factor"), F1 = c(0.647789237, 0.546087915, 0.461342005, 0.794212207, 0.569199511, 0