statistics | 易学教程

How to use the spark stats?

Question: I'm using spark-sql 2.4.1 and I'm trying to find quantiles (percentile 0, percentile 25, and so on) for each column of my data. Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results? Here is an example, with data as shown below:

    +----+---------+-------------+----------+-----------+
    |  id|     date|total_revenue|con_dist_1| con_dist_2|
    +----+---------+-------------+----------+-----------+
    |3310|1/15/2018|  0.010680705|         6|0.019875458|
    |3310|1/15/2018| 0

Calculating scale/dispersion of Gamma GLM using statsmodels

Question: I'm having trouble obtaining the dispersion parameter of simulated data using statsmodels' GLM function.

    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    import numpy as np

    np.random.seed(1)

    # Generate data
    x = np.random.uniform(0, 100, 50000)
    x2 = sm.add_constant(x)
    a = 0.5
    b = 0.2
    y_true = 1/(a+(b*x))

    # Add error
    scale = 2  # the scale parameter I'm trying to obtain
    shape = y_true/scale  # given that, for Gamma, mu = scale*shape
    y = np.random.gamma(shape

How to search for a common best-fit distribution across two goodness-of-fit test lists?

Question: I looked into the question "Best fit Distribution plots" and found that the answers used the Kolmogorov-Smirnov test to find the best-fit distribution. I also found that the Anderson-Darling test can be used to find the best-fit distribution for given data. So I have a few questions. Question 1: If I want to combine both tests, how can I search for the maximum p-value across both tests (find the highest p-value that is common to both tests, then I extract the

Construction of confusion matrix

Question: I have a question about constructing a confusion matrix, based on the link below: Ranger Predicted Class Probability of each row in a data frame. Suppose I have the following code, for example (as explained by the answer in the link):

    library(ranger)
    library(caret)
    idx = sample(nrow(iris),100)
    data = iris
    data$Species = factor(ifelse(data$Species=="versicolor",1,0))
    Train_Set = data[idx,]
    Test_Set = data[-idx,]
    mdl <- ranger(Species ~ ., ,data=Train_Set,importance="impurity", save.memory = TRUE

How does the Binance API calculate priceChangePercent in 24h

Question: I am developing my own app, in which I want to retrieve price data over a 24h period. I have read the docs provided by Binance at https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md Then I tried fetching the 24hr ticker price change statistics using the link https://api.binance.com/api/v1/ticker/24hr?symbol=BTCUSDT. The response is:

    {
      "symbol": "BTCUSDT",
      "priceChange": "111.60000000",
      "priceChangePercent": "1.314",
      "weightedAvgPrice": "8563.97044287",

Fitting For Discrete Data: Negative Binomial, Poisson, Geometric Distribution

Question: SciPy has no support for fitting discrete distributions to data. I know there are many questions on this subject. For example, suppose I have an array like the one below:

    x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]

I couldn't fit this array with:

    from scipy.stats import nbinom
    param = nbinom.fit(x)

I would like an up-to-date answer: is there any way to fit these three discrete distributions and then choose the best fit for the discrete dataset?

Answer 1: You can use Method of Moments to fit
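
The method-of-moments idea mentioned in the answer can be sketched as follows. This is only an illustration under stated assumptions, not the answer's own code: the function names are hypothetical, the geometric distribution is taken on support 0, 1, 2, ... (because the sample contains zeros), and comparing the candidates by AIC is one possible selection rule that the excerpt does not confirm. Note that the moment estimates for the negative binomial are not maximum-likelihood estimates, so the AIC values are approximate.

```python
import math

def mean_and_variance(data):
    """Sample mean and population variance (divide by n)."""
    n = len(data)
    m = sum(data) / n
    v = sum((k - m) ** 2 for k in data) / n
    return m, v

def fit_by_moments(data):
    """Match theoretical moments to sample moments for three candidates."""
    m, v = mean_and_variance(data)
    fits = {
        "poisson": {"lam": m},                # mean = lam
        "geometric": {"p": 1.0 / (1.0 + m)},  # mean = (1 - p) / p
    }
    if v > m:  # negative binomial needs overdispersion
        # mean = n(1-p)/p and var = n(1-p)/p**2
        fits["nbinom"] = {"n": m * m / (v - m), "p": m / v}
    return fits

def log_likelihood(name, params, data):
    """Sum of log-pmf values over the sample for the named distribution."""
    if name == "poisson":
        lam = params["lam"]
        return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in data)
    if name == "geometric":
        p = params["p"]
        return sum(k * math.log(1 - p) + math.log(p) for k in data)
    n, p = params["n"], params["p"]
    return sum(
        math.lgamma(k + n) - math.lgamma(n) - math.lgamma(k + 1)
        + n * math.log(p) + k * math.log(1 - p)
        for k in data
    )

# The array from the question.
x = [2, 3, 4, 5, 6, 7, 0, 1, 1, 0, 1, 8, 10, 9, 1, 1, 1, 0, 0]

fits = fit_by_moments(x)
# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better.
aic_scores = {
    name: 2 * len(params) - 2 * log_likelihood(name, params, x)
    for name, params in fits.items()
}
best = min(aic_scores, key=aic_scores.get)

for name in fits:
    print(name, fits[name], round(aic_scores[name], 2))
print("lowest AIC:", best)
```

The negative binomial branch is skipped when the sample variance does not exceed the mean, since the moment equations then have no valid solution.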

Is pandas showing the wrong percentile?

Question: I'm working with this WNBA dataset here. I'm analyzing the Height variable, and below is a table showing frequency, cumulative percentage, and cumulative frequency for each height value recorded: From the table I can easily conclude that the first quartile (the 25th percentile) cannot be larger than 175. However, when I use Series.describe(), I'm told that the 25th percentile is 176.5. Why is that so?

    wnba.Height.describe()
    count    143.000000
    mean     184.566434
    std        8.685068
    min      165.000000
    25%      176
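
One likely explanation is interpolation: Series.describe() computes percentiles with pandas' default linear interpolation, so the reported value can fall between two observed heights. The sketch below reimplements that rule in plain Python alongside a "lower" rule that always returns an observed value. The function names and the four-element sample are hypothetical and are not taken from the WNBA data.

```python
import math

def quantile_linear(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q  # fractional 0-based position
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

def quantile_lower(values, q):
    """Quantile that always returns the observed value at or below the position."""
    ordered = sorted(values)
    return ordered[math.floor((len(ordered) - 1) * q)]

heights = [170, 175, 178, 180]  # hypothetical sample, not the real dataset

print(quantile_linear(heights, 0.25))  # 173.75, a value nobody actually has
print(quantile_lower(heights, 0.25))   # 170, an observed value

# With the 143 rows reported by describe(), the 25% position is fractional,
# so the linear rule averages two neighbouring sorted heights.
print((143 - 1) * 0.25)  # 35.5
```

In pandas itself, Series.quantile(0.25, interpolation="lower") requests the second behaviour.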

How to write a loop to run the t-test of a data frame?

Question: I ran into a problem running a t-test on some data stored in a data frame. I know how to do it one column at a time, but that is not efficient at all. How can I write a loop to do it? For example, I have the data in testData:

    testData <- dput(testData)
    structure(list(Label = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("Bad", "Good"), class = "factor"), F1 = c(0.647789237, 0.546087915, 0.461342005, 0.794212207, 0.569199511, 0
