statistics | 易学教程

Eliminating outliers by standard deviation in SQL Server

Question: I am trying to eliminate outliers in SQL Server 2008 by standard deviation. I want only the records whose value in a specific column lies within ±1 standard deviation of that column's mean. How can I accomplish this?

Answer 1: If you are assuming a bell-curve (normal) distribution of events, then only about 68% of values will fall within 1 standard deviation of the mean (about 95% are covered by 2 standard deviations). I would load a variable with the standard deviation of your range (derived using stdev
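The filtering rule described in the question can be sketched outside the database. The following is a minimal illustration in Python rather than T-SQL, and the function name `within_k_sd` and the sample data are made up for the example. It uses the sample standard deviation, which is also what SQL Server's `STDEV` aggregate computes (`STDEVP` is the population version).

```python
from statistics import mean, stdev


def within_k_sd(values, k=1.0):
    """Keep only the values within k sample standard deviations of the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [v for v in values if abs(v - m) <= k * s]


# Hypothetical column values; 50 is the obvious outlier.
data = [10, 12, 11, 13, 12, 11, 50]
kept = within_k_sd(data, k=1.0)
print(kept)
```

Note that the outlier itself inflates both the mean and the standard deviation, so a ±1 SD band on strongly skewed data can behave differently from what the 68% rule of thumb suggests.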

Sorting algorithms for data of known statistical distribution?

Question: It just occurred to me that if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit from taking that information into account. So my question is: are there any sorting algorithms that take that kind of information into account? How good are they? Edit: an example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This
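One way the idea in the question could be realised is a bucket sort whose bucket index comes from the cumulative distribution function (CDF) of the assumed distribution. The sketch below is an illustration only, not a method taken from the original thread; `cdf_bucket_sort` is a hypothetical name, and the Gaussian parameters are assumed to be known or estimated beforehand.

```python
import random
from statistics import NormalDist


def cdf_bucket_sort(values, dist, n_buckets=None):
    """Bucket sort that spreads values evenly by mapping them through dist.cdf."""
    n = len(values)
    if n < 2:
        return list(values)
    n_buckets = n_buckets or n
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # cdf(v) lies in [0, 1]; clamp so that cdf(v) == 1.0 lands in the last bucket.
        idx = min(int(dist.cdf(v) * n_buckets), n_buckets - 1)
        buckets[idx].append(v)
    result = []
    for bucket in buckets:
        bucket.sort()  # buckets are small when the distribution assumption holds
        result.extend(bucket)
    return result


rng = random.Random(42)
sample = [rng.gauss(100.0, 15.0) for _ in range(1000)]
ordered = cdf_bucket_sort(sample, NormalDist(mu=100.0, sigma=15.0))
```

Because the CDF is monotone, concatenating the buckets in order always yields a correctly sorted list. A wrong distribution assumption only costs speed, since the buckets become unbalanced, and never costs correctness.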

Multiple distribution normality testing and transformation in pandas dataframe

Question: Situation: Let's consider a massive retail network (hundreds of products and thousands of stores), simplified as follows: stores (Store 1, Store 2) and products (Product A, Product B, Product C). I am trying to identify anomalies in sales numbers to know which stores do very well and which do very badly. My first idea was to calculate the means and standard deviations of sales and to qualify as anomalies everything that falls outside the bounds of 3 standard deviations (~0.3% of the cases in a normal distribution). However,
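As a quick check on the tail fractions behind a "k standard deviations" rule, the share of a normal distribution lying outside ±k SD can be computed directly. This is a small sketch using only the standard library; `fraction_outside` is a name chosen for the example. Under normality, roughly 5% of cases fall outside ±2 SD, while only about 0.3% fall outside ±3 SD.

```python
from statistics import NormalDist


def fraction_outside(k):
    """Two-sided tail mass of a standard normal beyond +/- k standard deviations."""
    return 2.0 * (1.0 - NormalDist().cdf(k))


for k in (1, 2, 3):
    print(f"outside +/-{k} SD: {fraction_outside(k):.4f}")
```

These fractions apply only if the sales figures are approximately normal, which is the assumption the rest of the question goes on to examine.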

Confidence Interval (CI) simulation in R: How?

Question: I was wondering how I could check via simulation in R that the 95% confidence interval obtained from a binomial test with 5 successes in 15 trials, when the true p = .5, has a 95% "coverage probability" in the long run. Here is the 95% CI for such a test using R (how can I show that the following CI has 95% coverage in the long run if the true p = .5?): as.numeric(binom.test(x = 5, n = 15, p = .5)[[4]]) # > [1] 0.1182411 0.6161963 (in the long-run 95% of the time, ".5" is contained within these # two
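Coverage is a property of the interval procedure over repeated samples, so a simulation has to draw a new success count in every replicate and recompute the interval each time. The sketch below is in Python rather than R and uses only the standard library. It assumes the interval in question is the exact (Clopper-Pearson) interval, which is what R's `binom.test` reports, and all function names are chosen for the example.

```python
import random
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def solve_increasing(g, target, iters=60):
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


def exact_coverage(n, p_true, alpha=0.05):
    """Probability, under p_true, that the interval contains p_true."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = clopper_pearson(x, n, alpha)
        if lo <= p_true <= hi:
            total += comb(n, x) * p_true**x * (1 - p_true) ** (n - x)
    return total


def simulated_coverage(n, p_true, reps=20000, alpha=0.05, seed=0):
    """Monte Carlo estimate of the same coverage probability."""
    rng = random.Random(seed)
    intervals = [clopper_pearson(x, n, alpha) for x in range(n + 1)]
    hits = 0
    for _ in range(reps):
        x = sum(rng.random() < p_true for _ in range(n))  # one binomial draw
        lo, hi = intervals[x]
        hits += lo <= p_true <= hi
    return hits / reps


ci_5_of_15 = clopper_pearson(5, 15)
coverage_exact = exact_coverage(15, 0.5)
coverage_sim = simulated_coverage(15, 0.5)
```

For n = 15 and p = .5 the exact interval is conservative: its coverage comes out at about 0.965 rather than exactly 0.95, because only the counts 4 through 11 give intervals that contain .5.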

Multivariate linear regression in pymc3

Question: I've recently started learning pymc3 after exclusively using emcee for ages, and I'm running into some conceptual problems. I'm practising with Chapter 7 of Hogg's "Fitting a model to data". This involves an MCMC fit to a straight line with arbitrary 2D uncertainties. I've accomplished this quite easily in emcee, but pymc is giving me some problems. It essentially boils down to using a multivariate Gaussian likelihood. Here is what I have so far. from pymc3 import * import numpy as np import
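For orientation, the likelihood the question refers to can be written down without any sampler. The sketch below is plain Python, not pymc3 code. It assumes the formulation in which each point's 2D Gaussian uncertainty is projected onto the direction orthogonal to the line, with the line parameterised by its angle `theta` and perpendicular offset `b_perp` (equal to b·cos θ for a line y = m·x + b with m = tan θ). The function name and argument layout are hypothetical, and the additive constant is dropped.

```python
from math import atan, cos, sin


def line_log_likelihood(theta, b_perp, points, covariances):
    """Log-likelihood (up to a constant) of a straight line for points with 2D errors.

    points:      iterable of (x, y)
    covariances: iterable of ((sxx, sxy), (sxy, syy)), one per point
    """
    vx, vy = -sin(theta), cos(theta)  # unit vector orthogonal to the line
    total = 0.0
    for (x, y), ((sxx, sxy), (_, syy)) in zip(points, covariances):
        delta = vx * x + vy * y - b_perp  # orthogonal displacement from the line
        var = vx * vx * sxx + 2 * vx * vy * sxy + vy * vy * syy  # projected variance
        total -= delta * delta / (2.0 * var)
    return total


# Hypothetical data lying exactly on y = 2x + 1, with isotropic unit covariance.
pts = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
covs = [((1.0, 0.0), (0.0, 1.0))] * 3
theta_true = atan(2.0)
ll_on_line = line_log_likelihood(theta_true, 1.0 * cos(theta_true), pts, covs)
ll_off_line = line_log_likelihood(theta_true, 2.0 * cos(theta_true), pts, covs)
```

In a sampler such as emcee, a function of this shape would be added to a log-prior on `theta` and `b_perp`. How best to express it in pymc3 is the open point of the question.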

Get node list from random walk in networkX

Question: I am new to networkX. I created a graph as follows: G = nx.read_edgelist(filename, nodetype=int, delimiter=',', data=(('weight', float),)) where the edge weights are positive but do not sum to one. Is there a built-in method that performs a random walk of k steps from a given node and returns the list of visited nodes? If not, what is the easiest way of doing it (nodes can repeat)? Pseudo-code: node = random res = [node] for i in range(0, k) read edge weights from this node an edge from this node has
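The pseudo-code in the question can be turned into a short function. The sketch below is one possible reading of it, not a networkX built-in: `weighted_random_walk` is a hypothetical helper that works on any adjacency mapping of the form `{node: {neighbor: {"weight": w}}}`. A networkX graph exposes the same shape through `G.adj`, so `G.adj` could be passed in directly, but the example uses a plain dictionary to stay self-contained.

```python
import random


def weighted_random_walk(adj, start, k, rng=None, weight="weight"):
    """Walk k steps from start, choosing each next node in proportion to edge weight."""
    rng = rng or random.Random()
    walk = [start]
    node = start
    for _ in range(k):
        neighbors = list(adj[node].items())
        if not neighbors:
            break  # dead end: no outgoing edges from this node
        nodes = [nbr for nbr, _ in neighbors]
        weights = [attrs.get(weight, 1.0) for _, attrs in neighbors]
        # random.choices treats weights as relative, so they need not sum to one.
        node = rng.choices(nodes, weights=weights, k=1)[0]
        walk.append(node)
    return walk


# Hypothetical undirected graph with positive, unnormalised weights.
adjacency = {
    1: {2: {"weight": 3.0}, 3: {"weight": 1.0}},
    2: {1: {"weight": 3.0}, 3: {"weight": 2.5}},
    3: {1: {"weight": 1.0}, 2: {"weight": 2.5}},
}
path = weighted_random_walk(adjacency, start=1, k=10, rng=random.Random(7))
```

Nodes may repeat in the returned list, as the question allows, and the list has k + 1 entries unless the walk reaches a node with no neighbors.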