statistics

Eliminating outliers by standard deviation in SQL Server

怎甘沉沦 submitted on 2020-02-22 06:11:14
Question: I am trying to eliminate outliers in SQL Server 2008 by standard deviation. I want to keep only records whose value in a specific column lies within +/- 1 standard deviation of that column's mean. How can I accomplish this? Answer 1: If you are assuming a bell-curve (normal) distribution of events, then only about 68% of values will be within 1 standard deviation of the mean (about 95% are covered by 2 standard deviations). I would load a variable with the standard deviation of your range (derived using stdev
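
The filtering rule described in the question can be sketched outside the database to check the logic before writing the query. This is a minimal sketch in Python rather than T-SQL; the function name `within_k_sd` and the sample data are hypothetical. It uses the sample standard deviation, which is assumed here to match what SQL Server's `STDEV` computes.

```python
from statistics import mean, stdev

def within_k_sd(values, k=1.0):
    """Keep only values within k sample standard deviations of the mean."""
    values = list(values)
    if len(values) < 2:
        return values  # standard deviation is undefined for fewer than 2 points
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    low, high = mu - k * sd, mu + k * sd
    return [v for v in values if low <= v <= high]

# Hypothetical data with one obvious outlier.
readings = [10, 12, 11, 13, 12, 50]
print(within_k_sd(readings, k=1.0))
```

Note that the outlier itself inflates both the mean and the standard deviation, so a single pass with a 1-SD band can still discard a sizable share of ordinary values on other data sets.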

Sorting algorithms for data of known statistical distribution?

时光怂恿深爱的人放手 submitted on 2020-02-16 18:47:29
Question: It just occurred to me that if you know something about the distribution (in the statistical sense) of the data to be sorted, a sorting algorithm might perform better by taking that information into account. So my question is: are there any sorting algorithms that take this kind of information into account? How good are they? Edit: an example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This
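
One way to make the idea in this question concrete is a bucket sort whose bucket index comes from the assumed distribution's CDF. The sketch below is an illustration, not an answer taken from the thread: the function name `cdf_bucket_sort` is hypothetical, and it assumes the data are roughly Gaussian with parameters estimated from the data itself. Because the CDF is monotone, concatenating the sorted buckets gives a sorted result even when the Gaussian assumption is poor; only the bucket balance (and therefore the speed) suffers.

```python
import random
from statistics import NormalDist, fmean, pstdev

def cdf_bucket_sort(values, n_buckets=None):
    """Sort by placing each value in the bucket given by a fitted normal CDF."""
    values = list(values)
    n = len(values)
    if n < 2:
        return values
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return values  # all values are equal, already sorted
    dist = NormalDist(mu, sigma)
    n_buckets = n_buckets or n
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # cdf(v) is in [0, 1]; clamp so that cdf(v) == 1.0 stays in range.
        idx = min(int(dist.cdf(v) * n_buckets), n_buckets - 1)
        buckets[idx].append(v)
    result = []
    for bucket in buckets:
        bucket.sort()  # buckets are small when the model fits well
        result.extend(bucket)
    return result

rng = random.Random(42)
data = [rng.gauss(100.0, 15.0) for _ in range(1000)]
assert cdf_bucket_sort(data) == sorted(data)
```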

Multiple distribution normality testing and transformation in pandas dataframe

徘徊边缘 submitted on 2020-02-06 08:02:47
Question: Situation: Let's consider a massive retail network (hundreds of products and thousands of stores), simplified as follows: stores: Store 1, Store 2; products: Product A, Product B, Product C. I am trying to identify anomalies in sales numbers to find out which stores do very well and which do very badly. My first idea was to calculate the means and standard deviations of sales and flag as anomalies everything outside the bounds of 3 standard deviations (~0.3% of the cases in a normal distribution). However,
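
As a quick sanity check on the thresholds, the share of a normal distribution lying outside ±k standard deviations can be computed directly. The helper name `fraction_outside` is hypothetical. Under an exactly normal model the two-sided tail is roughly 32% at 1 SD, 4.6% at 2 SD, and 0.27% at 3 SD, so a cutoff that flags about 5% of cases corresponds to 2 standard deviations rather than 3.

```python
from statistics import NormalDist

def fraction_outside(k):
    """Two-sided share of a normal distribution beyond +/- k standard deviations."""
    return 2.0 * (1.0 - NormalDist().cdf(k))

for k in (1, 2, 3):
    print(f"beyond {k} SD: {fraction_outside(k):.4%}")
```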

Confidence Interval (CI) simulation in R: How?

一世执手 submitted on 2020-02-02 18:56:21
Question: How can I check via simulation in R that the 95% confidence interval obtained from a binomial test with 5 successes in 15 trials, when the TRUE p = .5, has a 95% "coverage probability" in the long run? Here is the 95% CI for such a test using R (how can I show that the following CI has 95% coverage in the long run if TRUE p = .5): as.numeric(binom.test(x = 5, n = 15, p = .5)[[4]]) # > [1] 0.1182411 0.6161963 (in the long-run 95% of the time, ".5" is contained within these # two
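
Coverage is a property of the interval procedure over repeated samples, not of the single interval from x = 5, so a simulation has to recompute the interval for every simulated count. The sketch below is written in Python rather than R and assumes the interval reported by `binom.test` is the Clopper-Pearson (exact) interval; all function names are hypothetical. It solves for the bounds by bisection on the binomial CDF, then estimates coverage both by simulation and by exact enumeration over all possible counts.

```python
import random
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root_of_decreasing(g, iters=60):
    """Bisection on [0, 1] for a function that falls from positive to negative."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, conf=0.95):
    """Exact two-sided interval for a binomial proportion."""
    a = (1 - conf) / 2
    # Lower bound solves P(X >= x | p) = a; upper bound solves P(X <= x | p) = a.
    lower = 0.0 if x == 0 else _root_of_decreasing(
        lambda p: a - (1 - binom_cdf(x - 1, n, p)))
    upper = 1.0 if x == n else _root_of_decreasing(
        lambda p: binom_cdf(x, n, p) - a)
    return lower, upper

def exact_coverage(n, p, conf=0.95):
    """Probability, over all possible counts, that the interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = clopper_pearson(x, n, conf)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total

def simulated_coverage(n, p, reps=20000, conf=0.95, seed=0):
    """Share of simulated experiments whose interval contains the true p."""
    rng = random.Random(seed)
    intervals = [clopper_pearson(x, n, conf) for x in range(n + 1)]
    hits = 0
    for _ in range(reps):
        x = sum(rng.random() < p for _ in range(n))  # one Binomial(n, p) draw
        lo, hi = intervals[x]
        hits += lo <= p <= hi
    return hits / reps

print(clopper_pearson(5, 15))
print(exact_coverage(15, 0.5), simulated_coverage(15, 0.5))
```

With only 16 possible counts the coverage cannot equal 95% exactly; an exact interval of this kind is conservative, so the long-run figure should come out at or above the nominal level.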

Multivariate linear regression in pymc3

自古美人都是妖i submitted on 2020-02-01 09:03:12
Question: I've recently started learning pymc3 after exclusively using emcee for ages, and I'm running into some conceptual problems. I'm practising with Chapter 7 of Hogg's Fitting a model to data. This involves an MCMC fit to a straight line with arbitrary 2D uncertainties. I've accomplished this quite easily in emcee, but pymc is giving me some problems. It essentially boils down to using a multivariate Gaussian likelihood. Here is what I have so far. from pymc3 import * import numpy as np import
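
For orientation, the likelihood itself can be written without any sampler. The sketch below is not the poster's pymc3 model. It assumes the projected-distance form of the likelihood for a line with 2D Gaussian uncertainties: each point's displacement is projected onto the unit normal of the line and compared with the point's covariance projected onto the same direction. The function name `line_loglike` and the sample inputs are hypothetical, and the normalization term is left out.

```python
from math import atan, cos, sin

def line_loglike(m, b, points, covs):
    """Log-likelihood, without the normalization term, of the line y = m*x + b.

    points: iterable of (x, y)
    covs:   iterable of 2x2 covariances ((sxx, sxy), (sxy, syy)), one per point
    """
    theta = atan(m)                   # angle of the line with the x-axis
    vx, vy = -sin(theta), cos(theta)  # unit vector orthogonal to the line
    total = 0.0
    for (x, y), ((sxx, sxy), (_, syy)) in zip(points, covs):
        # Orthogonal displacement of the point from the line.
        delta = vx * x + vy * y - b * cos(theta)
        # Variance of the point projected onto the normal direction.
        var = vx * vx * sxx + 2.0 * vx * vy * sxy + vy * vy * syy
        total -= delta * delta / (2.0 * var)
    return total

pts = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2)]
unit_cov = ((1.0, 0.0), (0.0, 1.0))
print(line_loglike(2.0, 1.0, pts, [unit_cov] * len(pts)))
```

A function like this is what emcee takes directly as a log-probability; in a probabilistic-programming library it would have to be supplied as a custom log-probability term, and how that is done depends on the library version.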

Get node list from random walk in networkX

自作多情 submitted on 2020-01-31 18:14:28
Question: I am new to networkX. I created a graph as follows: G = nx.read_edgelist(filename, nodetype=int, delimiter=',', data=(('weight', float),)) where the edge weights are positive but do not sum to one. Is there a built-in method that makes a random walk of k steps from a certain node and returns the node list? If not, what is the easiest way of doing it (nodes can repeat)? Pseudo-code: node = random res = [node] for i in range(0, k) read edge weights from this node an edge from this node has
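
The pseudo-code above can be sketched with the standard library alone. This is a minimal illustration, not a networkX built-in: the function name `random_walk` is hypothetical, and the graph is represented as a plain dict mapping each node to a dict of neighbor weights, which would have to be built from `G` first. `random.choices` accepts relative weights, so the weights do not need to sum to one.

```python
import random

def random_walk(adj, start, k, rng=None):
    """Weighted random walk of up to k steps; returns the visited nodes in order.

    adj: dict mapping node -> {neighbor: positive weight}
    """
    rng = rng or random.Random()
    walk = [start]
    node = start
    for _ in range(k):
        neighbors = adj.get(node, {})
        if not neighbors:
            break  # dead end: stop early instead of raising
        candidates = list(neighbors)
        weights = [neighbors[n] for n in candidates]
        # Pick the next node with probability proportional to its edge weight.
        node = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(node)
    return walk

# Hypothetical undirected graph with weights that do not sum to one.
adj = {
    1: {2: 0.5, 3: 2.0},
    2: {1: 0.5, 3: 1.5},
    3: {1: 2.0, 2: 1.5},
}
print(random_walk(adj, start=1, k=5, rng=random.Random(0)))
```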