statistics

How to make stats in constant memory

孤者浪人 submitted on 2020-01-01 18:19:54
Question: I have a function that produces random numerical results. I know that the result will be an integer in a small range [a, b] (b - a is approximately 50). I want to create a function which executes the above function, say, 1000000 times and calculates how often each result appears. (The function takes a random generator to produce the result.) The problem is that I don't know how to do this in constant memory without hard-coding the range's length. My (bad) approach is like this: values ::

Difference between categorical variables (factors) and dummy variables

徘徊边缘 submitted on 2020-01-01 12:00:15
Question: I was running a regression using categorical variables and came across this question. Here, the user wanted to add a column for each dummy. This left me quite confused, because I thought that having long data, with the column including all the dummies stored using as.factor(), was equivalent to having dummy variables. Could someone explain the difference between the following two linear regression models? Linear Model 1, where Month is a factor: dt_long Sales Period Month 1: 0.4898943 1 M1 2: 0

two whole texts similarity using levenshtein distance [closed]

自闭症网瘾萝莉.ら submitted on 2020-01-01 10:58:08
Question (closed as off-topic 6 years ago): I have two text files which I'd like to compare. Here is what I did: I split both of them into sentences, then measured the Levenshtein distance between each sentence from one file and each sentence from the second file. I'd like to calculate the average similarity between those two text files, however I have

How to calculate mean and standard deviation for hue values from 0 to 360?

吃可爱长大的小学妹 submitted on 2020-01-01 05:23:07
Question: Suppose 5 samples of hue are taken using a simple HSV model for color, with values 355, 5, 5, 5, 5, all a hue of red and "next" to each other as far as perception is concerned. But the simple average is 75, which is far away from 0 or 360 and close to a yellow-green. What is a better way to calculate this mean and the associated standard deviation?

Answer 1: The simple solution is to convert those angles to a set of vectors, from polar coordinates into Cartesian coordinates. Since you are working with colors, think of
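The vector approach described in the answer can be sketched in Python as follows. This is a minimal illustration, not the answerer's code: the function name `circular_mean_std` is hypothetical, and the spread measure used here, sqrt(-2 ln R) with R the mean resultant length, is one common definition of the circular standard deviation that the excerpt itself does not specify.

```python
import math

def circular_mean_std(hues_deg):
    """Return (mean, circular standard deviation) in degrees for hue angles.

    Hypothetical helper: each hue is treated as a unit vector on the
    colour wheel, and the vectors are averaged component-wise.
    """
    n = len(hues_deg)
    # Polar -> Cartesian: average the x and y components separately.
    x = sum(math.cos(math.radians(h)) for h in hues_deg) / n
    y = sum(math.sin(math.radians(h)) for h in hues_deg) / n

    # Cartesian -> polar: the direction of the average vector is the mean hue.
    mean = math.degrees(math.atan2(y, x)) % 360.0

    # Length of the average vector (1 = all hues identical, 0 = no preferred hue).
    r = min(math.hypot(x, y), 1.0)
    # Assumed definition of circular standard deviation: sqrt(-2 ln R).
    std = math.degrees(math.sqrt(-2.0 * math.log(r))) if r > 0 else float("inf")
    return mean, std

mean, std = circular_mean_std([355, 5, 5, 5, 5])
print(f"mean hue = {mean:.2f} deg, circular std = {std:.2f} deg")
```

For the five samples in the question, the mean lands close to 3 degrees (a red) rather than 75, and the spread comes out at about 4 degrees, which matches what one would get by treating 355 as -5 and using ordinary statistics.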

How to use princomp() function in R when covariance matrix has zeros?

放肆的年华 submitted on 2020-01-01 04:36:07
Question: While using the princomp() function in R, the following error is encountered: "covariance matrix is not non-negative definite". I think this is due to some values in the covariance matrix being zero (actually close to zero, but they become zero during rounding). Is there a workaround to proceed with PCA when the covariance matrix contains zeros? [FYI: obtaining the covariance matrix is an intermediate step within the princomp() call. A data file to reproduce this error can be downloaded from here -

Fit two normal distributions (histograms) with MCMC using pymc?

吃可爱长大的小学妹 submitted on 2020-01-01 04:19:12
Question: I am trying to fit line profiles as detected with a spectrograph on a CCD. For ease of consideration, I have included a demonstration that, if solved, is very similar to the one I actually want to solve. I've looked at this: https://stats.stackexchange.com/questions/46626/fitting-model-for-two-normal-distributions-in-pymc and various other questions and answers, but they are doing something fundamentally different from what I want to do.
import pymc as mc
import numpy as np
import pylab as pl

ID3 and C4.5: How Does “Gain Ratio” Normalize “Gain”?

独自空忆成欢 submitted on 2020-01-01 03:39:30
Question: The ID3 algorithm uses the "Information Gain" measure. C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split in which records are divided evenly between the different outcomes and low otherwise. My question is: how does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the
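A small worked example may help with the definitions in the question. The sketch below is a toy illustration in Python: the data and the helper names `entropy` and `gain_and_ratio` are made up for this purpose, and SplitInfo is taken to be the entropy of the split's partition sizes, which is the usual C4.5 definition. It compares a two-way split with an eight-way split that puts every record in its own branch.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_and_ratio(feature, labels):
    """Return (information gain, SplitInfo, gain ratio) for one candidate split."""
    n = len(labels)
    # Group the class labels by the outcome of the split.
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Weighted entropy remaining after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # SplitInfo: entropy of the partition sizes, independent of the class labels.
    split_info = entropy(feature)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, split_info, ratio

# Hypothetical toy data: 8 records, two balanced classes.
labels = ["yes"] * 4 + ["no"] * 4
two_way = ["a"] * 4 + ["b"] * 4      # 2 outcomes, separates the classes perfectly
eight_way = list(range(8))           # 8 outcomes, one record each (like an ID column)

g2, s2, r2 = gain_and_ratio(two_way, labels)
g8, s8, r8 = gain_and_ratio(eight_way, labels)
print(f"two-way:   gain={g2:.3f} split_info={s2:.3f} ratio={r2:.3f}")
print(f"eight-way: gain={g8:.3f} split_info={s8:.3f} ratio={r8:.3f}")
```

Both splits reach the same Information Gain of 1 bit, so Information Gain alone cannot tell them apart. For evenly divided records, SplitInfo equals log2 of the number of outcomes (1 bit versus 3 bits here), so the number of outcomes enters through the distribution itself, and the gain ratio drops from 1 to 1/3 for the eight-way split.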

normality test of a distribution in python

我的梦境 submitted on 2020-01-01 02:40:38
Question: I have some data that I sampled from a radar satellite image and want to perform some statistical tests on. Before doing so, I wanted to conduct a normality test so I could be sure my data was normally distributed. My data appears to be normally distributed, but when I perform the test I'm getting a p-value of 0, suggesting my data is not normally distributed. I have attached my code along with the output and a histogram of the distribution (I'm relatively new to Python, so apologies if my code is

Efficient calculation of var-covar matrix in R

时光毁灭记忆、已成空白 submitted on 2019-12-31 21:58:31
Question: I'm looking for efficiency gains in calculating the (auto)covariance matrix from individual measurements over time t with t, t-1, etc. In the data matrix, each row represents an individual and each column represents monthly measurements (the columns are in time order). The data is similar to the following (although with some more covariance).
# simulate data
set.seed(1)
periods <- 70L
ind <- 90000L
mat <- sapply(rep(ind, periods), rnorm)
Below is the (ugly) code I came up with to get the

Conditionally colour data points outside of confidence bands in R

风流意气都作罢 submitted on 2019-12-31 10:49:26
Question: I need to colour data points that are outside the confidence bands on the plot below differently from those within the bands. Should I add a separate column to my dataset to record whether the data points are within the confidence bands? Can you provide an example, please? Example dataset:
## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity