statistics

Which Git commit stats are easy to pull

好久不见. Submitted on 2019-12-17 21:26:20
Question: Previously I have enjoyed TortoiseSvn's ability to generate simple commit stats for a given SVN repository. I wonder what is available in Git, and am particularly interested in: number of commits per user, number of lines changed per user, and activity over time (for instance, aggregated weekly changes). Any ideas? Answer 1: Actually, git already has a command for this: git shortlog. In your case, it sounds like you're interested in this form: git shortlog -sne. See the --help for various options. You may

Manual simulation of Markov Chain in R

半世苍凉. Submitted on 2019-12-17 20:32:31
Question: Consider the Markov chain with state space S = {1, 2}, transition matrix and initial distribution α = (1/2, 1/2). Simulate 5 steps of the Markov chain (that is, simulate X_0, X_1, ..., X_5). Repeat the simulation 100 times. Use the results of your simulations to solve the following problems. Estimate P(X_1 = 1 | X_0 = 1). Compare your result with the exact probability. My solution: # returns Xn func2 <- function(alpha1, mat1, n1) { xn <- alpha1 %*% matrixpower(mat1, n1+1) return (xn) }
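The exercise asks for a manual simulation rather than a matrix-power calculation, so one option is to draw each step with sample(). A minimal base-R sketch follows; the transition matrix is not shown in the excerpt above, so P below is a placeholder chosen only for illustration:

# Manual simulation of a two-state Markov chain in base R.
# P is a placeholder transition matrix (the real one is not shown above).
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE)
alpha <- c(1/2, 1/2)          # initial distribution over states {1, 2}

simulate_chain <- function(P, alpha, n_steps) {
  x <- numeric(n_steps + 1)
  x[1] <- sample(1:2, 1, prob = alpha)            # draw X_0
  for (i in 1:n_steps) {
    x[i + 1] <- sample(1:2, 1, prob = P[x[i], ])  # draw X_i given X_{i-1}
  }
  x
}

set.seed(1)
runs <- replicate(100, simulate_chain(P, alpha, 5))   # 100 simulations of X_0..X_5
started_in_1 <- runs[, runs[1, ] == 1]                # runs with X_0 = 1
mean(started_in_1[2, ] == 1)                          # estimate of P(X_1 = 1 | X_0 = 1)

Comparing that estimate with the exact entry P[1, 1] answers the last part of the exercise.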

How can I ensure that a partition has representative observations from each level of a factor?

放肆的年华. Submitted on 2019-12-17 19:26:00
Question: I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model-validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable? test.df <- data.frame(a = sample(c(0,1),100, rep = T), b = factor(sample
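One way to guarantee representation is to sample row indices within each level of the factor (stratified sampling) instead of from the whole data frame at once. A rough base-R sketch; the data frame, factor column, and split fraction below are illustrative stand-ins, since the original partition() and test.df definitions are cut off above:

# Stratified partition: sample a fraction of rows within every factor level,
# so each level is guaranteed at least one training observation.
stratified_partition <- function(df, f, p = 0.7) {
  idx_by_level <- split(seq_len(nrow(df)), df[[f]])
  train_idx <- unlist(lapply(idx_by_level, function(rows) {
    take <- ceiling(p * length(rows))        # ceiling() keeps at least 1 row per level
    rows[sample.int(length(rows), take)]     # sample.int avoids the one-row sample() pitfall
  }))
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

# Illustrative data only (the question's test.df is truncated above)
set.seed(1)
dat <- data.frame(a = sample(c(0, 1), 100, replace = TRUE),
                  b = factor(sample(letters[1:4], 100, replace = TRUE)))
parts <- stratified_partition(dat, "b", p = 0.7)
table(parts$train$b)   # every level of b shows up in the training set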

How can I plot the relative proportions of two groups using a fill aesthetic in ggplot2?

会有一股神秘感。 Submitted on 2019-12-17 18:44:30
Question: How can I plot the relative proportions of two groups using a fill aesthetic in ggplot2? I am asking this question here because several other answers on this topic seem incorrect (ex1, ex2, and ex3), but Cross Validated seems to have functionally banned R-specific questions (CV meta). ..density.. is conceptually related to, but distinct from, proportions (ex4 and ex5), so the correct answer does not seem to involve density. Example: set.seed(1200) test <- data.frame( test1 = factor(sample
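A common pattern for within-group proportions is to map the second variable to fill and let geom_bar(position = "fill") normalise each bar to 1, which sidesteps ..density.. entirely. A minimal sketch with made-up data, since the question's own test data frame is truncated above:

library(ggplot2)

# Illustrative data only; the question's test data frame is cut off above.
set.seed(1200)
dat <- data.frame(
  test1 = factor(sample(c("A", "B"), 200, replace = TRUE)),
  test2 = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

# position = "fill" rescales each bar, so the fill segments show the
# relative proportion of test2 within each level of test1.
ggplot(dat, aes(x = test1, fill = test2)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

If instead each bar should show its share of the whole sample, it is often simpler to compute the proportions first (for example with aggregate() or dplyr::count()) and plot them with geom_col().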

SQL: why is SELECT COUNT(*), MIN(col), MAX(col) faster than SELECT MIN(col), MAX(col)?

♀尐吖头ヾ. Submitted on 2019-12-17 18:32:06
Question: We're seeing a huge difference between these queries. The slow query: SELECT MIN(col) AS Firstdate, MAX(col) AS Lastdate FROM table WHERE status = 'OK' AND fk = 4193. Table 'table'. Scan count 2, logical reads 2458969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 1966 ms, elapsed time = 1955 ms. The fast query: SELECT count(*), MIN(col) AS Firstdate, MAX(col) AS Lastdate FROM table WHERE status =

Calculating weighted mean and standard deviation

跟風遠走. Submitted on 2019-12-17 18:22:22
Question: I have a time series x_0 ... x_t. I would like to compute the exponentially weighted variance of the data, that is: V = SUM{w_i*(x_i - x_bar)^2, i=1 to T}, where SUM{w_i} = 1 and x_bar = SUM{w_i*x_i} (ref: http://en.wikipedia.org/wiki/Weighted_mean#Weighted_sample_variance). The goal is basically to weight observations that are further back in time less. This is very simple to implement, but I would like to use as much built-in functionality as possible. Does anyone know what this corresponds to in
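In R, weighted.mean() gives x_bar directly, but there is no base function that returns exactly the weighted variance above, so it is usually computed by hand. A minimal sketch with exponentially decaying weights; the decay factor lambda and the example series are illustrative choices, not taken from the question:

# Exponentially weighted mean and variance, following the formulas above:
# x_bar = SUM{w_i * x_i},  V = SUM{w_i * (x_i - x_bar)^2},  with SUM{w_i} = 1.
ewvar <- function(x, lambda = 0.94) {
  n <- length(x)
  w <- lambda ^ ((n - 1):0)      # most recent observation gets the largest weight
  w <- w / sum(w)                # normalise so the weights sum to 1
  x_bar <- sum(w * x)            # same as weighted.mean(x, w)
  v <- sum(w * (x - x_bar)^2)
  c(mean = x_bar, var = v)
}

set.seed(42)
x <- cumsum(rnorm(100))          # made-up time series for illustration
ewvar(x, lambda = 0.94)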

Why does scikit-learn say the F1 score is ill-defined when FN is bigger than 0?

[亡魂溺海]. Submitted on 2019-12-17 17:47:29
Question: I run a Python program that calls sklearn.metrics methods to calculate precision and F1 score. Here is the output when there is no predicted sample: /xxx/py2-scikit-learn/0.15.2-comp6/lib/python2.6/site-packages/sklearn/metrics/metrics.py:1771: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. 'precision', 'predicted', average, warn_for) /xxx/py2-scikit-learn/0.15.2-comp6/lib/python2.6/site-packages/sklearn/metrics/metrics.py:1771:

How to find probability distribution and parameters for real data? (Python 3)

☆樱花仙子☆. Submitted on 2019-12-17 17:32:10
Question: I have a dataset from sklearn, and I plotted the distribution of the load_diabetes.target data (i.e. the values of the regression that the load_diabetes.data are used to predict). I used this dataset because it has the fewest variables/attributes of the regression datasets in sklearn.datasets. Using Python 3, how can I get the distribution type and parameters of the distribution this most closely resembles? All I know is that the target values are all positive and skewed (positive skew/right skew). Is there

Calculating the percentage of variance measure for k-means?

我的梦境. Submitted on 2019-12-17 17:27:07
Question: On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in
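The quantity usually graphed here is the between-cluster sum of squares as a share of the total sum of squares. The question concerns scipy, but the idea is language-agnostic; as a sketch, R's built-in kmeans() exposes both quantities directly (the data below is made up for illustration):

# "Percentage of variance explained" for k-means: between-cluster sum of
# squares divided by total sum of squares, for k = 1..8. Illustrative data only.
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

var_explained <- sapply(1:8, function(k) {
  km <- kmeans(X, centers = k, nstart = 20)
  100 * km$betweenss / km$totss
})

plot(1:8, var_explained, type = "b",
     xlab = "number of clusters k",
     ylab = "% variance explained")   # look for the elbow

As far as the scipy documentation describes it, the "distortion" returned by scipy.cluster.vq.kmeans is the mean distance from each observation to its nearest centroid, which is related to, but not the same as, this variance ratio.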

R ggplot2: using stat_summary (mean) and logarithmic scale

流过昼夜. Submitted on 2019-12-17 17:13:42
Question: I have a bunch of measurements over time and I want to plot them in R. Here is a sample of my data; I've got 6 measurements for each of 4 time points: values <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5, 2236.9, 1697.5, 2087.7, 1500.8, 2789.3, 1502.9, 2051.3, 3070.7, 3105.4, 2692.5, 1488.5, 1978.1, 1925.4, 1524.3, 2772.0, 1355.3, 2632.4, 2600.1) time <- factor(rep(c(0, 12, 24, 72), c(6, 6, 6, 6))) The scale of these data is arbitrary, and in fact I'm going to normalize it so that the
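With the data above, a minimal sketch of the usual pattern: plot the raw points, add stat_summary() for the per-time-point mean, and put the y axis on a log scale. One subtlety: scale_y_log10() transforms the data before stat_summary computes its statistic (so the plotted "mean" is effectively a geometric mean), whereas coord_trans(y = "log10") only transforms the coordinates afterwards.

library(ggplot2)

# Data as given in the question
values <- c(1012.0, 1644.9, 837.0, 1200.9, 1652.0, 981.5, 2236.9, 1697.5,
            2087.7, 1500.8, 2789.3, 1502.9, 2051.3, 3070.7, 3105.4, 2692.5,
            1488.5, 1978.1, 1925.4, 1524.3, 2772.0, 1355.3, 2632.4, 2600.1)
time <- factor(rep(c(0, 12, 24, 72), c(6, 6, 6, 6)))
df <- data.frame(time, values)

ggplot(df, aes(x = time, y = values)) +
  geom_point(alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
  # scale_y_log10() log-transforms before stat_summary runs, so the red points
  # are geometric means; use coord_trans(y = "log10") instead if the arithmetic
  # mean of the raw values is wanted on a log-scaled axis.
  scale_y_log10()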