statistics

How can I calculate the variance of a list in Python?

安稳与你 Submitted on 2019-12-30 01:58:08
Question: If I have a list like this: results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097] I want to calculate the variance of this list in Python, i.e. the average of the squared differences from the mean. How can I go about this? Accessing the elements of the list to compute the squared differences is what confuses me. Answer 1: You can use numpy's built-in function var: import numpy as np results =
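
A minimal sketch along these lines, assuming the same results list; both calls compute the population variance (the average squared deviation from the mean):

```python
import numpy as np
from statistics import pvariance

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903,
           -0.31632439, 0.53459687, -1.34069996, -1.61042692,
           -4.03220519, -0.24332097]

# Population variance: average of squared deviations from the mean
print(np.var(results))     # numpy's var (ddof=0 by default, i.e. population variance)
print(pvariance(results))  # standard-library equivalent
```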

Weighted Gaussian kernel density estimation in `python`

99封情书 Submitted on 2019-12-30 01:01:26
Question: It is currently not possible to use scipy.stats.gaussian_kde to estimate the density of a random variable based on weighted samples. What methods are available to estimate densities of continuous random variables based on weighted samples? Answer 1: Neither sklearn.neighbors.KernelDensity nor statsmodels.nonparametric seem to support weighted samples. I modified scipy.stats.gaussian_kde to allow for heterogeneous sampling weights and thought the results might be useful for others. An example is
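
As a side note, newer SciPy releases (1.2 and later) accept a weights argument on gaussian_kde directly; a short sketch with made-up samples and weights:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
weights = rng.uniform(0.1, 1.0, size=500)   # hypothetical per-sample weights

# SciPy >= 1.2 supports weighted samples via the `weights` argument
kde = gaussian_kde(samples, weights=weights)
grid = np.linspace(-4, 4, 200)
density = kde(grid)                          # estimated density on the grid
print(density[:5])
```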

Methods for Geotagging or Geolabelling Text Content

℡╲_俬逩灬. Submitted on 2019-12-30 00:38:07
Question: What are some good algorithms for automatically labeling text with its city / region of origin? That is, if a blog is about New York, how can I tell programmatically? Are there packages / papers that claim to do this with any degree of certainty? I have looked at some tf-idf based approaches and proper-noun intersections, but so far no spectacular successes, and I'd appreciate ideas! The more general question is about assigning texts to topics, given some list of topics. Simple / naive approaches
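
A very naive baseline in the spirit of the proper-noun intersection idea, sketched in Python with a hypothetical toy gazetteer (a real system would combine named-entity recognition with a proper place-name database):

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: place name -> canonical region label
GAZETTEER = {
    "new york": "New York, US",
    "brooklyn": "New York, US",
    "manhattan": "New York, US",
    "london": "London, UK",
    "paris": "Paris, FR",
}

def guess_region(text):
    """Naive baseline: count gazetteer hits and return the most frequent region."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = words + [" ".join(p) for p in zip(words, words[1:])]  # unigrams + bigrams
    hits = Counter(GAZETTEER[c] for c in candidates if c in GAZETTEER)
    return hits.most_common(1)[0][0] if hits else None

print(guess_region("I took the subway from Brooklyn to Manhattan this morning."))
```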

Ways to calculate similarity

不羁岁月 Submitted on 2019-12-29 13:19:06
Question: I am building a community website that requires me to calculate the similarity between any two users. Each user is described by the following attributes: age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junkie) and others. Can anyone tell me how to approach this problem or point me to some resources? Answer 1: Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be
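
For mixed numeric and categorical attributes like these, a Gower-style dissimilarity is a common choice; a hand-rolled Python sketch with hypothetical user records and an assumed age range used for scaling:

```python
# Hypothetical user records with mixed numeric and categorical attributes
users = [
    {"age": 25, "skin": "oily", "hair": "long",   "lifestyle": "outdoor"},
    {"age": 40, "skin": "dry",  "hair": "short",  "lifestyle": "tv"},
    {"age": 28, "skin": "oily", "hair": "medium", "lifestyle": "outdoor"},
]

AGE_RANGE = 50.0  # assumed range of the numeric attribute, used for scaling

def gower_dissimilarity(a, b):
    """Gower-style dissimilarity: numeric fields use range-scaled absolute
    difference, categorical fields contribute 0 (match) or 1 (mismatch)."""
    d_age = abs(a["age"] - b["age"]) / AGE_RANGE
    d_cat = [a[k] != b[k] for k in ("skin", "hair", "lifestyle")]
    return (d_age + sum(d_cat)) / 4  # average over the four attributes

for i in range(len(users)):
    for j in range(i + 1, len(users)):
        print(i, j, round(gower_dissimilarity(users[i], users[j]), 3))
```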

setting values for ntree and mtry for random forest regression model

耗尽温柔 Submitted on 2019-12-29 10:28:32
Question: I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201. I just wondered: what would be a good value for the number of trees ntree and the number of variables per level mtry? Is there an approximate formula to find such parameter values? Each row in my input data is a 200-character string representing the amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between the proteins.
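
A rough Python/scikit-learn analogue of tuning ntree and mtry (n_estimators and max_features in sklearn), sketched on synthetic stand-in data; the common heuristic mtry ≈ p/3 for regression is included as one candidate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))                  # stand-in for the 200 sequence features
y = X[:, :5].sum(axis=1) + rng.normal(size=500)  # hypothetical regression target

# Analogue of the usual R guidance: mtry ~ p/3 for regression, and ntree large
# enough that the error estimate stabilises.
param_grid = {
    "n_estimators": [100, 300, 500],            # ~ ntree
    "max_features": [200 // 3, "sqrt", 0.5],    # ~ mtry candidates
}
search = GridSearchCV(
    RandomForestRegressor(oob_score=True, n_jobs=-1, random_state=0),
    param_grid, cv=3,
)
search.fit(X, y)
print(search.best_params_)
```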

Call R from Java to get Chi-squared statistic and p-value

旧城冷巷雨未停 Submitted on 2019-12-29 08:06:08
Question: I have two 4*4 matrices in Java, where one matrix holds observed counts and the other expected counts. I need an automated way to calculate the p-value from the chi-square statistic between these two matrices; however, Java has no such function as far as I am aware. I can calculate the chi-square statistic and its p-value by reading the two matrices into R as .csv files and then using the chisq.test function as follows: obs<-read.csv("obs.csv") exp<-read.csv("exp.csv") chisq.test(obs,exp) where
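
As an illustration of the computation against supplied expected counts, a Python sketch with hypothetical 4x4 matrices; the degrees-of-freedom choice of (r-1)(c-1) is an assumption about the intended test:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 4x4 observed and expected count matrices
obs = np.array([[10, 12,  9, 11],
                [ 8, 14, 10, 12],
                [11,  9, 13, 10],
                [12, 10, 11,  9]], dtype=float)
exp = np.full((4, 4), obs.sum() / 16)   # uniform expected counts as a placeholder

# Chi-squared statistic against the supplied expected counts, and its p-value
stat = ((obs - exp) ** 2 / exp).sum()
df = (4 - 1) * (4 - 1)                  # assumed degrees of freedom for a 4x4 table
p_value = chi2.sf(stat, df)
print(stat, p_value)
```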

Online algorithm for calculating standard deviation

▼魔方 西西 Submitted on 2019-12-29 07:12:22
Question: My actual problem is more technical, but I will simplify it for you with an example of counting balls. Assume I have balls of different colors and one index of an array (initialized to all 0's) reserved for each color. Every time I pick a ball, I increment the corresponding index by 1. Balls are picked randomly and I can only pick one ball at a time. My sole purpose is to count the number of balls of every color, until I run out of balls. I would like to calculate the standard deviation of the
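
A standard way to compute this incrementally, without keeping the full history, is Welford's online algorithm; a Python sketch with hypothetical per-color counts:

```python
import math

class RunningStats:
    """Welford's online algorithm: update mean and variance one value at a time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0  # population std

rs = RunningStats()
for count in [3, 7, 2, 9, 5]:   # hypothetical per-color ball counts
    rs.push(count)
print(rs.mean, rs.std())
```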

Plotting functions on top of datapoints in R

断了今生、忘了曾经 Submitted on 2019-12-29 04:22:24
Question: Is there a way of overlaying a mathematical function on top of data using ggplot? ## add ggplot2 library(ggplot2) # function eq = function(x){x*x} # Data x = (1:50) y = eq(x) # Make plot object p = qplot( x, y, xlab = "X-axis", ylab = "Y-axis" ) # Plot Equation c = curve(eq) # Combine data and function p + c #? In this case my data is generated using the function, but I want to understand how to use curve() with ggplot. Answer 1: You probably want stat_function: library("ggplot2") eq <- function
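
For comparison, a matplotlib sketch of the same idea, overlaying the function x^2 on the scattered data points (an analogue of ggplot2's stat_function, not the ggplot2 answer itself):

```python
import numpy as np
import matplotlib.pyplot as plt

def eq(x):
    return x * x          # the mathematical function to overlay

x = np.arange(1, 51)
y = eq(x)                 # data points (here generated from the same function)

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")                        # the data points
grid = np.linspace(x.min(), x.max(), 200)
ax.plot(grid, eq(grid), color="red", label="x^2")     # the function drawn as a curve
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.legend()
plt.show()
```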

How to fill NA with median?

让人想犯罪 __ Submitted on 2019-12-29 04:01:28
Question: Example data: set.seed(1) df <- data.frame(years=sort(rep(2005:2010, 12)), months=1:12, value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)) head(df) years months value 1 2005 1 -0.6264538 2 2005 2 0.1836433 3 2005 3 -0.8356286 4 2005 4 1.5952808 5 2005 5 0.3295078 6 2005 6 -0.8204684 Please tell me how I can replace the NA values in df$value with the median of the other months' values. "value" must contain the median of all previous values for the same month. That is, if the current month is May, "value" must
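
A pandas analogue of the same idea, assuming the fill value should be the per-month median of the observed years (the column names mirror the R example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "years":  np.repeat(np.arange(2005, 2011), 12),
    "months": np.tile(np.arange(1, 13), 6),
    "value":  np.concatenate([rng.normal(size=60), np.full(12, np.nan)]),
})

# Fill each missing value with the median of the observed values for the same month
df["value"] = df.groupby("months")["value"].transform(lambda s: s.fillna(s.median()))
print(df.tail(12))
```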