statistics

Inverse Document Frequency Formula

Submitted by 谁说胖子不能爱 on 2020-06-15 07:25:38
Question: I'm having trouble manually calculating tf-idf values: Python's scikit-learn keeps producing different values than I expect. I keep reading that idf(term) = log(# of docs / # of docs containing the term). If so, won't you get a divide-by-zero error when no document contains the term? To solve that problem, I read that you use log(# of docs / (# of docs containing the term + 1)). But then, if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me. What am I missing?
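This is not part of the original question, but a minimal sketch of why scikit-learn's numbers differ: with the default smooth_idf=True, TfidfVectorizer uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, which avoids division by zero and never goes negative (assumes scikit-learn >= 1.0 for get_feature_names_out):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]

    vec = TfidfVectorizer(smooth_idf=True, norm=None)
    vec.fit(docs)

    terms = vec.get_feature_names_out()
    n = len(docs)
    # document frequency: number of docs containing each term
    df = np.array([sum(term in doc.split() for doc in docs) for term in terms])

    # scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    manual_idf = np.log((1 + n) / (1 + df)) + 1

    print(dict(zip(terms, np.round(manual_idf, 3))))
    print(dict(zip(terms, np.round(vec.idf_, 3))))  # matches the manual values

With this smoothing, a term like "the" that appears in every document gets idf = ln(4/4) + 1 = 1 rather than a negative value.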

How to use `hclust` function from R in Python via Rpy2 (v3)?

Submitted by ≡放荡痞女 on 2020-06-01 04:22:11
Question: There are a lot of changes between rpy2 v2 and v3. I'm porting my code and patching up some compatibility issues. One thing I can't figure out is how to get hclust to work, specifically from the fastcluster package, but I can't even get base hclust to work. A few things I do not understand: (1) Should I use R["as.dist"](rkernel) or R("as.dist")(rkernel)? (2) Why does this return a NumPy array when I'm calling it within R? (3) How can I get this dissimilarity object to work with hclust and
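Not from the original post, but a minimal rpy2 v3 sketch of one way this can work, assuming rkernel is a symmetric NumPy matrix (newer rpy2 releases may prefer conversion.get_conversion() over ro.conversion). R["as.dist"] looks the function up by name while R("as.dist") evaluates the string; both give you the same R function. The key step is converting the NumPy matrix to an R matrix so that as.dist returns a real R "dist" object that hclust accepts:

    import numpy as np
    import rpy2.robjects as ro
    from rpy2.robjects import numpy2ri
    from rpy2.robjects.conversion import localconverter
    from rpy2.robjects.packages import importr

    stats = importr("stats")

    # small symmetric matrix standing in for rkernel
    rkernel = np.array([[0.0, 0.3, 0.8],
                        [0.3, 0.0, 0.5],
                        [0.8, 0.5, 0.0]])

    as_dist = ro.r["as.dist"]  # equivalent to ro.r("as.dist")

    # convert NumPy -> R matrix only inside this block, so later results
    # stay R objects instead of being converted back to NumPy arrays
    with localconverter(ro.default_converter + numpy2ri.converter):
        r_matrix = ro.conversion.py2rpy(rkernel)

    d = as_dist(r_matrix)                  # an R 'dist' object
    hc = stats.hclust(d, method="average")
    print(ro.r["cutree"](hc, k=2))

If results come back as NumPy arrays, it is usually because numpy2ri was activated globally, which converts every R return value; keeping the conversion inside localconverter avoids that.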

Error in Friedman test

Submitted by 岁酱吖の on 2020-05-30 03:29:27
Question: Good morning, I wanted to run a Friedman test (from the "stats" package) on my data about dandelion leaves, but an error is displayed. My data has the form:

    > str(mi)
    'data.frame': 4393 obs. of 18 variables:
     $ OS_Gatunek    : Factor w/ 5 levels "Taraxacum ancistrolobum",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ PH_CreateDate : Factor w/ 15 levels "2016-04-06","2016-04-19",..: 2 2 2 2 2 2 2 2 2 2 ...
     $ L_Dl          : num 7.91 8.96 10.18 10.09 9.4 ...
     $ L_SzerMaksOs  : num 1.93 3.98 3.12 4.04 2.75 2.69 3.69 3.23 2
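Not part of the original thread, but as an illustration of the layout a Friedman test expects (an unreplicated complete block design: exactly one measurement per block-and-treatment combination), here is a small Python sketch using scipy.stats.friedmanchisquare with made-up data:

    import numpy as np
    from scipy.stats import friedmanchisquare

    # hypothetical example: 10 leaves (blocks), each measured under 3 conditions (treatments)
    rng = np.random.default_rng(0)
    cond_a = rng.normal(10.0, 1.0, size=10)
    cond_b = rng.normal(11.0, 1.0, size=10)
    cond_c = rng.normal(10.5, 1.0, size=10)

    # one array per treatment, aligned by block; repeated or missing
    # block/treatment combinations are what typically make R's
    # friedman.test complain as well
    stat, p = friedmanchisquare(cond_a, cond_b, cond_c)
    print(stat, p)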

how to sample from an upside down bell curve

Submitted by 心不动则不痛 on 2020-05-29 06:50:22
Question: I can generate numbers with a uniform distribution by using the code below: runif(1, min = 10, max = 20). How can I instead sample randomly generated numbers that fall more frequently close to the minimum and maximum boundaries (i.e. an "upside-down bell curve")? Answer 1: Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try a Beta distribution and map it to the desired interval, along the lines of:

    min <- 1
    max <- 20
    q <- min + (max - min) * rbeta(10000, 0.5, 0.5)

As @Gregor-reinstateMonica
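For reference (my own addition, not from the thread), the same idea in Python/NumPy: a Beta(0.5, 0.5) draw is U-shaped on (0, 1), so rescaling it to [10, 20] piles samples up near both boundaries:

    import numpy as np

    rng = np.random.default_rng(42)

    lo, hi = 10, 20
    # Beta(0.5, 0.5) is U-shaped; rescale it to [lo, hi]
    samples = lo + (hi - lo) * rng.beta(0.5, 0.5, size=10_000)

    # sanity check: counts should be highest in the outermost bins
    counts, _ = np.histogram(samples, bins=10, range=(lo, hi))
    print(counts)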

Django & Postgres - percentile (median) and group by

Submitted by 微笑、不失礼 on 2020-05-28 07:27:46
Question: I need to calculate period medians per seller ID (see the simplified model below). The problem is that I am unable to construct the ORM query.

Model:

    class MyModel:
        period = models.IntegerField(null=True, default=None)
        seller_ids = ArrayField(models.IntegerField(), default=list)
        aux = JSONField(default=dict)

Query:

    queryset = (
        MyModel.objects.filter(period=25)
        .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
        .values("seller_id")
        .annotate(
            duration=Cast(KeyTextTransform("duration", "aux"),
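One common way to express a median in the Django ORM on PostgreSQL (a sketch of the usual pattern, not the accepted answer; the Median class and median_duration name are mine, and the KeyTextTransform import location varies by Django version) is a custom Aggregate that renders PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...):

    from django.db.models import Aggregate, F, FloatField, Func
    from django.db.models.fields.json import KeyTextTransform  # pre-3.1: django.contrib.postgres.fields.jsonb
    from django.db.models.functions import Cast


    class Median(Aggregate):
        # renders: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY <expression>)
        function = "PERCENTILE_CONT"
        name = "median"
        output_field = FloatField()
        template = "%(function)s(0.5) WITHIN GROUP (ORDER BY %(expressions)s)"


    # MyModel as defined in the question
    queryset = (
        MyModel.objects.filter(period=25)
        .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
        .values("seller_id")  # group by the unnested seller_id
        .annotate(
            median_duration=Median(
                Cast(KeyTextTransform("duration", "aux"), FloatField())
            )
        )
    )

Depending on the Django version, grouping by an annotation produced by unnest may still need a subquery or raw SQL, so treat this as a starting point rather than a drop-in answer.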

Django & Postgres - percentile (median) and group by

别来无恙 提交于 2020-05-28 07:27:27
问题 I need to calculate period medians per seller ID (see simplyfied model below). The problem is I am unable to construct the ORM query. Model class MyModel: period = models.IntegerField(null=True, default=None) seller_ids = ArrayField(models.IntegerField(), default=list) aux = JSONField(default=dict) Query queryset = ( MyModel.objects.filter(period=25) .annotate(seller_id=Func(F("seller_ids"), function="unnest")) .values("seller_id") .annotate( duration=Cast(KeyTextTransform("duration", "aux"),

Reverse Box-Cox transformation

Submitted by 隐身守侯 on 2020-05-24 21:13:08
Question: I am using SciPy's boxcox function to perform a Box-Cox transformation on a continuous variable.

    from scipy.stats import boxcox
    import numpy as np

    y = np.random.random(100)
    y_box, lambda_ = boxcox(y + 1)  # add 1 to be able to transform 0 values

Then, I fit a statistical model to predict the values of this Box-Cox transformed variable. The model predictions are on the Box-Cox scale and I want to transform them back to the original scale of the variable. from sklearn.ensemble import
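A sketch of the usual way back (my own addition, assuming the predictions are on the shifted y + 1 Box-Cox scale): scipy.special.inv_boxcox inverts the transform for the fitted lambda, after which the added 1 is subtracted again:

    import numpy as np
    from scipy.special import inv_boxcox
    from scipy.stats import boxcox

    y = np.random.random(100)
    y_box, lambda_ = boxcox(y + 1)      # forward transform on the shifted variable

    # pretend these are model predictions on the Box-Cox scale
    predictions_box = y_box.copy()

    # invert the Box-Cox transform with the same lambda, then undo the +1 shift
    predictions = inv_boxcox(predictions_box, lambda_) - 1

    print(np.allclose(predictions, y))  # True for this round-trip check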

How to find the mean and standard deviation of rows in dataframes with some having NAs and others not

Submitted by 点点圈 on 2020-05-24 07:35:12
Question: My lab has separate groups for parents and children in the study, and we have the data collected in one data frame right now. There are specific questions asked of children and some asked of parents. We have named them SCAREDC (scared child) and SCAREDP (scared parent) respectively. Naturally, SCAREDC will have NAs for the parents and SCAREDP will have NAs for the children in the data frame. Currently, my data frame looks like this:

    head(child_parent_total)
      familySID time SCAREDC1 SCAREDC2 SCAREDC3
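Not from the thread, but to illustrate the row-wise idea in Python/pandas (in R the equivalent would be rowMeans and apply with na.rm = TRUE over the SCAREDC columns): take means and standard deviations across only the SCAREDC* columns while skipping NAs, so parent rows simply come out as NaN. The column names below are made up to mirror the question:

    import numpy as np
    import pandas as pd

    # hypothetical data: children answer SCAREDC*, parents answer SCAREDP*
    df = pd.DataFrame({
        "familySID": [1, 1, 2, 2],
        "SCAREDC1": [2.0, np.nan, 3.0, np.nan],
        "SCAREDC2": [1.0, np.nan, 2.0, np.nan],
        "SCAREDP1": [np.nan, 4.0, np.nan, 5.0],
    })

    scaredc = df.filter(regex=r"^SCAREDC")   # just the child columns

    # skipna=True is the default, so NAs (the parent rows) are ignored per row
    df["SCAREDC_mean"] = scaredc.mean(axis=1)
    df["SCAREDC_sd"] = scaredc.std(axis=1)

    print(df)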

How to use the spark stats?

Submitted by ⅰ亾dé卋堺 on 2020-05-17 06:54:31
Question: I'm using spark-sql-2.4.1v, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data. As I am computing multiple percentiles, how do I retrieve each calculated percentile from the results? Here is an example, with data as shown below:

    +----+---------+-------------+----------+-----------+
    |  id|     date|total_revenue|con_dist_1| con_dist_2|
    +----+---------+-------------+----------+-----------+
    |3310|1/15/2018|  0.010680705|         6|0.019875458|
    |3310|1/15/2018| 0
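A PySpark sketch of one way to get several percentiles per column in a single call (my own illustration, not the accepted answer; the second sample row is made up because the excerpt is truncated): DataFrame.approxQuantile accepts a list of columns and a list of probabilities and returns one list of quantiles per column, so each percentile can be picked out by position:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quantiles").getOrCreate()

    df = spark.createDataFrame(
        [(3310, "1/15/2018", 0.010680705, 6.0, 0.019875458),
         (3310, "1/15/2018", 0.006524853, 2.0, 0.015818309)],
        ["id", "date", "total_revenue", "con_dist_1", "con_dist_2"],
    )

    cols = ["total_revenue", "con_dist_1", "con_dist_2"]
    probs = [0.0, 0.25, 0.5, 0.75, 1.0]

    # one list of quantiles per column, in the same order as `cols`;
    # relativeError=0.0 asks for exact (but more expensive) quantiles
    quantiles = df.approxQuantile(cols, probs, 0.0)

    for col, qs in zip(cols, quantiles):
        print(col, dict(zip(["p0", "p25", "p50", "p75", "p100"], qs)))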