statistics

Inverse Document Frequency Formula

Submitted by 谁说胖子不能爱 on 2020-06-15 07:25:38
Question: I'm having trouble manually calculating tf-idf values: Python's scikit-learn keeps producing different values than I expect. I keep reading that idf(term) = log(# of docs / # of docs containing the term). If so, won't you get a divide-by-zero error when no document contains the term? To solve that problem, I read that you use log(# of docs / (# of docs containing the term + 1)). But then, if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me. What am I missing?
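This is not part of the original question, but a minimal sketch of why scikit-learn's numbers differ: with the default smooth_idf=True, TfidfVectorizer uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, which avoids division by zero and never goes negative (assumes scikit-learn >= 1.0 for get_feature_names_out):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]

    vec = TfidfVectorizer(smooth_idf=True, norm=None)
    vec.fit(docs)

    terms = vec.get_feature_names_out()
    n = len(docs)
    # document frequency: number of docs containing each term
    df = np.array([sum(term in doc.split() for doc in docs) for term in terms])

    # scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
    manual_idf = np.log((1 + n) / (1 + df)) + 1

    print(dict(zip(terms, np.round(manual_idf, 3))))
    print(dict(zip(terms, np.round(vec.idf_, 3))))  # matches the manual values

With this smoothing, a term like "the" that appears in every document gets idf = ln(4/4) + 1 = 1 rather than a negative value.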

How to use `hclust` function from R in Python via Rpy2 (v3)?

Submitted by ≡放荡痞女 on 2020-06-01 04:22:11
Question: There are a lot of changes between rpy2 v2 and v3. I'm porting my code and patching up some compatibility issues. One thing I can't figure out is how to get hclust to work, specifically from the fastcluster package, but I can't even get base hclust to work. A few things I do not understand: (1) Should I use R["as.dist"](rkernel) or R("as.dist")(rkernel)? (2) Why does this return a NumPy array when I'm calling it within R? (3) How can I get this dissimilarity object to work with hclust and
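Not from the original post, but a minimal rpy2 v3 sketch of one way this can work, assuming rkernel is a symmetric NumPy matrix (newer rpy2 releases may prefer conversion.get_conversion() over ro.conversion). R["as.dist"] looks the function up by name while R("as.dist") evaluates the string; both give you the same R function. The key step is converting the NumPy matrix to an R matrix so that as.dist returns a real R "dist" object that hclust accepts:

    import numpy as np
    import rpy2.robjects as ro
    from rpy2.robjects import numpy2ri
    from rpy2.robjects.conversion import localconverter
    from rpy2.robjects.packages import importr

    stats = importr("stats")

    # small symmetric matrix standing in for rkernel
    rkernel = np.array([[0.0, 0.3, 0.8],
                        [0.3, 0.0, 0.5],
                        [0.8, 0.5, 0.0]])

    as_dist = ro.r["as.dist"]  # equivalent to ro.r("as.dist")

    # convert NumPy -> R matrix only inside this block, so later results
    # stay R objects instead of being converted back to NumPy arrays
    with localconverter(ro.default_converter + numpy2ri.converter):
        r_matrix = ro.conversion.py2rpy(rkernel)

    d = as_dist(r_matrix)                  # an R 'dist' object
    hc = stats.hclust(d, method="average")
    print(ro.r["cutree"](hc, k=2))

If results come back as NumPy arrays, it is usually because numpy2ri was activated globally, which converts every R return value; keeping the conversion inside localconverter avoids that.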

Error in Friedman test

Submitted by 岁酱吖の on 2020-05-30 03:29:27
Question: Good morning, I wanted to run a Friedman test (from the "stats" package) on my data about dandelion leaves, but an error is displayed. My data has the form:

    > str(mi)
    'data.frame': 4393 obs. of 18 variables:
     $ OS_Gatunek    : Factor w/ 5 levels "Taraxacum ancistrolobum",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ PH_CreateDate : Factor w/ 15 levels "2016-04-06","2016-04-19",..: 2 2 2 2 2 2 2 2 2 2 ...
     $ L_Dl          : num 7.91 8.96 10.18 10.09 9.4 ...
     $ L_SzerMaksOs  : num 1.93 3.98 3.12 4.04 2.75 2.69 3.69 3.23 2
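Not part of the original thread, but as an illustration of the layout a Friedman test expects (an unreplicated complete block design: exactly one measurement per block-and-treatment combination), here is a small Python sketch using scipy.stats.friedmanchisquare with made-up data:

    import numpy as np
    from scipy.stats import friedmanchisquare

    # hypothetical example: 10 leaves (blocks), each measured under 3 conditions (treatments)
    rng = np.random.default_rng(0)
    cond_a = rng.normal(10.0, 1.0, size=10)
    cond_b = rng.normal(11.0, 1.0, size=10)
    cond_c = rng.normal(10.5, 1.0, size=10)

    # one array per treatment, aligned by block; repeated or missing
    # block/treatment combinations are what typically make R's
    # friedman.test complain as well
    stat, p = friedmanchisquare(cond_a, cond_b, cond_c)
    print(stat, p)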

how to sample from an upside down bell curve

Submitted by 心不动则不痛 on 2020-05-29 06:50:22
Question: I can generate numbers with a uniform distribution by using the code below: runif(1, min = 10, max = 20). How can I instead sample randomly generated numbers that fall more frequently close to the minimum and maximum boundaries (i.e. an "upside-down bell curve")? Answer 1: Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try a Beta distribution and map it to the desired interval, along the lines of:

    min <- 1
    max <- 20
    q <- min + (max - min) * rbeta(10000, 0.5, 0.5)

As @Gregor-reinstateMonica
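For reference (my own addition, not from the thread), the same idea in Python/NumPy: a Beta(0.5, 0.5) draw is U-shaped on (0, 1), so rescaling it to [10, 20] piles samples up near both boundaries:

    import numpy as np

    rng = np.random.default_rng(42)

    lo, hi = 10, 20
    # Beta(0.5, 0.5) is U-shaped; rescale it to [lo, hi]
    samples = lo + (hi - lo) * rng.beta(0.5, 0.5, size=10_000)

    # sanity check: counts should be highest in the outermost bins
    counts, _ = np.histogram(samples, bins=10, range=(lo, hi))
    print(counts)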

Django & Postgres - percentile (median) and group by

Submitted by 微笑、不失礼 on 2020-05-28 07:27:46
Question: I need to calculate period medians per seller ID (see the simplified model below). The problem is that I am unable to construct the ORM query.

Model:

    class MyModel:
        period = models.IntegerField(null=True, default=None)
        seller_ids = ArrayField(models.IntegerField(), default=list)
        aux = JSONField(default=dict)

Query:

    queryset = (
        MyModel.objects.filter(period=25)
        .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
        .values("seller_id")
        .annotate(
            duration=Cast(KeyTextTransform("duration", "aux"),
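One common way to express a median in the Django ORM on PostgreSQL (a sketch of the usual pattern, not the accepted answer; the Median class and median_duration name are mine, and the KeyTextTransform import location varies by Django version) is a custom Aggregate that renders PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...):

    from django.db.models import Aggregate, F, FloatField, Func
    from django.db.models.fields.json import KeyTextTransform  # pre-3.1: django.contrib.postgres.fields.jsonb
    from django.db.models.functions import Cast


    class Median(Aggregate):
        # renders: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY <expression>)
        function = "PERCENTILE_CONT"
        name = "median"
        output_field = FloatField()
        template = "%(function)s(0.5) WITHIN GROUP (ORDER BY %(expressions)s)"


    # MyModel as defined in the question
    queryset = (
        MyModel.objects.filter(period=25)
        .annotate(seller_id=Func(F("seller_ids"), function="unnest"))
        .values("seller_id")  # group by the unnested seller_id
        .annotate(
            median_duration=Median(
                Cast(KeyTextTransform("duration", "aux"), FloatField())
            )
        )
    )

Depending on the Django version, grouping by an annotation produced by unnest may still need a subquery or raw SQL, so treat this as a starting point rather than a drop-in answer.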

Django & Postgres - percentile (median) and group by

别来无恙 提交于 2020-05-28 07:27:27
问题 I need to calculate period medians per seller ID (see simplyfied model below). The problem is I am unable to construct the ORM query. Model class MyModel: period = models.IntegerField(null=True, default=None) seller_ids = ArrayField(models.IntegerField(), default=list) aux = JSONField(default=dict) Query queryset = ( MyModel.objects.filter(period=25) .annotate(seller_id=Func(F("seller_ids"), function="unnest")) .values("seller_id") .annotate( duration=Cast(KeyTextTransform("duration", "aux"),

Reverse Box-Cox transformation

Submitted by 隐身守侯 on 2020-05-24 21:13:08
Question: I am using SciPy's boxcox function to perform a Box-Cox transformation on a continuous variable.

    from scipy.stats import boxcox
    import numpy as np

    y = np.random.random(100)
    y_box, lambda_ = boxcox(y + 1)  # add 1 to be able to transform 0 values

Then, I fit a statistical model to predict the values of this Box-Cox transformed variable. The model predictions are on the Box-Cox scale and I want to transform them back to the original scale of the variable. from sklearn.ensemble import
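A sketch of the usual way back (my own addition, assuming the predictions are on the shifted y + 1 Box-Cox scale): scipy.special.inv_boxcox inverts the transform for the fitted lambda, after which the added 1 is subtracted again:

    import numpy as np
    from scipy.special import inv_boxcox
    from scipy.stats import boxcox

    y = np.random.random(100)
    y_box, lambda_ = boxcox(y + 1)      # forward transform on the shifted variable

    # pretend these are model predictions on the Box-Cox scale
    predictions_box = y_box.copy()

    # invert the Box-Cox transform with the same lambda, then undo the +1 shift
    predictions = inv_boxcox(predictions_box, lambda_) - 1

    print(np.allclose(predictions, y))  # True for this round-trip check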

How to find the mean and standard deviation of rows in dataframes with some having NAs and others not

Submitted by 点点圈 on 2020-05-24 07:35:12
Question: My lab has separate groups for parents and children in the study, and we have the data collected in one data frame right now. There are specific questions asked of children and some asked of parents. We have named them SCAREDC (scared child) and SCAREDP (scared parent) respectively. Naturally, SCAREDC will have NAs for the parents and SCAREDP will have NAs for the children in the data frame. Currently, my data frame looks like this:

    head(child_parent_total)
      familySID time SCAREDC1 SCAREDC2 SCAREDC3
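Not from the thread, but to illustrate the row-wise idea in Python/pandas (in R the equivalent would be rowMeans and apply with na.rm = TRUE over the SCAREDC columns): take means and standard deviations across only the SCAREDC* columns while skipping NAs, so parent rows simply come out as NaN. The column names below are made up to mirror the question:

    import numpy as np
    import pandas as pd

    # hypothetical data: children answer SCAREDC*, parents answer SCAREDP*
    df = pd.DataFrame({
        "familySID": [1, 1, 2, 2],
        "SCAREDC1": [2.0, np.nan, 3.0, np.nan],
        "SCAREDC2": [1.0, np.nan, 2.0, np.nan],
        "SCAREDP1": [np.nan, 4.0, np.nan, 5.0],
    })

    scaredc = df.filter(regex=r"^SCAREDC")   # just the child columns

    # skipna=True is the default, so NAs (the parent rows) are ignored per row
    df["SCAREDC_mean"] = scaredc.mean(axis=1)
    df["SCAREDC_sd"] = scaredc.std(axis=1)

    print(df)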

How to use the spark stats?

Submitted by ⅰ亾dé卋堺 on 2020-05-17 06:54:31
Question: I'm using spark-sql-2.4.1v, and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of my given data. As I am computing multiple percentiles, how do I retrieve each calculated percentile from the results? Here is an example, with data as shown below:

    +----+---------+-------------+----------+-----------+
    |  id|     date|total_revenue|con_dist_1| con_dist_2|
    +----+---------+-------------+----------+-----------+
    |3310|1/15/2018|  0.010680705|         6|0.019875458|
    |3310|1/15/2018| 0
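A PySpark sketch of one way to get several percentiles per column in a single call (my own illustration, not the accepted answer; the second sample row is made up because the excerpt is truncated): DataFrame.approxQuantile accepts a list of columns and a list of probabilities and returns one list of quantiles per column, so each percentile can be picked out by position:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quantiles").getOrCreate()

    df = spark.createDataFrame(
        [(3310, "1/15/2018", 0.010680705, 6.0, 0.019875458),
         (3310, "1/15/2018", 0.006524853, 2.0, 0.015818309)],
        ["id", "date", "total_revenue", "con_dist_1", "con_dist_2"],
    )

    cols = ["total_revenue", "con_dist_1", "con_dist_2"]
    probs = [0.0, 0.25, 0.5, 0.75, 1.0]

    # one list of quantiles per column, in the same order as `cols`;
    # relativeError=0.0 asks for exact (but more expensive) quantiles
    quantiles = df.approxQuantile(cols, probs, 0.0)

    for col, qs in zip(cols, quantiles):
        print(col, dict(zip(["p0", "p25", "p50", "p75", "p100"], qs)))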