statistics

Relative frequency in R by factor

Posted by Deadly on 2020-01-05 08:01:13
Question: I would like to get a table of the top 10 absolute and relative frequencies for a variable across another factor variable. I have a data frame with 3 columns: the first is a factor variable, the second is the variable I need to count, and the third is a logical variable used as a constraint. (The real database has more than 4 million observations.)
    dtf <- data.frame(c("a","a","b","c","b"), c("aaa","bbb","aaa","aaa","bbb"), c(TRUE,FALSE,TRUE,TRUE,TRUE))
    colnames(dtf) <- c("factor","var","log")
    dtf
      factor var   log
    1      a aaa  TRUE
    2      a bbb FALSE
    3
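The excerpt above is cut off before the desired output is shown, so the following is only a minimal sketch of one reading of the task, written in Python rather than R. It assumes that rows are first filtered on the logical column and that relative frequency means the share of each `var` value within its factor level. The function name `top_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict

# Rows mirror the question's example data frame: (factor, var, log).
rows = [
    ("a", "aaa", True),
    ("a", "bbb", False),
    ("b", "aaa", True),
    ("c", "aaa", True),
    ("b", "bbb", True),
]

def top_frequencies(rows, n=10):
    """Return {factor: [(var, count, share), ...]} using only rows where log is True.

    Assumption: 'share' is the count divided by the number of kept rows
    in the same factor level. Other denominators are equally plausible.
    """
    counts = defaultdict(Counter)
    for factor, var, log in rows:
        if log:  # apply the logical constraint
            counts[factor][var] += 1

    result = {}
    for factor, counter in counts.items():
        total = sum(counter.values())
        # most_common(n) keeps the n most frequent values of var
        result[factor] = [(var, c, c / total) for var, c in counter.most_common(n)]
    return result

freq = top_frequencies(rows)
for factor in sorted(freq):
    print(factor, freq[factor])
```

With several million rows a single pass like this stays linear in the number of rows, since only the per-factor counters are held in memory.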

How to calculate with the Poisson-Distribution in Matlab?

Posted by 霸气de小男生 on 2020-01-05 06:57:21
Question: I’ve used Excel in the past, but the calculations involving the Poisson distribution took a while, which is why I switched to SQL. I soon realized that SQL might not be a proper solution for statistical problems. Finally I decided to switch to Matlab, but I’m not used to it at all. My problem is the following: I’ve imported a .csv table and have two columns of values, let’s say A and B (110 x 1 double). Both columns are the input values for my Poisson calculations. Since I
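The question is truncated before it says how A and B enter the calculation. As a hedged illustration only, here is a Python sketch of the Poisson probability mass function applied row by row to two columns. Treating one column as the rate and the other as the observed count is an assumption, and `rates` and `counts` are hypothetical stand-ins for A and B.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    if k < 0 or lam < 0:
        raise ValueError("k and lam must be non-negative")
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    # log P = k*log(lam) - lam - log(k!)
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Hypothetical stand-ins for the two imported columns (assumed roles).
rates = [1.2, 0.8, 2.5]   # assumed: expected values (lambda)
counts = [0, 1, 3]        # assumed: observed counts (k)

probabilities = [poisson_pmf(k, lam) for k, lam in zip(counts, rates)]
print(probabilities)
```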

MySQL store checksum of tables in another table

Posted by 一曲冷凌霜 on 2020-01-05 04:05:19
Question: CONTEXT: We have big databases with loads of tables. Most of them (99%) use InnoDB. We want a daily process that monitors which tables have been modified. Because they use InnoDB, the value of Update_time from SHOW table STATUS from information_schema; is NULL. For that reason we want to create a daily procedure that stores the checksum (and other things, for that matter) of each table somewhere, preferably in another table. We will then run different checks against it. PROBLEM: I'm trying to

Post-Hoc tests for chi-sq in R

Posted by 情到浓时终转凉″ on 2020-01-05 01:09:14
Question: I have a table that looks like this.
    > dput(theft_loc)
    structure(c(13704L, 14059L, 14263L, 14450L, 14057L, 15503L, 14230L, 16758L, 15289L, 15499L, 16066L, 15905L, 18531L, 19217L, 12410L, 13398L, 13308L, 13455L, 13083L, 14111L, 13068L, 19569L, 18771L, 19626L, 20290L, 19816L, 20923L, 20466L, 20517L, 19377L, 20035L, 20504L, 20393L, 22409L, 22289L, 7997L, 8106L, 7971L, 8437L, 8246L, 9090L, 8363L, 7934L, 7874L, 7909L, 8150L, 8191L, 8746L, 8277L, 27194L, 25220L, 26034L, 27080L, 27334L, 30819L,

vectorized indexing/slicing in numpy/scipy?

Posted by 南楼画角 on 2020-01-04 07:51:19
Question: I have an array A and a list of slicing indices (s, t); let's call this list L. I want to find the 85th percentile of A[s1:t1], A[s2:t2], ... Is there a way to vectorize these operations in numpy? The following looks cumbersome:
    ans = []
    for (s, t) in L:
        ans.append(numpy.percentile(A[s:t], 85))
Thanks a lot! PS: It's safe to assume s1 < s2 < ... and t1 < t2 < ... This is really just a sliding-window percentile problem.
Answer 1: Given that you're dealing with a non-uniform interval (i.e. the slices aren't
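The answer is cut off, so its recommendation is not reproduced here. For reference, the question's baseline loop can be written as a small self-contained function. This is a sketch and not a vectorized solution: it still calls `numpy.percentile` once per slice. The name `slice_percentiles` and the sample data are hypothetical.

```python
import numpy as np

def slice_percentiles(a, slices, q=85):
    """Return the q-th percentile of a[s:t] for each (s, t) pair in slices."""
    a = np.asarray(a)
    # One np.percentile call per slice; slices may have different lengths.
    return np.array([np.percentile(a[s:t], q) for s, t in slices])

# Hypothetical sample data with increasing start and end indices.
A = np.arange(10.0)
L = [(0, 5), (2, 10)]
ans = slice_percentiles(A, L)
print(ans)
```

Because the slices may differ in length, they cannot be stacked into one rectangular array, which is why the per-slice call remains in this sketch.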

Calculate pvalue from pandas DataFrame

Posted by *爱你&永不变心* on 2020-01-04 03:58:08
Question: I have a DataFrame stats with a MultiIndex, 8 samples (only two shown here), and 8 genes for each sample.
    In[13]: stats
    Out[13]:
                  ARG/16S                                            \
                    count          mean           std           min
    sample gene
    Arnhem IC        11.0  2.319050e-03  7.396130e-04  1.503150e-03
           Int1      11.0  7.243040e+00  6.848327e+00  1.364879e+00
           Sul1      11.0  3.968956e-03  9.186019e-04  2.499074e-03
           TetB       2.0  1.154748e-01  1.627663e-01  3.816936e-04
           TetM       4.0  1.083125e-04  5.185259e-05  5.189226e-05
           blaOXA     4.0  4.210963e-06  3.783235e-07  3.843571e-06
           ermB       4.0  4.111081e-05  7

What's the best way of implementing a 'popular content' display?

Posted by 和自甴很熟 on 2020-01-04 03:54:11
Question: How do I show a list of the most popular (articles|posts|whatever) for a period such as the past day? (Essentially, replicate the functionality of the Radioactivity Drupal module.)
Answer 1: Here's what I would do: If you're not already using it, sign up for Google Analytics and add the Google Analytics JavaScript to each of your pages. This will track view counts for you. Using the Google Data API library, fetch the information you want. For example, you could ask for the most popular pages on your site in the

How to create a discrete normal distribution in R?

Posted by ╄→尐↘猪︶ㄣ on 2020-01-03 20:57:57
Question: I am trying to create a discrete normal distribution using something such as
    x <- rnorm(1000, mean = 350, sd = 20)
but I don't think the rnorm function has a built-in "discrete numbers only" option. I have spent a few hours searching Stack Overflow, Google and the R documentation but have yet to find anything.
Answer 1: Obviously, there is no discrete normal distribution, since the normal distribution is continuous by definition. However, as mentioned here (Wikipedia is not the best possible source but this is
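The answer is truncated, so the approach it goes on to describe is not known. One simple option, shown here as a Python sketch and not as the answer's method, is to draw continuous normal values and round them to the nearest integer. The function name `discrete_normal_sample` and the seed are hypothetical; the parameters 1000, 350 and 20 come from the question.

```python
import random

def discrete_normal_sample(n, mean, sd, seed=None):
    """Draw n normal variates and round each to the nearest integer.

    Assumption: rounding is an acceptable discretization. The result only
    approximates a normal shape on the integers.
    """
    rng = random.Random(seed)  # local generator so the seed is reproducible
    return [round(rng.gauss(mean, sd)) for _ in range(n)]

x = discrete_normal_sample(1000, mean=350, sd=20, seed=42)
print(x[:10])
```

Rounding adds a small amount of variance (about 1/12 for unit-width bins), which is negligible next to a standard deviation of 20 but may matter for narrow distributions.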