statistics

R - ggplot2 - Visualize deviations from base model

故事扮演 submitted on 2019-12-24 14:06:40
Question:

    base = c(1.84,3.92,1.67,1.12,1.63,.62,.59)
    e1 = c(.61,1.47,1.68,1.95,1.64,.61,.72)
    e2 = c(.64,7.08,1.67,1.12,1.44,.46,.76)
    e3 = c(.64,4.47,1.68,2.04,1.45,.4,1.35)
    e4 = c(.78,1.61,1.62,1.09,1.46,.66,.76)
    e5 = c(.78,.99,1.62,2.32,1.46,.73,.52)
    df = data.frame(base,e1,e2,e3,e4,e5)

These are the parameters from a baseline model and 5 other exploratory models. I want to do as much of the work as possible for the reader, so I'm thinking about going beyond tabulating the values. Is there a way to plot
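
One possible first step toward such a plot is to express each exploratory model as a deviation from the baseline. The sketch below uses plain Python rather than the question's R. The names `models` and `deviations` are illustrative assumptions and do not appear in the original post.

```python
# Parameter values copied from the question.
base = [1.84, 3.92, 1.67, 1.12, 1.63, 0.62, 0.59]
models = {
    "e1": [0.61, 1.47, 1.68, 1.95, 1.64, 0.61, 0.72],
    "e2": [0.64, 7.08, 1.67, 1.12, 1.44, 0.46, 0.76],
    "e3": [0.64, 4.47, 1.68, 2.04, 1.45, 0.40, 1.35],
    "e4": [0.78, 1.61, 1.62, 1.09, 1.46, 0.66, 0.76],
    "e5": [0.78, 0.99, 1.62, 2.32, 1.46, 0.73, 0.52],
}

# Deviation of each exploratory parameter from its baseline counterpart,
# rounded to the two decimals used in the source data.
deviations = {
    name: [round(value - b, 2) for value, b in zip(values, base)]
    for name, values in models.items()
}

for name, row in deviations.items():
    print(name, row)
```

Signed differences like these could then be drawn as bars around a zero line, one panel per model, so that the reader sees at a glance which parameters moved and in which direction.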

Ask to set the working directory in R Studio - multiple users working with the same R script

人盡茶涼 submitted on 2019-12-24 14:06:11
Question: Three of us use the same R script for our research project in R Studio. This causes problems when setting the working directory, because the script and the data sheets are saved locally in everyone's Dropbox folder. We use the same script and the same data, but the path to the working directory is, for example, 'C:/Users/thoma/Dropbox/...' in my case. I can set the working directory with setwd("directory") at the beginning of our code, but this works only for me. My question: Is there a

How do I quantize data in pandas?

喜夏-厌秋 submitted on 2019-12-24 13:22:45
Question: I have a DataFrame like this:

    a = pd.DataFrame(np.random.random((5, 5)), columns=['col1','col2','col3','col4','col5'])

I'd like to quantize a specific column, say col4, according to a set of thresholds (the corresponding output could be an integer from 0 to the number of levels). Is there an API for that?

Answer 1: Most pandas objects are compatible with numpy functions. I would use numpy.digitize:

    import numpy as np
    import pandas as pd
    a = pd.DataFrame(np.random.random((5, 5)), columns=['col1','col2','col3','col4',
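
The thresholding idea behind the answer can be sketched with the Python standard library alone. For increasing thresholds, `bisect_right` yields the same bin index that `numpy.digitize` returns with its default `right=False`. The function name `quantize`, the column values, and the thresholds below are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def quantize(values, thresholds):
    """Map each value to the index of the bin it falls in.

    `thresholds` must be sorted in increasing order. A value below the first
    threshold maps to 0; a value at or above the last maps to len(thresholds).
    """
    return [bisect_right(thresholds, v) for v in values]


# Hypothetical stand-ins for the values of col4 and for the thresholds.
col4 = [0.05, 0.30, 0.55, 0.75, 0.99]
thresholds = [0.25, 0.50, 0.75]

levels = quantize(col4, thresholds)
print(levels)
```

With three thresholds the output levels run from 0 to 3, which matches the "integer from 0 to the number of levels" described in the question.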

how to understand the chi square contingency table

Deadly submitted on 2019-12-24 09:25:33
Question: I have a few categorical features: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

    from scipy.stats import chi2_contingency
    chi2, p, dof, expected = chi2_contingency((pd.crosstab(df.Gender, df.Married).values))
    print (f'Chi-square Statistic : {chi2} ,p-value: {p}')

Output:

    Chi-square Statistic : 79.63562874824729 ,p-value: 4.502328957824834e-19

How can I tell from these statistics whether the features are independent of each other? I am trying to build a
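
As a rough guide to reading such output: the null hypothesis of the test is that the two features are independent, so a p-value below the chosen significance level (0.05 is a common convention) counts as evidence against independence. The p-value in the question, about 4.5e-19, is far below that level. The sketch below recomputes the statistic by hand for a 2x2 table. The counts are hypothetical, and no continuity correction is applied, whereas `chi2_contingency` applies Yates' correction by default when the table is 2x2, so the numbers would not match SciPy exactly.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    No continuity correction is applied. Returns (statistic, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, for which the upper tail of the
    # chi-square distribution has this closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


observed = [[30, 10], [10, 30]]  # hypothetical counts, not the question's data
chi2, p = chi_square_2x2(observed)

alpha = 0.05  # conventional significance level, an assumption
reject_independence = p < alpha
print(f"chi2={chi2:.2f}, p={p:.2e}, reject independence: {reject_independence}")
```

Failing to reject does not prove independence; it only means the data show no detectable association at the chosen level.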

How to get NA values instead of a “data are essentially constant” error in t.test in R

与世无争的帅哥 submitted on 2019-12-24 08:17:31
Question: I have a large dataset with data from two groups. I want to compare the groups using t.test and get a list of p-values for all the columns starting with F_, but because of the data in some columns, when I use my code:

    TP_FN_ttest <- Map(t.test, x = TP[,grepl(paste0("^F_"),colnames(TP))], y = FN[,grepl(paste0("^F_"),colnames(FN))])
    TP_FN_ttest.pval <- as.data.frame(sapply(TP_FN_ttest, '[[', 'p.value'))

I get an error:

    Error in t.test.default(x = dots[[1L]][[508L]], y = dots[[2L]][[508L]]) : data are

Python 2.7 - Calculate quantiles per row

做~自己de王妃 submitted on 2019-12-24 08:03:09
Question: I have a pandas series like this:

    0 1787
    1 4789
    2 1350
    3 1476
    4 0
    5 747
    6 307
    7 147
    8 221
    9 -88
    10 9374
    11 264
    12 1109
    13 502
    14 360
    15 194
    16 4073
    17 2317
    18 -221
    20 0
    21 16
    22 106
    29 105
    30 4189
    31 171
    32 42

I want to create 4 one-hot encoded variables that indicate which quartile each row's value falls in, dividing the series into 4 quartiles. It would be something like this:

    0 1787 Q1 Q2 Q3 Q4
    1 4789 0 0 0 0
    2 1350 0 0 0 1
    3 1476 1 0 0 0
    4 0 0 1 0 0
    5 747 0 0 1 0
    6 307 1 0 1 0
    7 147 0 1
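
The quartile encoding described above can be sketched without pandas, assuming Python 3.8 or later for `statistics.quantiles` (the question itself targets Python 2.7). The function name `quartile_one_hot` is hypothetical. The cut points follow the default convention of `statistics.quantiles`, which may differ slightly from the one pandas uses, for example in `pd.qcut`.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_one_hot(values):
    """Return one [Q1, Q2, Q3, Q4] indicator row per value."""
    cuts = quantiles(values, n=4)  # the three quartile cut points
    rows = []
    for v in values:
        # Index of the quartile: 0 for Q1 up to 3 for Q4. A value equal to a
        # cut point is assigned to the lower quartile.
        q = bisect_left(cuts, v)
        rows.append([1 if q == k else 0 for k in range(4)])
    return rows


# First eight values of the series in the question.
sample = [1787, 4789, 1350, 1476, 0, 747, 307, 147]
encoded = quartile_one_hot(sample)
for value, row in zip(sample, encoded):
    print(value, row)
```

Each row contains exactly one 1, so the four columns can be used directly as indicator variables.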

All binary predictors in a classification task

天大地大妈咪最大 submitted on 2019-12-24 07:24:21
Question: I am performing my analysis in R and will be implementing four algorithms:

1. RF
2. Log Reg
3. SVM
4. LDA

I have 50 predictors and 1 target variable. All my predictors and the target variable are binary, only 0s and 1s. I have the following questions: Should I convert them all into factors? After converting them into factors, applying the RF algorithm gives 100% accuracy, which surprises me. Also, for the other algorithms, how should I prepare my variables before

How to compute the p-value in hypothesis testing (linear regression)

自闭症网瘾萝莉.ら submitted on 2019-12-24 07:13:03
Question: Currently I'm working on an awk script to do some statistical analysis on measurement data. I'm using linear regression to get parameter estimates, standard errors etc. and would also like to compute the p-value for a null-hypothesis test (t-test). This is my script so far, any idea how to compute the p-value?

    BEGIN {
        ybar = 0.0
        xbar = 0.0
        n = 0
        a0 = 0.0
        b0 = 0.0
        qtinf0975 = 1.960 # 5% n = inf
    }
    {
        # y_i is in $1, x_i has to be counted
        n = n + 1
        yi[n] = $1*1.0
        xi[n] = n*1.0
    }
    END {
        for ( i = 1
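
The computation asked about can be sketched in Python under one loudly stated assumption: the script's constant `qtinf0975 = 1.960` is the normal quantile, that is, the t quantile for infinite degrees of freedom, so the sketch uses the same large-sample normal approximation for the p-value. For small n, an exact p-value would need the Student t distribution with n - 2 degrees of freedom, which the standard library does not provide. The function name `slope_test` and the measurements are hypothetical.

```python
import math


def slope_test(y):
    """Fit y_i = a + b * i for i = 1..n and test H0: b = 0.

    Returns (b, se_b, t, p). The p-value is two-sided and uses the normal
    approximation to the t distribution (an assumption valid for large n).
    A perfect fit gives a zero standard error and raises ZeroDivisionError.
    """
    n = len(y)
    x = range(1, n + 1)  # x_i is the line count, as in the awk script
    xbar = sum(x) / n
    ybar = sum(y) / n

    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope estimate
    a = ybar - b * xbar      # intercept estimate

    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    t = b / se_b

    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|t|)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return b, se_b, t, p


measurements = [1.1, 1.9, 3.2, 3.9, 5.1]  # hypothetical data
b, se_b, t, p = slope_test(measurements)
print(f"slope={b:.3f}, se={se_b:.4f}, t={t:.2f}, p={p:.3g}")
```

The same statistic could be computed in awk; only the final tail probability needs a numerical routine that awk lacks out of the box.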

How can I get aov to show me the F-statistic and p-value?

南笙酒味 submitted on 2019-12-24 07:04:06
Question: The following script

    #!/usr/bin/Rscript --vanilla
    x <- c(4.5,6.4,7.2,6.7,8.8,7.8,9.6,7.0,5.9,6.8,5.7,5.2)
    fertilizer <- factor(c('A','A','A','A','B','B','B','B','C','C','C','C'))
    crop <- factor(c('I','II','III','IV','I','II','III','IV','I','II','III','IV'))
    av <- aov(x~fertilizer*crop)
    summary(av)

yields

                    Df  Sum Sq Mean Sq
    fertilizer       2 13.6800  6.8400
    crop             3  2.8200  0.9400
    fertilizer:crop  6  6.5800  1.0967

For other data, aov usually gives the F-statistic and associated p-value. What is wrong
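
One reading of this output, offered as a sketch rather than as the original answer: the design has a single observation for each fertilizer and crop combination, so the model with the interaction term uses up every available degree of freedom and leaves none for the residual. Without a residual mean square there is no denominator for an F-statistic. The bookkeeping below is in Python with hypothetical variable names.

```python
# Counts taken from the script in the question.
n_obs = 12             # length of x
levels_fertilizer = 3  # A, B, C
levels_crop = 4        # I, II, III, IV

df_fertilizer = levels_fertilizer - 1        # 2, as in the printed table
df_crop = levels_crop - 1                    # 3, as in the printed table
df_interaction = df_fertilizer * df_crop     # 6, as in the printed table

# Total degrees of freedom minus everything the model terms consume.
df_residual = (n_obs - 1) - (df_fertilizer + df_crop + df_interaction)
print("residual degrees of freedom:", df_residual)
```

Under this reading, an additive model without the interaction term would leave 6 residual degrees of freedom, enough to form F-statistics for the two main effects.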

Storing And Displaying Stats

烈酒焚心 submitted on 2019-12-24 06:49:53
Question: I am going to write some software in PHP to parse log files, aggregate the data, and display it in graphs (bar graphs and the like, not vertices and edges). Yes, it's basically business intelligence software, which my company has an entire team for, but apparently they don't do a great job (10 minutes to load a page just doesn't cut it). Here is what I have to do: Log files are data files that store the raw data from a stats server we have set up, running from our office (we send asynchronous