statistics | 易学教程

R - ggplot2 - Visualize deviations from base model

Question:

    base = c(1.84,3.92,1.67,1.12,1.63,.62,.59)
    e1 = c(.61,1.47,1.68,1.95,1.64,.61,.72)
    e2 = c(.64,7.08,1.67,1.12,1.44,.46,.76)
    e3 = c(.64,4.47,1.68,2.04,1.45,.4,1.35)
    e4 = c(.78,1.61,1.62,1.09,1.46,.66,.76)
    e5 = c(.78,.99,1.62,2.32,1.46,.73,.52)
    df = data.frame(base,e1,e2,e3,e4,e5)

I have the parameters above from a baseline model and 5 other exploratory models. I'm trying to do as much of the work as possible for the reader, so I'm thinking about going beyond tabulating this. Is there a way to plot
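
As a possible starting point, here is a minimal sketch in Python with pandas (not the ggplot2 the question asks about) that computes each exploratory model's deviation from the baseline, which is the quantity a deviation plot would display. The values are copied from the question; the names `deviations`, `long_form`, `parameter`, `model` and `deviation` are illustrative assumptions.

```python
import pandas as pd

# Parameter values copied from the question.
df = pd.DataFrame({
    "base": [1.84, 3.92, 1.67, 1.12, 1.63, 0.62, 0.59],
    "e1": [0.61, 1.47, 1.68, 1.95, 1.64, 0.61, 0.72],
    "e2": [0.64, 7.08, 1.67, 1.12, 1.44, 0.46, 0.76],
    "e3": [0.64, 4.47, 1.68, 2.04, 1.45, 0.40, 1.35],
    "e4": [0.78, 1.61, 1.62, 1.09, 1.46, 0.66, 0.76],
    "e5": [0.78, 0.99, 1.62, 2.32, 1.46, 0.73, 0.52],
})

# Subtract the baseline column from every exploratory column, row by row.
deviations = df.drop(columns="base").sub(df["base"], axis=0)

# Long format: one row per (parameter, model) pair, which is the shape
# most plotting libraries expect. Column names are illustrative.
long_form = (
    deviations.rename_axis("parameter")
    .reset_index()
    .melt(id_vars="parameter", var_name="model", value_name="deviation")
)
print(long_form.head())
```

Positive deviations mean the exploratory estimate is above the baseline, negative ones mean it is below.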

Ask to set the working directory in R Studio - multiple users working with the same R script

Question: We are three people using the same R script to work on our research project in R Studio. This causes some issues when setting the working directory, because the file and the data sheets are saved locally in everyone's Dropbox folder. So we use the same script and the same data, but the path to the working directory is, for example, 'C:/Users/thoma/Dropbox/...' in my case. I can set the working directory with setwd("directory") at the beginning of our code, but this works only for me. My question: Is there a

How do I quantize data in pandas?

Question: I have a DataFrame like this:

    a = pd.DataFrame(np.random.random((5, 5)), columns=['col1','col2','col3','col4','col5'])

I'd like to quantize a specific column, say col4, according to a set of thresholds (the corresponding output could be an integer from 0 to the number of levels). Is there an API for that?

Answer 1: Most pandas objects are compatible with numpy functions. I would use numpy.digitize:

    import numpy as np
    import pandas as pd
    a = pd.DataFrame(np.random.random((5, 5)), columns=['col1','col2','col3','col4',
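
Since the answer is cut off, here is a minimal sketch of how numpy.digitize could be applied to one column. The thresholds, the seed and the `col4_level` column name are assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
a = pd.DataFrame(
    rng.random((5, 5)),
    columns=["col1", "col2", "col3", "col4", "col5"],
)

# Hypothetical thresholds; they must be monotonic.
thresholds = [0.25, 0.5, 0.75]

# np.digitize returns, for each value, the index of the bin it falls in:
# 0 for values below thresholds[0], len(thresholds) for values at or
# above thresholds[-1].
a["col4_level"] = np.digitize(a["col4"].to_numpy(), thresholds)

print(a[["col4", "col4_level"]])
```

With three thresholds the levels run from 0 to 3, which matches the "integer from 0 to the number of levels" output the question describes.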

how to understand the chi square contingency table

Question: I have a few categorical features: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']

    from scipy.stats import chi2_contingency
    chi2, p, dof, expected = chi2_contingency((pd.crosstab(df.Gender, df.Married).values))
    print(f'Chi-square Statistic : {chi2} ,p-value: {p}')

Output:

    Chi-square Statistic : 79.63562874824729 ,p-value: 4.502328957824834e-19

How can I know whether the features are independent of each other from these statistics? I am trying to build a
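
For reading the result: the null hypothesis of the chi-square test is that the two variables are independent, so a small p-value is evidence against independence. A p-value of about 4.5e-19, as in the output above, rejects independence at any conventional significance level. The sketch below shows that decision rule on made-up counts; the table and the 0.05 threshold are assumptions, not the asker's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts, standing in for
# pd.crosstab(df.Gender, df.Married).values from the question.
observed = np.array([
    [80, 32],
    [130, 357],
])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
if p < alpha:
    verdict = "reject independence: the variables appear associated"
else:
    verdict = "no evidence against independence"

print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g} -> {verdict}")
```

The `expected` array holds the counts that independence would predict; comparing it with `observed` shows which cells drive the statistic.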

How to get NA values instead of a “data are essentially constant” error in t.test in R

Question: I have a large dataset with data from two groups. I want to compare them using t.test and get a list of p-values for all the columns starting with F_, but because of the data in some columns, when I use my code:

    TP_FN_ttest <- Map(t.test, x = TP[,grepl(paste0("^F_"),colnames(TP))], y = FN[,grepl(paste0("^F_"),colnames(FN))])
    TP_FN_ttest.pval <- as.data.frame(sapply(TP_FN_ttest, '[[', 'p.value'))

I get an error:

    Error in t.test.default(x = dots[[1L]][[508L]], y = dots[[2L]][[508L]]) : data are

Python 2.7 - Calculate quantiles per row

Question: I have a pandas series like this: 0 1787 1 4789 2 1350 3 1476 4 0 5 747 6 307 7 147 8 221 9 -88 10 9374 11 264 12 1109 13 502 14 360 15 194 16 4073 17 2317 18 -221 20 0 21 16 22 106 29 105 30 4189 31 171 32 42 I want to create 4 one-hot encoded variables that indicate which quartile each row's value falls in, dividing the series into 4 quartiles. It would be something like this: 0 1787 Q1 Q2 Q3 Q4 1 4789 0 0 0 0 2 1350 0 0 0 1 3 1476 1 0 0 0 4 0 0 1 0 0 5 747 0 0 1 0 6 307 1 0 1 0 7 147 0 1
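
One way this could be approached, sketched with pandas.qcut and pandas.get_dummies on the first ten values of the series. This is an assumption about what the asker wants (exactly one indicator set per row), written for a current Python 3 / pandas setup rather than the Python 2.7 of the title.

```python
import pandas as pd

# First ten values from the question's series.
s = pd.Series([1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88])

# Assign each value to a quartile of the series itself.
quartile = pd.qcut(s, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# One indicator column per quartile, as 0/1 integers.
one_hot = pd.get_dummies(quartile).astype(int)
one_hot.columns = one_hot.columns.astype(str)  # plain string column labels
one_hot.insert(0, "value", s)

print(one_hot)
```

Each row ends up with a single 1, in the column of the quartile that contains its value.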

All binary predictors in a classification task

Question: I am performing my analysis using R, and I will be implementing four algorithms: 1. RF 2. Log Reg 3. SVM 4. LDA. I have 50 predictors and 1 target variable. All my predictors and the target variable are binary (only 0s and 1s). I have the following questions: Should I convert them all into factors? Converting them into factors and applying the RF algorithm gives 100% accuracy, which surprises me. Also, for the other algorithms, how should I treat my variables before

How to compute the p-value in hypothesis testing (linear regression)

Question: Currently I'm working on an awk script to do some statistical analysis on measurement data. I'm using linear regression to get parameter estimates, standard errors, etc., and would also like to compute the p-value for a null-hypothesis test (t-test). This is my script so far; any idea how to compute the p-value?

    BEGIN {
        ybar = 0.0
        xbar = 0.0
        n = 0
        a0 = 0.0
        b0 = 0.0
        qtinf0975 = 1.960 # 5% n = inf
    }
    {
        # y_i is in $1, x_i has to be counted
        n = n + 1
        yi[n] = $1*1.0
        xi[n] = n*1.0
    }
    END {
        for ( i = 1
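
To make the target quantity concrete, here is a minimal sketch in Python (not awk) of the two-sided p-value for the slope of a simple linear regression: the t statistic is the slope divided by its standard error, and the p-value is twice the upper tail of a Student t distribution with n - 2 degrees of freedom. The measurements are hypothetical; as in the script, x_i is simply the 1-based row count.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (not from the original post).
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1])
x = np.arange(1, len(y) + 1, dtype=float)  # x_i = row number
n = len(y)

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)

slope = np.sum((x - xbar) * (y - ybar)) / sxx
intercept = ybar - slope * xbar

# Residual variance with n - 2 degrees of freedom.
residuals = y - (intercept + slope * x)
s2 = np.sum(residuals ** 2) / (n - 2)
se_slope = np.sqrt(s2 / sxx)

# Test H0: slope = 0.
t_stat = slope / se_slope
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"slope={slope:.4f}, se={se_slope:.4f}, t={t_stat:.2f}, p={p_value:.3g}")
```

The only step that needs more than sums and square roots is the tail probability of the t distribution, which here comes from `scipy.stats.t.sf`.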

How can I get aov to show me the F-statistic and p-value?

Question: The following script

    #!/usr/bin/Rscript --vanilla
    x <- c(4.5,6.4,7.2,6.7,8.8,7.8,9.6,7.0,5.9,6.8,5.7,5.2)
    fertilizer <- factor(c('A','A','A','A','B','B','B','B','C','C','C','C'))
    crop <- factor(c('I','II','III','IV','I','II','III','IV','I','II','III','IV'))
    av <- aov(x~fertilizer*crop)
    summary(av)

yields

                    Df  Sum Sq Mean Sq
    fertilizer       2 13.6800  6.8400
    crop             3  2.8200  0.9400
    fertilizer:crop  6  6.5800  1.0967

For other data, aov usually gives the F-statistic and associated p-value. What is wrong
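
The Df column in the output points to a likely cause: the three terms use 2 + 3 + 6 = 11 degrees of freedom, which is all that 12 observations provide, so nothing is left for the residual. An F-statistic is a term's mean square divided by the residual mean square, so with zero residual degrees of freedom it cannot be computed. The arithmetic, as a short sketch (variable names are illustrative):

```python
# Design from the question: 3 fertilizers x 4 crops, one observation per cell.
n_obs = 12
levels_fertilizer = 3
levels_crop = 4

df_fertilizer = levels_fertilizer - 1          # 2
df_crop = levels_crop - 1                      # 3
df_interaction = df_fertilizer * df_crop       # 6

df_total = n_obs - 1                           # 11
df_residual_full = df_total - (df_fertilizer + df_crop + df_interaction)
df_residual_additive = df_total - (df_fertilizer + df_crop)

print("residual df with interaction:   ", df_residual_full)      # 0
print("residual df without interaction:", df_residual_additive)  # 6
```

Under this reading, a model without the interaction term (x ~ fertilizer + crop) leaves 6 residual degrees of freedom, enough for F-statistics and p-values to be reported.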

Storing And Displaying Stats

Question: I am going to be writing some software in PHP to parse log files, aggregate the data, and then display it in graphs (like bar graphs, not vertices and edges). Yeah, it's basically business intelligence software, which my company has an entire team for, but apparently they don't do a great job (10 minutes to load a page just doesn't do it). Here is what I have to do: Log files are data files which store the raw data from a stats server we have set up running from our office (we send asynchronous