statistics

R - cox hazard model not including levels of a factor

浪子不回头ぞ submitted on 2021-02-19 23:59:13
Question: I am fitting a Cox model to some data that is structured as follows:

    str(test)
    'data.frame': 147 obs. of 8 variables:
     $ AGE              : int 71 69 90 78 61 74 78 78 81 45 ...
     $ Gender           : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
     $ RACE             : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
     $ SIDE             : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
     $ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
     $ RUTH.CLASS       : int 3 5 4 5 4 3 3 5 5 5 ...
     $

p-value from fisher.test() does not match phyper()

感情迁移 submitted on 2021-02-19 04:42:16
Question: Fisher's exact test is related to the hypergeometric distribution, and I would expect these two commands to return identical p-values. Can anyone explain what I'm doing wrong that they do not match?

    # data (variable names chosen to match dhyper() argument names)
    x = 14
    m = 20
    n = 41047
    k = 40

    # Fisher test, alternative = 'greater'
    (fisher.test(matrix(c(x, m-x, k-x, n-(k-x)),2,2), alternative='greater'))$p.value
    # returns 2.01804e-39

    # hypergeometric distribution, lower.tail = F, i.e. P[X >
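The question is cut off before the phyper() call, so the following is only a sketch of one common cause of such a mismatch, not a diagnosis of the asker's code. In R, phyper(x, m, n, k, lower.tail = FALSE) returns the strict tail P[X > x], whereas the one-sided Fisher p-value for this table is the inclusive tail P[X >= x]. The helper names below are hypothetical; only the values of x, m, n and k come from the question.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, m, n, k):
    """P[X = i] when k balls are drawn from an urn with m white and n black balls."""
    return Fraction(comb(m, i) * comb(n, k - i), comb(m + n, k))


def upper_tail(x, m, n, k, inclusive):
    """Return P[X >= x] if inclusive is True, otherwise the strict tail P[X > x]."""
    start = x if inclusive else x + 1
    terms = (hypergeom_pmf(i, m, n, k) for i in range(start, min(m, k) + 1))
    return sum(terms, Fraction(0))


# Values taken from the question.
x, m, n, k = 14, 20, 41047, 40

p_ge = upper_tail(x, m, n, k, inclusive=True)   # inclusive tail, P[X >= 14]
p_gt = upper_tail(x, m, n, k, inclusive=False)  # strict tail, P[X > 14]

print(float(p_ge))
print(float(p_gt))
```

The two tails differ by exactly P[X = x], which is why shifting the first argument to x - 1 in the strict-tail form recovers the inclusive tail.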

Using Pandas to sample DataFrame using a specific column's weight

夙愿已清 submitted on 2021-02-18 11:43:27
Question: I have a DataFrame which looks like:

    index  name   city
    0      Yam    Hadera
    1      Meow   Hadera
    2      Don    Hadera
    3      Jazz   Hadera
    4      Bond   Tel Aviv
    5      James  Tel Aviv

I want Pandas to randomly choose values, using the number of appearances in the city column (something like df.city.value_counts()), so that the result of my magic function, say df.magic_sample(3, weight_column='city'), might look like:

    0  Yam   Hadera
    1  Meow  Hadera
    2  Bond  Tel Aviv

Thanks! :)
Answer 1: You can group by city and then sample each group based on

dplyr filtering on multiple columns using “%in%”

徘徊边缘 submitted on 2021-02-17 06:38:14
Question: I have a dataframe (df1) with multiple columns (ID, Number, Location, Field, Weight). I also have another dataframe (df2) with more information (ID, PassRate, Number, Weight). I am trying to use dplyr and %in% to filter out rows in df1 that have the same two values as df2. So far I have:

    df_sub <- subset(df1, df1$ID %in% df2$ID & df1$Weight %in% df2$Weight)

But this is only subsetting on the first condition... any idea why?
Answer 1: From the question and sample code, it is unclear whether you want
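The answer is truncated, so this is only an illustration of one possible reading: that the asker wants rows of df1 whose (ID, Weight) pair occurs together in a row of df2. Two separate %in% tests check each column against the whole of the other column independently, which is a weaker condition. The sketch below uses plain Python tuples as hypothetical stand-ins for the two data frames; the values are made up.

```python
# Hypothetical (ID, Weight) rows standing in for df1 and df2.
df1_rows = [(1, 10), (2, 20), (3, 30)]
df2_rows = [(1, 20), (2, 10), (3, 30)]

# Independent membership: each column is checked on its own,
# mirroring ID %in% df2$ID & Weight %in% df2$Weight.
ids = {row_id for row_id, _ in df2_rows}
weights = {weight for _, weight in df2_rows}
independent = [row for row in df1_rows if row[0] in ids and row[1] in weights]

# Pairwise membership: the (ID, Weight) pair must occur together in df2.
pairs = set(df2_rows)
pairwise = [row for row in df1_rows if row in pairs]

print(independent)
print(pairwise)
```

With this made-up data every row passes the independent test, because each ID and each Weight appears somewhere in df2, while only one row passes the pairwise test.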

Find the minima/valley points and get the index where the valley starts and ends in R

浪子不回头ぞ submitted on 2021-02-16 15:29:06
Question: I am fairly new to statistics and R. I need to find the peaks and valleys, and the index where each peak/valley starts and ends. For the maxima/peaks I found the findPeaks function, which covers the peak requirement, but I am unable to find any package for finding the valley points that suits my requirement. The following is the R function for finding the peaks:

    function (x, nups = 1, ndowns = nups, zero = "0", peakpat = NULL, minpeakheight = -Inf, minpeakdistance = 1,

Why does StandardScaler have different effects with different numbers of features

冷暖自知 submitted on 2021-02-16 15:16:38
Question: I experimented with the breast cancer data from scikit-learn.

Using all features, without StandardScaler:

    cancer = datasets.load_breast_cancer()
    x = cancer.data
    y = cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    pla = Perceptron().fit(x_train, y_train)
    y_pred = pla.predict(x_test)
    print(accuracy_score(y_test, y_pred))

Result 1: 0.9473684210526315

Using all features, with StandardScaler:

    cancer = datasets.load_breast_cancer()
    x = cancer

Generate random data based on existing data

好久不见. submitted on 2021-02-16 14:52:16
Question: Is there a way in Python to generate random data based on the distribution of already existing data? Here are the summary statistics of my dataset:

           Data
    count  209.000000
    mean     1.280144
    std      0.374602
    min      0.880000
    25%      1.060000
    50%      1.150000
    75%      1.400000
    max      4.140000

Since the data is not normally distributed, np.random.normal will not work. Any ideas? Thank you.

Edit: Performing KDE:

    from sklearn.neighbors import KernelDensity
    # Gaussian KDE
    kde = KernelDensity(kernel='gaussian',
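The KDE snippet in the question is cut off, so the following is a standard-library sketch of the same idea rather than the asker's code. Drawing from a Gaussian kernel density estimate amounts to picking an observed point at random and adding Gaussian noise whose standard deviation is the bandwidth. The function name, the data values and the bandwidth of 0.1 are all assumptions made for illustration; the asker's 209 observations are not available.

```python
import random
import statistics


def sample_gaussian_kde(data, n, bandwidth, rng):
    """Draw n values from a Gaussian KDE over data.

    Each draw picks one observed point uniformly at random and adds
    N(0, bandwidth) noise, which is equivalent to sampling the KDE.
    """
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in range(n)]


# Hypothetical right-skewed data (made up, not the asker's dataset).
observed = [0.88, 0.95, 1.02, 1.06, 1.10, 1.15, 1.21, 1.30, 1.40, 1.75, 2.60, 4.14]

rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic = sample_gaussian_kde(observed, n=1000, bandwidth=0.1, rng=rng)

print(statistics.mean(observed), statistics.mean(synthetic))
```

The bandwidth controls how far synthetic values stray from the observed ones. A bandwidth near zero reduces this to plain resampling with replacement, and in practice it would be chosen by a rule of thumb or cross-validation rather than fixed by hand.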

Compute a confidence interval from sample data assuming unknown distribution

让人想犯罪 __ submitted on 2021-02-12 11:32:07
Question: I have sample data for which I would like to compute a confidence interval, assuming the distribution is not normal and is unknown. It looks like the distribution is Pareto, but I don't know for sure. Answers for the normal distribution:

    Compute a confidence interval from sample data
    Correct way to obtain confidence interval with scipy

Answer 1: If you don't know the underlying distribution, then my first thought would be to use bootstrapping: https://en.wikipedia.org/wiki/Bootstrapping_
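The bootstrapping suggestion in the answer can be sketched with the standard library alone. This is a minimal percentile bootstrap for the mean, one of several bootstrap interval variants and not necessarily the one the answerer had in mind. The function name bootstrap_ci, the resample count and the Pareto-like sample generated with random.paretovariate are assumptions for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = rng or random.Random()
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the estimates.
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)
    return estimates[lower_index], estimates[upper_index]


rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical heavy-tailed sample standing in for the asker's data.
sample = [rng.paretovariate(3.0) for _ in range(200)]

lo, hi = bootstrap_ci(sample, statistics.mean, rng=rng)
print(lo, statistics.mean(sample), hi)
```

No distributional form is assumed here; the interval comes entirely from the spread of the resampled means. For very heavy tails or small samples the percentile interval can be too narrow, so it is best treated as a first approximation.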
