logistic-regression

How to fill an NPArray with another array starting at a key in Python?

拟墨画扇 submitted on 2019-12-11 06:03:26
Question: I have a DataFrame called x. It consists of 2 columns, has shape (712, 2), and looks like this:

     SibSp  Parch
731      0      0
230      1      0
627      0      0
831      1      1
391      0      0
.................

Because logistic regression needs a 'free weight' (intercept term), I build a newX variable with the shape of my x DataFrame plus one extra column, filled with blank values:

newX = np.zeros(shape=(x.shape[0], x.shape[1] + 1))

This generates a (712, 3) NumPy array:

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 ...
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

Since the first index (0) is a free weight, I
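A minimal sketch of one common way to do this, assuming the goal is to copy the existing columns into newX starting at column 1 so that column 0 stays reserved for the intercept (the toy SibSp/Parch data below only mirrors the question; everything else is illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the (712, 2) DataFrame from the question
x = pd.DataFrame({"SibSp": [0, 1, 0, 1, 0], "Parch": [0, 0, 0, 1, 0]})

# One extra column on the left for the intercept ("free weight")
newX = np.zeros(shape=(x.shape[0], x.shape[1] + 1))

newX[:, 0] = 1.0            # the intercept column is usually set to 1, not 0
newX[:, 1:] = x.to_numpy()  # copy the original columns starting at index 1

print(newX)
```

Slicing assigns the whole block in one step, so no explicit loop over rows or columns is needed.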

Why does the Logistic Regression cost go negative and not correct?

无人久伴 submitted on 2019-12-11 05:27:43
Question: I am implementing logistic regression in Matlab. The data is normalized (mean and standard deviation). I understand that, depending on your learning rate, you may overshoot the optimal point. But doesn't that mean your cost starts going up? In my case the cost goes into negative territory, and I don't understand why. Here is the (I think standard) cost and weight update rule:

function J = crossEntropyError(w, x, y)
  h = sigmoid(x*w);
  J = (-y'*log(h) - (1-y')*log(1-h));
end

Weight update:

function w = updateWeights
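For reference, a NumPy sketch of the same cross-entropy cost (variable names mirror the Matlab snippet; the clipping constant is an illustrative choice). With labels y in {0, 1} and predictions h strictly inside (0, 1), every term -y*log(h) - (1-y)*log(1-h) is non-negative, so the summed cost cannot drop below zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_error(w, x, y):
    """Summed logistic-regression cost, mirroring the Matlab crossEntropyError."""
    h = sigmoid(x @ w)
    h = np.clip(h, 1e-12, 1 - 1e-12)  # keep log() finite near 0 and 1
    return float(-y @ np.log(h) - (1 - y) @ np.log(1 - h))

# Tiny check: for y in {0, 1} the cost is always >= 0
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)
w = rng.normal(size=3)
print(cross_entropy_error(w, x, y) >= 0)  # True
```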

R: Clustering standard errors in MASS::polr()

本秂侑毒 submitted on 2019-12-11 05:14:50
Question: I am trying to estimate an ordinal logistic regression with clustered standard errors using the MASS package's polr() function. There is no built-in clustering feature, so I am looking for (a) packages or (b) manual methods for calculating clustered standard errors from the model output. I plan to use the margins package to estimate marginal effects from the model. Here is an example:

library(MASS)
set.seed(1)
obs <- 500

# Create data frame
dat <- data.frame(y = as.factor(round(rnorm(n = obs,

scikit-learn LogisticRegressionCV: best coefficients

情到浓时终转凉″ submitted on 2019-12-11 05:09:40
Question: I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation where the "refit" parameter is True. If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that
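A small sketch of where these quantities live on a fitted LogisticRegressionCV (the synthetic data and parameter values are illustrative; C_, coef_, and scores_ are the standard scikit-learn attributes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

clf = LogisticRegressionCV(Cs=10, cv=5, refit=True, random_state=0)
clf.fit(X, y)

print(clf.C_)                # chosen regularization strength
print(clf.coef_)             # coefficients of the final (refit) model
print(clf.scores_[1].shape)  # per-fold scores: (n_folds, n_Cs) for class 1
```

Inspecting scores_ directly makes it easy to check which C has the highest average score across folds and compare that against the coefficients the refit step produces.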

R lsmeans adjust multiple comparison

不想你离开。 submitted on 2019-12-11 04:19:13
Question: I used lme4 to run a mixed-effects logistic regression (by calling glmer) in R and now I am trying to do post-hoc comparisons. As they are pairwise, Tukey should be OK, but I would like to manually adjust how many tests the correction is made for - currently it is made for 12 tests, but I am only interested in 6 comparisons. My code looks like this so far:

for (i in seq_along(logmixed_ranks)) {
  print(lsmeans(logmixed_ranks[[i]], pairwise~rating_ranks*indicator_var, adjust="tukey"))
}

Somehow I

Running a multivariate ordered logit in PyMC3

强颜欢笑 submitted on 2019-12-11 02:49:22
Question: I'm trying to build a Bayesian multivariate ordered logit model using PyMC3. I have gotten a toy multivariate logit model working based on the examples in this book. I've also gotten an ordered logistic regression model running based on the example at the bottom of this page. However, I cannot get an ordered, multivariate logistic regression to run. I think the issue could be the way the cutpoints are specified, specifically the shape parameter, but I'm not sure why it would be different if
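For context, a minimal single-predictor ordered logistic model in PyMC3 looks roughly like the sketch below (toy data; the priors, names, and testval for the ordered cutpoints are illustrative choices, and the multivariate extension the question asks about is not shown):

```python
import numpy as np
import pymc3 as pm

# Toy data: one predictor, ordinal outcome with 4 categories (0..3)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.digitize(x + rng.normal(scale=0.5, size=200), [-1.0, 0.0, 1.0])

K = 4  # number of ordered categories
with pm.Model() as model:
    beta = pm.Normal("beta", mu=0.0, sigma=5.0)
    # K - 1 cutpoints; the ordered transform keeps them sorted during sampling
    cutpoints = pm.Normal(
        "cutpoints", mu=0.0, sigma=5.0, shape=K - 1,
        transform=pm.distributions.transforms.ordered,
        testval=np.arange(K - 1) - 1.0,
    )
    eta = beta * x
    pm.OrderedLogistic("y_obs", eta=eta, cutpoints=cutpoints, observed=y)
    trace = pm.sample(500, tune=500, chains=2, cores=1)
```

The cutpoints vector has shape K - 1 for K outcome categories, which is the shape detail the question suspects is going wrong in the multivariate case.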

sklearn TimeSeriesSplit cross_val_predict only works for partitions

自古美人都是妖i submitted on 2019-12-11 01:41:25
Question: I am trying to use the TimeSeriesSplit cross-validation strategy in sklearn version 0.18.1 with a LogisticRegression estimator. I get an error stating that:

cross_val_predict only works for partitions

The following code snippet shows how to reproduce:

from sklearn import linear_model, neighbors
from sklearn.model_selection import train_test_split, cross_val_predict, TimeSeriesSplit, KFold, cross_val_score
import pandas as pd
import numpy as np
from datetime import date, datetime

df = pd
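A small sketch of the underlying constraint, with synthetic data (dataset and model settings are illustrative): cross_val_predict needs every sample to land in exactly one test fold, which TimeSeriesSplit does not provide because the earliest observations are only ever used for training, whereas cross_val_score has no such requirement:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)

# Scoring fold by fold works with an expanding-window split
print(cross_val_score(LogisticRegression(), X, y, cv=tscv))

# cross_val_predict(LogisticRegression(), X, y, cv=tscv)
# would raise "cross_val_predict only works for partitions" because the
# first training window never appears in any test fold
```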

Stata drops variables that “predicts failure perfectly” even though the correlation between the variables isn't 1 or -1?

让人想犯罪 __ submitted on 2019-12-10 21:26:31
Question: I am running a logit regression on some data. My dependent variable is binary, as are all but one of my independent variables. When I run my regression, Stata drops many of my independent variables and gives the error:

"variable name" != 0 predicts failure perfectly
"variable name" dropped and "a number" obs not used

I know for a fact that some of the variables dropped don't predict failure perfectly. In other words, the dependent variable can take on the value 1 for either the value 1 or 0
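An illustrative toy example (in Python/pandas rather than Stata, purely to show what the message refers to): "predicts failure perfectly" is about an empty cell in the cross-tabulation - whenever the regressor takes one of its values, the outcome is always 0 - and that can happen even when the correlation between the two variables is nowhere near 1 or -1:

```python
import pandas as pd

# Toy data: whenever x == 1 the outcome y is always 0 ("x != 0 predicts
# failure perfectly"), yet x and y are far from perfectly correlated.
df = pd.DataFrame({
    "y": [0, 0, 0, 1, 0, 1, 0, 0],
    "x": [1, 1, 1, 0, 0, 0, 0, 1],
})

print(pd.crosstab(df["x"], df["y"]))  # the (x=1, y=1) cell is empty
print(df["x"].corr(df["y"]))          # roughly -0.58, not -1
```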

How to interpret probability column in spark logistic regression prediction?

夙愿已清 submitted on 2019-12-10 20:28:14
Question: I'm getting predictions through spark.ml.classification.LogisticRegressionModel.predict. A number of the rows have the prediction column as 1.0 and the probability column as .04. The model.getThreshold is 0.5, so I'd assume the model is classifying everything over a 0.5 probability threshold as 1.0. How am I supposed to interpret a result with a 1.0 prediction and a probability of 0.04?

Answer 1: The probability column from performing a LogisticRegression should contain a list with the same length
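A PySpark sketch of how that probability vector is usually unpacked (the toy data, column names, and the UDF are illustrative; "probability" and "prediction" are the spark.ml default output columns). The probability column holds one entry per class, so a prediction of 1.0 should line up with the second element of the vector rather than the first:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[1]").appName("prob-demo").getOrCreate()

# Tiny toy training set: label and a single-feature vector column
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0])), (0.0, Vectors.dense([1.0])),
     (1.0, Vectors.dense([2.0])), (1.0, Vectors.dense([3.0]))],
    ["label", "features"],
)

model = LogisticRegression().fit(train)
predictions = model.transform(train)

# The "probability" column is a vector with one entry per class;
# the prediction compares the class-1 entry against the threshold.
p1 = udf(lambda v: float(v[1]), DoubleType())
predictions.select("prediction", p1("probability").alias("p_class_1")).show()
```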

Python sklearn Multilabel Classification : UserWarning: Label not 226 is present in all training examples

给你一囗甜甜゛ submitted on 2019-12-10 19:50:34
Question: I am trying out a multilabel classification problem. My data looks like this:

DocID  Content            Tags
1      some text here...  [70]
2      some text here...  [59]
3      some text here...  [183]
4      some text here...  [173]
5      some text here...  [71]
6      some text here...  [98]
7      some text here...  [211]
8      some text here...  [188]
.      .............      .....
.      .............      .....
.      .............      .....

Here is my code:

traindf = pd.read_csv("mul.csv")
print "This is what our training data looks like:"
print traindf
t=TfidfVectorizer
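A common way to wire up this kind of tag prediction, sketched with made-up data (MultiLabelBinarizer, TfidfVectorizer, and a one-vs-rest LogisticRegression are standard scikit-learn pieces; the column names echo the question, but the rest is illustrative):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

traindf = pd.DataFrame({
    "Content": ["some text here", "more text here", "other text here", "last text here"],
    "Tags": [[70], [59], [70, 183], [59]],
})

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(traindf["Content"])

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(traindf["Tags"])  # one indicator column per tag

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

print(clf.predict(vectorizer.transform(["some new text here"])))
```

The warning quoted in the title is typically emitted by the one-vs-rest step when a tag (here 226) never occurs among the training examples, so that tag's binary sub-problem has only one class.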