logistic-regression

Logistic regression Python solvers' definitions

核能气质少年 submitted on 2019-11-29 20:17:42
I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem. Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear" are doing? If not, any related links or reading materials are much appreciated too. Thanks a lot in advance. Well, I hope I'm not too late to the party! Let me first try to establish some intuition before digging into loads of information (warning: this is not a brief comparison). Introduction: A hypothesis h(x) takes an input and gives us the

Assessing/Improving prediction with linear discriminant analysis or logistic regression

℡╲_俬逩灬. submitted on 2019-11-29 15:18:03
Question: I recently needed to combine two or more variables in a data set to evaluate whether their combination could improve predictive power, so I ran some logistic regressions in R. Now, on the statistics Q&A site, someone suggested that I could use linear discriminant analysis. Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to predict AUC or anything else I could use. Indeed, I see that fit output vector of lda in R is some sort of vector with

Vectorization of logistic regression cost

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-29 12:46:24
Question: I have this code for the cost in logistic regression, in MATLAB:

    function [J, grad] = costFunction(theta, X, y)
    m = length(y); % number of training examples
    thetas = size(theta,1);
    features = size(X,2);
    steps = 100;
    alpha = 0.1;
    J = 0;
    grad = zeros(size(theta));
    sums = [];
    result = 0;
    for i=1:m
        % sums = [sums; (y(i))*log10(sigmoid(X(i,:)*theta))+(1-y(i))*log10(1-sigmoid(X(i,:)*theta))]
        sums = [sums; -y(i)*log(sigmoid(theta'*X(i,:)'))-(1-y(i))*log(1-sigmoid(theta'*X(i,:)'))]; %use log simple
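For reference, the per-example loop in this excerpt can be written without a loop. The block below is a sketch of the usual vectorized form, under assumptions not stated in the excerpt: X is the m-by-n design matrix, y is the m-by-1 label vector with entries in {0, 1}, \sigma is the sigmoid applied elementwise, log is the natural logarithm applied elementwise, and the cost is averaged over the m examples.

```latex
h = \sigma(X\theta), \qquad
J(\theta) = -\frac{1}{m}\left[\, y^{\top}\log(h) + (1-y)^{\top}\log(1-h) \,\right], \qquad
\nabla_{\theta} J(\theta) = \frac{1}{m}\, X^{\top}(h - y)
```

Each entry of h corresponds to one sigmoid(theta'*X(i,:)') term in the loop, so the inner products with y and 1-y replace the accumulation into sums.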

Spark Java Error: Size exceeds Integer.MAX_VALUE

两盒软妹~` submitted on 2019-11-29 07:35:17
Question: I am trying to use Spark for a simple machine learning task. I used PySpark and Spark 1.2.0 to solve a simple logistic regression problem. I have 1.2 million records for training, and I hashed the features of the records. When I set the number of hashed features to 1024, the program works fine, but when I set the number of hashed features to 16384, the program fails several times with the following error: Py4JJavaError: An error occurred while calling o84.trainLogisticRegressionModelWithSGD.

Binary classification in TensorFlow, unexpected large values for loss and accuracy

99封情书 submitted on 2019-11-29 05:01:32
I am trying to use a deep neural network architecture to classify against a binary label that takes the values -1 and +1. Here is my code to do it in TensorFlow:

    import tensorflow as tf
    import numpy as np
    from preprocess import create_feature_sets_and_labels

    train_x,train_y,test_x,test_y = create_feature_sets_and_labels()

    x = tf.placeholder('float', [None, 5])
    y = tf.placeholder('float')

    n_nodes_hl1 = 500
    n_nodes_hl2 = 500
    n_nodes_hl3 = 500
    n_classes = 1
    batch_size = 100

    def neural_network_model(data):
        hidden_1_layer = {'weights':tf.Variable(tf.random_normal([5, n_nodes_hl1])),
                          'biases':tf.Variable(tf

Is my implementation of stochastic gradient descent correct?

和自甴很熟 submitted on 2019-11-29 01:00:02
Question: I am trying to develop stochastic gradient descent, but I don't know if it is 100% correct. The cost generated by my stochastic gradient descent algorithm is sometimes very far from the one generated by fminunc or batch gradient descent. While the batch gradient descent cost converges when I set a learning rate alpha of 0.2, I am forced to set a learning rate alpha of 0.0001 for my stochastic implementation to keep it from diverging. Is this normal? Here are some results I obtained with a training set
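To make the comparison in this question concrete, here is a minimal pure-Python sketch of the two update rules for logistic regression. Everything in it is an assumption for illustration: the toy data, the function names, and the learning rates are hypothetical and are not the asker's code or data. Batch gradient descent averages the gradient over all m examples before each update, while the stochastic version updates after every single example, so the same alpha produces noisier steps.

```python
import math
import random

# Hypothetical toy data: one feature plus a bias term, labels not perfectly separable.
XS = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
YS = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(theta):
    # Average cross-entropy cost over the whole training set.
    total = 0.0
    for x, y in zip(XS, YS):
        h = sigmoid(theta[0] + theta[1] * x)
        total += -y * math.log(h) - (1 - y) * math.log(1.0 - h)
    return total / len(XS)


def batch_gd(alpha=0.2, iterations=500):
    # One update per pass, using the gradient averaged over all examples.
    theta = [0.0, 0.0]
    m = len(XS)
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y in zip(XS, YS):
            err = sigmoid(theta[0] + theta[1] * x) - y
            g0 += err
            g1 += err * x
        theta[0] -= alpha * g0 / m
        theta[1] -= alpha * g1 / m
    return theta


def sgd(alpha=0.05, epochs=200, seed=0):
    # One update per example, visiting the examples in shuffled order.
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    order = list(range(len(XS)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = sigmoid(theta[0] + theta[1] * XS[i]) - YS[i]
            theta[0] -= alpha * err
            theta[1] -= alpha * err * XS[i]
    return theta


if __name__ == "__main__":
    print("initial cost:", cost([0.0, 0.0]))
    print("batch GD cost:", cost(batch_gd()))
    print("SGD cost:", cost(sgd()))
```

With a constant learning rate, the stochastic version keeps fluctuating around the minimum instead of settling on it, which is one common reason its reported cost differs from the batch result.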

Comparison of R, statsmodels, sklearn for a classification task with logistic regression

你说的曾经没有我的故事 submitted on 2019-11-28 21:36:47
I have run some experiments with logistic regression in R, Python statsmodels and sklearn. While the results given by R and statsmodels agree, there is some discrepancy with what sklearn returns. I would like to understand why these results differ. I understand that the optimization algorithms used under the hood are probably not the same. Specifically, I use the standard Default dataset (used in the ISL book). The following Python code reads the data into a dataframe Default:

    import pandas as pd
    # data is available here
    Default = pd.read_csv('https://d1pqsl2386xqi9

ValueError: Unknown label type: 'unknown'

£可爱£侵袭症+ submitted on 2019-11-28 19:15:11
I am trying to run the following code. By the way, I am new to both Python and sklearn.

    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # data import and preparation
    trainData = pd.read_csv('train.csv')
    train = trainData.values
    testData = pd.read_csv('test.csv')
    test = testData.values
    X = np.c_[train[:, 0], train[:, 2], train[:, 6:7], train[:, 9]]
    X = np.nan_to_num(X)
    y = train[:, 1]
    Xtest = np.c_[test[:, 0:1], test[:, 5:6], test[:, 8]]
    Xtest = np.nan_to_num(Xtest)

    # model
    lr = LogisticRegression()
    lr.fit(X, y)

where y is a np.ndarray of 0's and 1's. I receive the

Correctness of logistic regression in Vowpal Wabbit?

泄露秘密 submitted on 2019-11-28 16:54:42
I have started using Vowpal Wabbit for logistic regression, but I am unable to reproduce the results it gives. Perhaps there is some undocumented "magic" it does, but has anyone been able to replicate / verify / check the calculations for logistic regression? For example, with the simple data below, we aim to model the way age predicts label. There is obviously a strong relationship: as age increases, the probability of observing 1 increases. As a simple unit test, I used the 12 rows of data below:

    age  label
    20   0
    25   0
    30   0
    35   0
    40   0
    50   0
    60   1
    65   0
    70   1
    75   1
    77   1
    80   1

Now,
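One way to get a reference value for a check like this is to compute the batch maximum-likelihood fit on the same 12 rows independently of any library. The sketch below does that in pure Python with Newton's method and a simple step-halving safeguard. It is an illustration under stated assumptions, not Vowpal Wabbit's algorithm: the model is an unregularized intercept plus one age coefficient, and all function names are hypothetical.

```python
import math

# The 12 (age, label) rows quoted in the question.
DATA = [(20, 0), (25, 0), (30, 0), (35, 0), (40, 0), (50, 0),
        (60, 1), (65, 0), (70, 1), (75, 1), (77, 1), (80, 1)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(b0, b1):
    # Sum of y*z - log(1 + exp(z)), computed without overflow.
    total = 0.0
    for age, label in DATA:
        z = b0 + b1 * age
        total += label * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_newton(max_iter=50, tol=1e-10):
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Gradient g of the log-likelihood and matrix H = X^T W X (negative Hessian).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for age, label in DATA:
            p = sigmoid(b0 + b1 * age)
            w = p * (1.0 - p)
            g0 += label - p
            g1 += (label - p) * age
            h00 += w
            h01 += w * age
            h11 += w * age * age
        # Solve the 2x2 system H d = g for the Newton direction d.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Halve the step until the log-likelihood does not decrease.
        step = 1.0
        current = log_likelihood(b0, b1)
        while step > 1e-8 and log_likelihood(b0 + step * d0, b1 + step * d1) < current:
            step *= 0.5
        b0 += step * d0
        b1 += step * d1
        if abs(step * d0) + abs(step * d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    intercept, slope = fit_newton()
    print("intercept:", intercept, "age coefficient:", slope)
    for age in (20, 50, 80):
        print(age, sigmoid(intercept + slope * age))
```

An online learner that makes a single pass of stochastic updates over 12 rows should not be expected to land exactly on this batch optimum, so a difference from these values does not by itself indicate an error.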

How to find the importance of the features for a logistic regression model?

十年热恋 submitted on 2019-11-28 15:48:39
I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative class. I know there is a coef_ parameter that comes from the scikit-learn package, but I don't know whether it is enough to determine importance. Another thing is how I can evaluate the coef_ values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients and I don't know what they are. Let's say there are features like size of tumor, weight of tumor, etc. to make a
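The standardized coefficients mentioned in this question can be illustrated with a small sketch. One common convention multiplies each raw coefficient by the standard deviation of its feature, so that coefficients of features measured in different units become comparable. All numbers below are hypothetical: the feature columns and the raw coefficients are made up for illustration and stand in for values that would be read from a fitted model's coef_.

```python
import statistics

# Hypothetical feature columns (names echo the tumor example in the question).
features = {
    "size_mm": [10.0, 20.0, 30.0, 40.0],
    "weight_g": [1.0, 1.5, 2.0, 2.5],
}

# Hypothetical raw coefficients, standing in for values taken from a fitted model.
raw_coef = {"size_mm": 0.05, "weight_g": 0.6}

# Standardized coefficient = raw coefficient * standard deviation of the feature.
standardized = {
    name: raw_coef[name] * statistics.pstdev(values)
    for name, values in features.items()
}

# Rank features by the magnitude of the standardized coefficient.
ranked = sorted(standardized.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name}: {value:+.3f}")
```

In this made-up example the feature with the smaller raw coefficient ranks first once scale is taken into account. The sign of a coefficient gives the direction: a positive value raises the log-odds of the positive class as the feature increases, and a negative value lowers it.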