scikit-learn

Sklearn's PCA gives 'wrong' output for last row

Submitted by 送分小仙女 on 2021-01-28 11:05:57

Question: I am trying to run data through sklearn's PCA (n_components=2) and find that the y-value of the last row differs from that of other rows with the same input values. Notably, the input data consists of only two distinct entries, and changing the number of occurrences of an entry makes the error disappear. The code below reproduces the error:

import pandas as pd
from sklearn.decomposition import PCA

lst1 = [[-0.485886999,0,-0.485886999,-0.485886999,-0.485886999,0,-0.485886999,-0
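A minimal sketch (with hypothetical substitute data, since the original list is truncated) of the invariant the asker expects: rows with identical input values should map to identical PCA coordinates, which can be checked directly rather than by eye.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with only two distinct rows, mirroring the question's setup.
X = np.array([[0.5, 0.0, 0.5]] * 5 + [[0.0, 1.0, 0.0]] * 4)

coords = PCA(n_components=2).fit_transform(X)

# Identical input rows should yield identical projected coordinates
# (up to floating-point error); a violation would confirm the reported bug.
assert np.allclose(coords[0], coords[1:5].mean(axis=0), atol=1e-8)
```

If this assertion fails on a given sklearn version, that isolates the discrepancy to the PCA transform rather than to the input data.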

How to keep track of columns after encoding categorical variables?

Submitted by 坚强是说给别人听的谎言 on 2021-01-28 10:54:48

Question: How can I keep track of the original columns of a dataset after performing preprocessing on it? In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth. However, once I encode the categorical column B, df_columns is no longer valid for keeping track of df_dummies:

import pandas as pd
import numpy as np

animal = ['dog','cat','horse']
df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for i in range(9)],

Does the decision function in scikit-learn return the true distance to the hyperplane?

Submitted by 孤人 on 2021-01-28 09:00:53

Question: Does decision_function return the actual distance to the hyperplane for each sample, as stated here? Or do you have to do the extra calculation shown here? Which method should be used?

Answer 1: No, that is not the actual distance. Depending on the case, you may (linear kernel) or may not (non-linear kernel) be able to convert it to an actual distance. Here is another good explanation. No matter what, yes, you have to take that extra step if you want the actual distance.

Source: https:/
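A sketch of the extra step for the linear-kernel case: decision_function returns a signed score w·x + b, and dividing by ||w|| converts it to a true signed geometric distance from the hyperplane. (The data here is synthetic, for illustration.)

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel='linear').fit(X, y)

scores = clf.decision_function(X)       # signed, but unnormalized
w_norm = np.linalg.norm(clf.coef_)      # ||w||, available only for linear kernels
distances = scores / w_norm             # true signed distances to the hyperplane

assert distances.shape == (40,)
```

For non-linear kernels coef_ does not exist, which is why no such conversion to a geometric distance is available there.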

How to fix "Found input variables with inconsistent numbers of samples: [219, 247]"

Submitted by 时间秒杀一切 on 2021-01-28 08:32:13

Question: As the title says, when running the following code I get the error Found input variables with inconsistent numbers of samples: [219, 247]. I have read that the problem should be in the np.array setup for X and y, but I cannot pin it down, because there is a price for every date, so I don't see why it is happening. Any help will be appreciated, thanks!

import pandas as pd
import quandl, math, datetime
import numpy as np
from sklearn import preprocessing, svm, model_selection
from sklearn.linear
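A sketch of what the error means and one way to diagnose it, using hypothetical arrays in place of the truncated Quandl code: sklearn raises this message whenever len(X) != len(y), so printing both shapes and then deriving X and y from the same rows removes the mismatch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins: 247 feature rows but only 219 labels -> the exact error.
X = np.arange(247, dtype=float).reshape(-1, 1)
y = np.arange(219, dtype=float)

print(X.shape, y.shape)          # first diagnostic step: compare the two lengths

n = min(len(X), len(y))          # align both on their common rows
X, y = X[:n], y[:n]

model = LinearRegression().fit(X, y)   # now len(X) == len(y), so fit succeeds
```

In the original script the mismatch most likely comes from building y from a shifted or dropna'd column while X keeps the full length; slicing both from the same cleaned DataFrame avoids it.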

Python import error “getfullargspec”

Submitted by 一曲冷凌霜 on 2021-01-28 07:34:59

Question: When I do:

from sklearn import linear_model

I get the error:

ImportError: cannot import name 'getfullargspec'

Interestingly, this did not happen a few days ago. So I tried installing Python and the SciPy stack again on my computer using Anaconda, but that did not solve the problem. What might be wrong with my system? Thank you in advance.

Answer 1: Install using pip:

pip install scipy

And use pip to install the following packages such as numpy, pandas, etc. If you are using Python 3 then install

sci-kit learn TruncatedSVD explained_variance_ratio_ not in descending order? [duplicate]

Submitted by 南笙酒味 on 2021-01-28 07:32:19

Question: This question already has an answer here: Why Sklearn TruncatedSVD's explained variance ratios are not in descending order? (1 answer). Closed 9 months ago. This question is actually a duplicate of that one, which however remains unanswered at the time of writing. Why is the explained_variance_ratio_ from TruncatedSVD not in descending order, like it would be from PCA? In my experience it seems that the first element of the list is always the lowest, and then at the second element the value
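A sketch demonstrating the behavior and a simple workaround: unlike PCA, TruncatedSVD does not center the data, so its components track the largest singular values of the raw matrix and are not guaranteed to come out in descending explained-variance order; sorting the ratios afterwards is safe if an ordered view is needed.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 10)   # uncentered data: the mean direction dominates the first component

svd = TruncatedSVD(n_components=5, random_state=0).fit(X)
ratios = svd.explained_variance_ratio_

order = np.argsort(ratios)[::-1]   # component indices, best-explaining first
sorted_ratios = ratios[order]

assert np.all(np.diff(sorted_ratios) <= 0)  # now descending
```

This matches the asker's observation: on uncentered data the first TruncatedSVD component often has a low explained-variance ratio even though it has the largest singular value.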

Scikit Learn sklearn.linear_model.LinearRegression: View the results of the model generated

Submitted by 扶醉桌前 on 2021-01-28 07:22:22

Question: I can get sklearn.linear_model.LinearRegression to process my data, or at least to run the script without raising any exceptions or warnings. The only issue is that I am not trying to plot the results with matplotlib; instead I want to see the estimates and diagnostic statistics for the model. How can I get a model summary, such as the slope and intercept (B0, B1), adjusted R squared, etc., to display in the console or populate a variable instead of plotting it? This is a generic
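A sketch of reading the fitted estimates directly: LinearRegression exposes the slope(s) and intercept as attributes, and .score() returns R squared, all as plain variables. (The toy data here is hypothetical; for a full diagnostic table with p-values and adjusted R squared, statsmodels' OLS summary is the usual tool.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])        # exactly y = 2x + 1

model = LinearRegression().fit(X, y)

b1 = model.coef_[0]      # slope (B1)
b0 = model.intercept_    # intercept (B0)
r2 = model.score(X, y)   # R squared on the training data

print(f"B0={b0:.3f}, B1={b1:.3f}, R^2={r2:.3f}")
```

Each of these is an ordinary float/array, so it can be logged, stored, or printed without any plotting.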

Explain matplotlib contourf function

Submitted by 感情迁移 on 2021-01-28 07:21:41

Question: I am trying to plot a decision region (based on the output of a logistic regression) with matplotlib's contourf function. The code I am using:

subplot.contourf(x2, y2, P, cmap=cmap_light, alpha=0.8)

where x2 and y2 are two 2D matrices generated via numpy meshgrids. P is computed using:

P = clf.predict(numpy.c_[x2.ravel(), y2.ravel()])
P = P.reshape(x2.shape)

Each element of P is a boolean value based on the output of the logistic regression. The rendered plot looks like this. My question is how
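A self-contained sketch of the pattern in the question, with synthetic data: contourf treats P as a height value at each mesh point and fills the level regions, so with 0/1 class predictions the filled areas are exactly the classifier's decision regions.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (hypothetical stand-in for the asker's data).
rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

# Evaluate the classifier at every point of a dense grid.
x2, y2 = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
P = clf.predict(np.c_[x2.ravel(), y2.ravel()]).reshape(x2.shape)

fig, ax = plt.subplots()
ax.contourf(x2, y2, P, alpha=0.8)  # one filled color per predicted class
```

Because logistic regression is linear in the features, the boundary between the two filled regions is a straight line.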

How to define specificity as a callable scorer for model evaluation

Submitted by 对着背影说爱祢 on 2021-01-28 07:05:47

Question: I am using this code to compare the performance of a number of models:

from sklearn import model_selection

X = input data
Y = binary labels

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))

results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=7)
    cv_results = model
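A sketch of one way to answer the title question: specificity is the recall of the negative class, so make_scorer(recall_score, pos_label=0) produces a callable scorer that can replace the 'accuracy' string in the loop above. (Synthetic data stands in for the asker's X and Y; note that recent sklearn versions require shuffle=True when KFold is given a random_state.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import KFold, cross_val_score

# Specificity = true negative rate = recall computed on the negative class.
specificity = make_scorer(recall_score, pos_label=0)

X, Y = make_classification(n_samples=100, random_state=7)  # hypothetical data
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

cv_results = cross_val_score(LogisticRegression(max_iter=1000), X, Y,
                             cv=kfold, scoring=specificity)
print(f"specificity: {cv_results.mean():.3f} ({cv_results.std():.3f})")
```

The same scorer object can be passed for every model in the comparison loop, so all models are ranked by specificity.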

Pandas column of lists to separate columns

Submitted by 安稳与你 on 2021-01-28 07:03:26

Question:

Problem: Incoming data is a list of 0+ categories:

#input data frame
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})

  categories
0  [A, B, C]
1     [B, C]
2        [A]

I would like to convert this to a DataFrame with one column per category and a 0/1 in each cell:

#desired output
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0

Attempt: OneHotEncoder and LabelEncoder get stuck because they don't handle lists in cells. The desired result is currently achieved with nested for loops:

#get unique
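A loop-free sketch: sklearn's MultiLabelBinarizer is designed for exactly this list-of-labels-per-row shape and produces the desired 0/1 indicator matrix in one call.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'categories': (list('ABC'), list('BC'), list('A'))})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['categories']),
                       columns=mlb.classes_,   # one column per unique category
                       index=df.index)
print(encoded)
```

This reproduces the desired output table; pandas' own str.join('|').str.get_dummies() idiom is an alternative that stays within pandas.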