scikit-learn

Creating scorer for Brier Score Loss in scikit-learn

限于喜欢 submitted on 2021-02-08 08:13:25
Question: I'm trying to use GridSearchCV and RandomizedSearchCV in scikit-learn (0.16.1) with logistic regression and random forest classifiers (and possibly others down the road) for binary classification problems. I managed to get GridSearchCV to work with the standard LogisticRegression classifier, but I cannot get LogisticRegressionCV (or RandomizedSearchCV with the RandomForestClassifier) to work with a customized scoring function, specifically brier_score_loss. I have tried this code: lrcv =
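
One possible way to score a search with the Brier loss is a plain callable with the `(estimator, X, y)` signature, which scikit-learn accepts as a `scoring` argument. The sketch below is only an illustration under assumptions: it uses a recent scikit-learn with `sklearn.model_selection` (not the 0.16.1 mentioned in the question), synthetic data, and a hypothetical helper name `neg_brier_scorer`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV


def neg_brier_scorer(estimator, X, y):
    """Hypothetical scorer: returns the negated Brier loss (higher is better)."""
    # Probability of the positive class (second column for labels {0, 1}).
    proba = estimator.predict_proba(X)[:, 1]
    # The search maximizes its score, so the loss is negated.
    return -brier_score_loss(y, proba)


# Synthetic binary data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=neg_brier_scorer,
    cv=5,
)
grid.fit(X, y)

# Flip the sign back to read the result as an ordinary Brier loss.
best_brier = -grid.best_score_
```

The same callable should also be usable as the `scoring` argument of `LogisticRegressionCV` or `RandomizedSearchCV`, since those accept scorer callables as well.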

sklearn - How to retrieve PCA components and explained variance from inside a Pipeline passed to GridSearchCV

我们两清 submitted on 2021-02-08 06:51:35
Question: I am using GridSearchCV with a pipeline as follows: grid = GridSearchCV( Pipeline([ ('reduce_dim', PCA()), ('classify', RandomForestClassifier(n_jobs = -1)) ]), param_grid=[ { 'reduce_dim__n_components': [0.7, 0.8], 'classify__n_estimators': range(10,50,5), 'classify__max_features': ['auto', 0.2], 'classify__min_samples_leaf': [40,50,60], 'classify__criterion': ['gini', 'entropy'] } ], cv=5, scoring='f1') grid.fit(X,y) How do I now retrieve PCA details like components and explained
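
A minimal sketch of one way to reach the fitted PCA step after the search: with the default `refit=True`, the refitted pipeline is available as `best_estimator_`, and its steps can be looked up by name through `named_steps`. The data and the reduced parameter grid here are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary data, for illustration only.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

pipe = Pipeline([
    ("reduce_dim", PCA()),
    ("classify", RandomForestClassifier(n_estimators=10, random_state=0)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"reduce_dim__n_components": [2, 3]},
    cv=3,
    scoring="f1",
)
grid.fit(X, y)

# The best pipeline is refitted on all data; fetch its PCA step by name.
pca = grid.best_estimator_.named_steps["reduce_dim"]
components = pca.components_                 # shape: (n_components, n_features)
explained = pca.explained_variance_ratio_    # one ratio per kept component
```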

How to implement n times repeated k-folds cross validation that yields n*k folds in sklearn?

只愿长相守 submitted on 2021-02-08 06:23:11
Question: I'm having trouble implementing a cross-validation setting that I saw in a paper. Basically, it is explained in this attached picture: So, it says that they use 5 folds, which means k = 5. But then the authors say that they repeat the cross-validation 20 times, which creates 100 folds in total. Does that mean that I can just use this piece of code: kfold = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed) Because basically my code also yields 100 folds. Any recommendation?
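
The two setups are not equivalent, which a short comparison may make clearer. The sketch below assumes a scikit-learn version that provides `RepeatedStratifiedKFold` and uses placeholder data of 200 balanced samples; `seed` is an arbitrary value.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# Placeholder data: 200 samples, two balanced classes.
X = np.zeros((200, 3))
y = np.array([0, 1] * 100)
seed = 42

# 5 folds repeated 20 times with different shuffles: 100 train/test splits,
# each test fold holding 1/5 of the data.
repeated = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=seed)
repeated_test_sizes = [len(test) for _, test in repeated.split(X, y)]

# A single 100-fold split: also 100 splits, but each test fold holds 1/100.
single = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed)
single_test_sizes = [len(test) for _, test in single.split(X, y)]
```

Both generators yield 100 splits, but the test folds differ in size (40 samples versus 2 here), so `n_splits=100` does not reproduce a 20-times-repeated 5-fold scheme.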

Behavior of C in LinearSVC sklearn (scikit-learn)

时光毁灭记忆、已成空白 submitted on 2021-02-08 06:01:14
Question: First I create some toy data: n_samples=20 X=np.concatenate((np.random.normal(loc=2, scale=1.0, size=n_samples),np.random.normal(loc=20.0, scale=1.0, size=n_samples),[10])).reshape(-1,1) y=np.concatenate((np.repeat(0,n_samples),np.repeat(1,n_samples+1))) plt.scatter(X,y) Below is the graph visualizing the data: Then I train a model with LinearSVC: from sklearn.svm import LinearSVC svm_lin = LinearSVC(C=1) svm_lin.fit(X,y) My understanding of C is that: If C is very big, then misclassifications
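
To make the role of C concrete, here is a small sketch of the textbook soft-margin objective, 0.5 * w^2 + C * (sum of hinge losses), evaluated for a fixed 1-D boundary. This is an assumption-laden illustration, not what `LinearSVC` computes internally: by default `LinearSVC` uses a squared hinge loss and also penalizes the intercept. The points and the boundary `(w, b)` are hypothetical, loosely mirroring the toy data (class 0 near 2, class 1 near 20, one class-1 point at 10).

```python
def soft_margin_objective(w, b, xs, ys, C):
    """Textbook soft-margin objective for 1-D inputs and labels in {-1, +1}."""
    # Hinge loss is zero for points beyond the margin, positive otherwise.
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys))
    # C trades off margin width (small w) against margin violations.
    return 0.5 * w * w + C * hinge


# Hypothetical points: one per cluster plus the isolated class-1 point at 10.
xs = [2.0, 20.0, 10.0]
ys = [-1, 1, 1]

# Hypothetical boundary at x = 12, which leaves the point at 10 misclassified.
w, b = 0.2, -2.4

objective_small_c = soft_margin_objective(w, b, xs, ys, C=1.0)
objective_large_c = soft_margin_objective(w, b, xs, ys, C=100.0)
```

For the same boundary, the penalty for the violating point scales linearly with C, so a large C pushes the optimizer toward boundaries that classify that point correctly even at the cost of a narrower margin.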

Is it possible to add a covariate (control for a variable of no interest) to an SVM model?

不问归期 submitted on 2021-02-08 05:48:24
Question: I'm very new to machine learning and Python, and I'm trying to build a model to predict patients (N=200) vs. controls (N=200) from structural neuroimaging data. After the initial preprocessing, where I reshaped the neuroimaging data into a 2D array, I built the following model: from sklearn.svm import SVC svc = SVC(C=1.0, kernel='linear') from sklearn.model_selection import GridSearchCV import numpy as np k_range = np.arange(0.1,10,0.1) param_grid=dict(C=k_range) grid=GridSearchCV(svc, param_grid
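
One commonly used option for the title's question, offered here only as a sketch and not as the approach from this thread, is to regress the covariate out of every feature before fitting the SVM. The helper names, the `age` covariate, and the synthetic data are all hypothetical; in a cross-validated setting the regression would be fitted on the training fold only and then applied to the test fold.

```python
import numpy as np


def fit_confound_model(X_train, covariate_train):
    """Least-squares fit of each feature on [intercept, covariate]."""
    design = np.column_stack([np.ones(len(covariate_train)), covariate_train])
    beta, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return beta  # shape: (2, n_features)


def remove_confound(X, covariate, beta):
    """Subtract the part of X that the covariate predicts linearly."""
    design = np.column_stack([np.ones(len(covariate)), covariate])
    return X - design @ beta


# Synthetic example: four features that all drift with a hypothetical 'age'.
rng = np.random.default_rng(0)
age = rng.normal(50, 10, size=100)
X_train = rng.normal(size=(100, 4)) + 0.5 * age[:, None]

beta = fit_confound_model(X_train, age)
X_clean = remove_confound(X_train, age, beta)
```

The residualized `X_clean` is linearly uncorrelated with the covariate on the training data and could then be passed to the classifier in place of the raw features.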

tf-idf on a somewhat large (65k) amount of text files

十年热恋 submitted on 2021-02-08 04:45:37
Question: I want to try tf-idf with scikit-learn (or nltk, or I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in a MongoDB. Each post has a post title, date and time of post, text of the post message (or a re: if a reply to an existing post), user name, message ID and whether it is a child or parent post (in a thread, where you have the original post, and then replies to this OP, or nested replies, the tree). I figure
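
As a rough sketch of the scikit-learn route, `TfidfVectorizer` can consume any iterable of strings, so posts can be streamed one at a time instead of being loaded into a list first. The generator `iter_posts` below is a hypothetical stand-in for a database cursor and yields made-up texts; nothing here reflects the asker's actual schema.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def iter_posts():
    """Hypothetical stand-in for a MongoDB cursor: yields one post text at a time."""
    yield "how do I tune a random forest"
    yield "re: how do I tune a random forest"
    yield "tf-idf weights rare words more heavily"


vectorizer = TfidfVectorizer(stop_words="english")

# The result is a sparse matrix with one row per post and one column per term.
tfidf = vectorizer.fit_transform(iter_posts())
n_posts, n_terms = tfidf.shape
```

Because the output is sparse, a corpus of tens of thousands of short posts usually stays manageable in memory, although that depends on vocabulary size.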

Scipy cosine similarity vs sklearn cosine similarity

ⅰ亾dé卋堺 submitted on 2021-02-08 04:30:28
Question: I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors: setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]" setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]" import1 = "from sklearn.metrics.pairwise import cosine_similarity" stmt1 = "[float(cosine
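
Before timing the two, it may help to note how their interfaces differ: `scipy.spatial.distance.cosine` takes two 1-D vectors and returns a distance (1 minus the similarity), while `sklearn.metrics.pairwise.cosine_similarity` takes 2-D arrays and returns a matrix of similarities. A minimal sketch on random vectors (the data is an assumption):

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
a = rng.random(400)
b = rng.random(400)

# scipy: 1-D inputs, returns a distance, so convert it to a similarity.
scipy_similarity = 1.0 - cosine(a, b)

# sklearn: 2-D inputs (one row per vector), returns a similarity matrix.
sklearn_similarity = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]
```

The sklearn function is designed for many-to-many comparisons, so calling it once on stacked arrays is typically a fairer use than calling it pair by pair in a loop.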

Pandas groupby in combination with sklearn preprocessing continued

走远了吗. submitted on 2021-02-07 20:24:11
Question: Continuing from this post: Pandas groupby in combination with sklearn preprocessing. I need to preprocess the data by scaling it within groups defined by two columns, but I get an error with the second method: import pandas as pd import numpy as np from sklearn.preprocessing import robust_scale,minmax_scale df = pd.DataFrame( dict( id=list('AAAAABBBBB'), loc = (10,20,10,20,10,20,10,20,10,20), value=(0,10,10,20,100,100,200,30,40,100))) df['new'] = df.groupby(['id','loc']).value.transform(lambda x:minmax_scale
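
For reference, a self-contained sketch of per-group min-max scaling over two grouping columns, using the same toy frame. The body of the lambda is an assumption, since the original snippet is cut off; here each group's values are passed to `minmax_scale` as a float NumPy array.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

df = pd.DataFrame({
    "id": list("AAAAABBBBB"),
    "loc": [10, 20, 10, 20, 10, 20, 10, 20, 10, 20],
    "value": [0, 10, 10, 20, 100, 100, 200, 30, 40, 100],
})

# Scale 'value' to [0, 1] separately within each (id, loc) group.
# transform() keeps the original row order and index.
df["scaled"] = df.groupby(["id", "loc"])["value"].transform(
    lambda x: minmax_scale(x.to_numpy(dtype=float))
)
```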

How to create custom eval metric for catboost?

时间秒杀一切 submitted on 2021-02-07 18:39:14
Question:
Similar SO questions: Python Catboost: Multiclass F1 score custom metric
Catboost tutorials: https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function
In this question, I have a binary classification problem. After modelling, we get the test-set predictions y_pred, and we already have the true test labels y_true. I would like to compute a custom evaluation metric defined by the following equation: profit = 400 * truePositive - 200 * falseNegative - 100
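
A profit of this kind is a weighted sum of confusion-matrix counts, which can be sketched in plain Python before wiring it into any library. The function names are hypothetical, and only the two coefficients that are fully visible in the excerpt (400 per true positive, -200 per false negative) are used; the remaining terms are cut off in the source and are left out rather than guessed. This sketch is not a CatBoost metric class.

```python
def confusion_counts(y_true, y_pred):
    """Count tp, fp, fn, tn for binary labels encoded as 0/1."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 1 and pred == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def weighted_profit(y_true, y_pred, weights):
    """Sum of count * weight over confusion cells; missing cells weigh 0."""
    counts = confusion_counts(y_true, y_pred)
    return sum(weights.get(cell, 0) * n for cell, n in counts.items())


# Only the coefficients visible in the excerpt; other terms are unknown.
known_weights = {"tp": 400, "fn": -200}

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
profit = weighted_profit(y_true, y_pred, known_weights)
```

Further terms can be added to `known_weights` once the full formula is available, without changing the functions.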