scikit-learn | 易学教程

Creating scorer for Brier Score Loss in scikit-learn

Question: I'm trying to use GridSearchCV and RandomizedSearchCV in scikit-learn (0.16.1) for logistic regression and random forest classifiers (and possibly others down the road) on binary classification problems. I got GridSearchCV to work with the standard LogisticRegression classifier, but I cannot get LogisticRegressionCV (or RandomizedSearchCV with the RandomForestClassifier) to work with a custom scoring function, specifically brier_score_loss. I have tried this code: lrcv =
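For context on the metric named in this question, here is a minimal pure-Python sketch of the Brier score for binary outcomes, together with a negated variant. The negation reflects the fact that scikit-learn's search utilities maximize scores, which is why `make_scorer` has a `greater_is_better` argument. The function names `brier_score` and `neg_brier_scorer` are illustrative and not part of scikit-learn, and the `make_scorer` arguments for requesting predicted probabilities differ between versions, so check the documentation for the installed release.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def neg_brier_scorer(y_true, y_prob):
    """Negate the loss so that a higher value means a better model."""
    return -brier_score(y_true, y_prob)


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]
score = brier_score(y_true, y_prob)
neg_score = neg_brier_scorer(y_true, y_prob)
```

A perfect probabilistic forecast gives a Brier score of 0, and a constant prediction of 0.5 gives 0.25, so lower is better for the loss and higher is better for the negated scorer.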

sklearn - How to retrieve PCA components and explained variance from inside a Pipeline passed to GridSearchCV

Question: I am using GridSearchCV with a pipeline as follows: grid = GridSearchCV( Pipeline([ ('reduce_dim', PCA()), ('classify', RandomForestClassifier(n_jobs = -1)) ]), param_grid=[ { 'reduce_dim__n_components': range(0.7,0.9,0.1), 'classify__n_estimators': range(10,50,5), 'classify__max_features': ['auto', 0.2], 'classify__min_samples_leaf': [40,50,60], 'classify__criterion': ['gini', 'entropy'] } ], cv=5, scoring='f1') grid.fit(X,y) How do I now retrieve PCA details like components and explained

How to implement n times repeated k-folds cross validation that yields n*k folds in sklearn?

Question: I am having trouble implementing a cross-validation setup that I saw in a paper. It is explained in this attached picture: So, it says that they use 5 folds, which means k = 5. But then the authors say that they repeat the cross-validation 20 times, which creates 100 folds in total. Does that mean that I can just use this piece of code: kfold = StratifiedKFold(n_splits=100, shuffle=True, random_state=seed) because my code also yields 100 folds? Any recommendation?
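A single 100-fold split is not the same procedure as 5 folds repeated 20 times. With 100 folds each test set holds about 1% of the data, while with repeated 5-fold each test set holds about 20% and the data is reshuffled before every repetition. scikit-learn provides `RepeatedKFold` and `RepeatedStratifiedKFold` in `sklearn.model_selection`, both of which take `n_splits` and `n_repeats`. The sketch below only illustrates the idea in plain Python: it is not stratified, and `repeated_kfold_indices` is a hypothetical helper name.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=20, seed=0):
    """Yield (train, test) index lists: n_repeats reshuffled rounds of n_splits folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        # Deal samples round-robin so fold sizes differ by at most one.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


# 5 folds x 20 repetitions on 50 hypothetical samples gives 100 splits.
splits = list(repeated_kfold_indices(n_samples=50, n_splits=5, n_repeats=20))
```

Each of the 100 splits here tests on 10 of the 50 samples, and within one repetition every sample appears in exactly one test fold.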

Behavior of C in LinearSVC sklearn (scikit-learn)

Question: First I create some toy data: n_samples=20 X=np.concatenate((np.random.normal(loc=2, scale=1.0, size=n_samples),np.random.normal(loc=20.0, scale=1.0, size=n_samples),[10])).reshape(-1,1) y=np.concatenate((np.repeat(0,n_samples),np.repeat(1,n_samples+1))) plt.scatter(X,y) Below is the graph visualizing the data: Then I train a model with LinearSVC: from sklearn.svm import LinearSVC svm_lin = LinearSVC(C=1) svm_lin.fit(X,y) My understanding of C is that: If C is very big, then misclassifications

Is it possible to add a covariate (control for a variable of no interest) to an SVM model?

Question: I'm very new to machine learning and Python, and I'm trying to build a model to predict patients (N=200) vs controls (N=200) from structural neuroimaging data. After the initial preprocessing, where I reshaped the neuroimaging data into a 2D array, I built the following model: from sklearn.svm import SVC svc = SVC(C=1.0, kernel='linear') from sklearn.grid_search import GridSearchCV from numpy import range k_range = np.arange(0.1,10,0.1) param_grid=dict(C=k_range) grid=GridSearchCV(svc, param_grid

tf-idf on a somewhat large (65k) amount of text files

Question: I want to try tf-idf with scikit-learn (or nltk, or I am open to other suggestions). The data I have is a relatively large number of discussion forum posts (~65k) that we have scraped and stored in MongoDB. Each post has a post title, date and time of the post, text of the post message (or a "re:" if it is a reply to an existing post), user name, message ID, and whether it is a child or parent post (in a thread, where you have the original post, and then replies to this OP, or nested replies, the tree). I figure
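As background for the question, the sketch below computes a basic tf-idf weighting in plain Python, using term frequency times log(N / document frequency) and whitespace tokenization. These are simplifying assumptions: scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will differ. The `tfidf` helper and the sample posts are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Three made-up forum posts standing in for the real collection.
posts = [
    "the model overfits",
    "the model converges",
    "gradient descent converges slowly",
]
weights = tfidf(posts)
```

Terms that occur in fewer documents receive higher weights, and a term that occurs in every document gets weight 0 under this unsmoothed formula.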

Scipy cosine similarity vs sklearn cosine similarity

Question: I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors: setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]" setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]" import1 = "from sklearn.metrics.pairwise import cosine_similarity" stmt1 = "[float(cosine
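When comparing the two libraries, note that they do not return the same quantity. `scipy.spatial.distance.cosine` takes two 1-D vectors and returns the cosine distance, which is 1 minus the similarity, while `sklearn.metrics.pairwise.cosine_similarity` takes 2-D arrays and returns a matrix of similarities. The plain-Python sketch below shows the relationship between the two for a single pair of vectors. The function names are illustrative and neither library is called.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Two vectors at a 45 degree angle.
sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
dist = cosine_distance([1.0, 0.0], [1.0, 1.0])
```

Any timing comparison between the libraries should account for this difference in input shape and output, since the per-call overhead of building 2-D arrays can dominate for small vectors.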

Pandas groupby in combination with sklearn preprocessing continued

Question: Continued from this post: Pandas groupby in combination with sklearn preprocessing. I need to preprocess grouped data by scaling it within groups defined by two columns, but I get an error with the second method: import pandas as pd import numpy as np from sklearn.preprocessing import robust_scale,minmax_scale df = pd.DataFrame( dict( id=list('AAAAABBBBB'), loc = (10,20,10,20,10,20,10,20,10,20), value=(0,10,10,20,100,100,200,30,40,100))) df['new'] = df.groupby(['id','loc']).value.transform(lambda x:minmax_scale

How to create custom eval metric for catboost?

Question: Similar SO questions: Python Catboost: Multiclass F1 score custom metric. Catboost tutorials: https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function Question: I have a binary classification problem. After modelling, we get the test model predictions y_pred, and we already have the true test labels y_true. I would like to get the custom evaluation metric defined by the following equation: profit = 400 * truePositive - 200*falseNegative - 100