scikit-learn

Choosing an sklearn pipeline for classifying user text data

戏子无情 submitted on 2021-02-19 08:15:52
Question: I'm working on a machine learning application in Python (using the sklearn module) and am currently trying to decide on a model for performing inference. A brief description of the problem: given many instances of user data, I'm trying to classify them into various categories based on relative keyword containment. The problem is supervised, so I have many instances of data that are already categorized. (Each piece of data is between 2 and 12 or so words.) I am currently trying to
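
As a rough illustration of the kind of pipeline such short, keyword-driven texts are often fed into, here is a minimal sketch. The toy texts, the labels, and the choice of TfidfVectorizer with LogisticRegression are assumptions made for the example, not something the question specifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: short, pre-classified keyword strings (2 to 12 words each).
texts = [
    "cheap flights to paris",
    "hotel booking rome",
    "flight deals london",
    "python list comprehension",
    "java null pointer exception",
    "rust borrow checker",
]
labels = ["travel", "travel", "travel", "code", "code", "code"]

# Unigrams and bigrams give the linear model some keyword context.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

prediction = model.predict(["python exception handling"])[0]
print(prediction)
```

Keeping the vectorizer and the classifier in one pipeline means the same vocabulary is applied at training and at inference time, and either step can be swapped out during model selection.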

Scikit-learn machine learning models training using multiple CPUs

允我心安 submitted on 2021-02-19 06:50:13
Question: I want to decrease the training time of my models by using a high-end EC2 instance. I tried a c5.18xlarge instance with 2 CPUs and ran a few models with the parameter n_jobs=-1, but I noticed that only one CPU was utilized. Can I somehow make scikit-learn use all CPUs?
Answer 1: Try adding:
    import multiprocessing
    multiprocessing.set_start_method('forkserver')
at the top of your code, before running or importing anything. That's a well-known issue with multiprocessing in Python.
Source: https:/
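
The suggestion in Answer 1 can be sketched as a complete script. The dataset, the RandomForestClassifier, and the train_forest helper are hypothetical stand-ins for the asker's models. Note that 'forkserver' is not available on every platform, so the sketch checks for it first, and whether it changes CPU utilization depends on the estimator and the parallel backend in use.

```python
import multiprocessing
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def train_forest(n_jobs=-1):
    """Fit a small forest on synthetic data (hypothetical stand-in model)."""
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    model = RandomForestClassifier(n_estimators=50, n_jobs=n_jobs, random_state=0)
    return model.fit(X, y)


if __name__ == "__main__":
    # Set the start method once, before any worker processes are created.
    if "forkserver" in multiprocessing.get_all_start_methods():
        try:
            multiprocessing.set_start_method("forkserver")
        except RuntimeError:
            # The start method was already fixed earlier in this process.
            pass

    print("logical CPUs visible to Python:", os.cpu_count())
    fitted = train_forest(n_jobs=-1)
    print("trained trees:", len(fitted.estimators_))
```

Comparing os.cpu_count() with the number of cores that are actually busy during fit is a quick way to check whether the setting made a difference.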

Scikit-learn's Kernel PCA: How to implement an anisotropic Gaussian kernel or any other custom kernels in KPCA?

点点圈 submitted on 2021-02-19 05:54:08
Question: I'm currently using scikit-learn's KPCA to perform dimensionality reduction on my dataset. It offers the isotropic Gaussian kernel (RBF kernel), which has only a single gamma value. Now I want to implement an anisotropic Gaussian kernel with several gamma values, one for each dimension. I'm aware that Kernel PCA has an option for a precomputed kernel, but I couldn't find any code example of it being used for dimensionality reduction. Does anyone know how to implement a
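
One way to use the precomputed option mentioned in the question is sketched below. The helper anisotropic_rbf and the gamma values are hypothetical. The kernel is assumed to be k(x, y) = exp(-sum_d gamma_d * (x_d - y_d)^2), which reduces to the built-in RBF kernel when all gammas are equal.

```python
import numpy as np
from sklearn.decomposition import KernelPCA


def anisotropic_rbf(A, B, gammas):
    """Gram matrix K[i, j] = exp(-sum_d gammas[d] * (A[i, d] - B[j, d])**2)."""
    scale = np.sqrt(np.asarray(gammas, dtype=float))
    A_s, B_s = A * scale, B * scale
    # Squared Euclidean distances between the rescaled rows of A and B.
    sq = (A_s**2).sum(axis=1)[:, None] + (B_s**2).sum(axis=1)[None, :] - 2.0 * A_s @ B_s.T
    return np.exp(-np.maximum(sq, 0.0))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
X_new = rng.normal(size=(5, 3))
gammas = [0.5, 0.1, 2.0]  # hypothetical: one gamma per input dimension

kpca = KernelPCA(n_components=2, kernel="precomputed")
# Fit on the (n_train, n_train) Gram matrix of the training data.
Z_train = kpca.fit_transform(anisotropic_rbf(X_train, X_train, gammas))
# New points need their kernel values against the training points: (n_new, n_train).
Z_new = kpca.transform(anisotropic_rbf(X_new, X_train, gammas))
print(Z_train.shape, Z_new.shape)
```

With kernel="precomputed", the estimator never sees the raw features, so the training data has to be kept around to compute the kernel for any new points.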

Scikit-learn RandomizedLasso and RandomizedLogisticRegression Deprecated

倖福魔咒の submitted on 2021-02-19 04:27:10
Question: I noticed that linear_model.RandomizedLasso and linear_model.RandomizedLogisticRegression, which implement stability selection for lasso regression, have been deprecated. Does anyone know why? Is stability selection not a sound method?
Answer 1: scikit-learn is developed as open source and to high standards, which means that most decisions are transparent. You can check out the repo on GitHub, and with some searching you will find: the discussion which led to the deprecation; a newer discussion
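
For readers unfamiliar with the idea, the core of stability selection can be sketched by hand with the regular Lasso estimator. This is a simplified illustration and not a reproduction of the deprecated RandomizedLasso. The function name selection_frequencies, the synthetic data, and all parameter values are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso


def selection_frequencies(X, y, alpha=0.1, n_rounds=50, sample_fraction=0.5, seed=0):
    """Fraction of random subsamples in which each feature gets a nonzero Lasso coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Draw a subsample without replacement and refit the Lasso on it.
        idx = rng.choice(n_samples, size=subsample_size, replace=False)
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_rounds


# Hypothetical data: only the first two of ten features influence the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

freq = selection_frequencies(X, y)
print(np.round(freq, 2))
```

Features that are selected in nearly every round are the stable ones. Where to put the cut-off frequency is a modelling choice that this sketch leaves open.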

CountVectorizer with Pandas dataframe

£可爱£侵袭症+ submitted on 2021-02-19 01:06:38
Question: I am using scikit-learn for text processing, but my CountVectorizer isn't giving the output I expect. My CSV file looks like:
    "Text";"label"
    "Here is sentence 1";"label1"
    "I am sentence two";"label2"
    ...
and so on. I want to use bag-of-words first in order to understand how SVM in Python works:
    import pandas as pd
    from sklearn import svm
    from sklearn.feature_extraction.text import CountVectorizer
    data = pd.read_csv(open('myfile.csv'),sep=';')
    target = data["label"]
    del data["label"]
    #
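
The rest of the asker's code is cut off, so the following is only a guess at a common cause: CountVectorizer expects an iterable of strings, and iterating over a DataFrame yields its column names rather than its rows. A minimal sketch, assuming the file has the layout shown above (the last two rows are made up for the example, and the CSV is read from a string instead of myfile.csv):

```python
import io

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# First two data rows are from the question; the other two are hypothetical.
csv_text = '''"Text";"label"
"Here is sentence 1";"label1"
"I am sentence two";"label2"
"Here is another sentence 1";"label1"
"I am yet another sentence two";"label2"
'''
data = pd.read_csv(io.StringIO(csv_text), sep=";")

# Passing the whole DataFrame makes the vectorizer iterate over column names,
# so it sees a single "document" (the string "Text").
wrong = CountVectorizer().fit_transform(data[["Text"]])

# Passing the text column (a Series of strings) gives one row per sentence.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data["Text"])

clf = svm.SVC(kernel="linear").fit(X, data["label"])
prediction = clf.predict(vectorizer.transform(["I am a new sentence two"]))[0]
print(wrong.shape, X.shape, prediction)
```

The same fitted vectorizer has to be reused with transform for new text so that the columns line up with what the SVM was trained on.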

How to implement a meta-estimator with the scikit-learn API?

送分小仙女□ submitted on 2021-02-19 00:24:18
Question: I would like to implement a simple wrapper / meta-estimator which is compatible with all of scikit-learn. It is hard to find a full description of what exactly I need. The goal is to have a regressor which also learns a threshold to become a classifier. So I came up with:
    from sklearn.base import BaseEstimator, ClassifierMixin, clone

    class Thresholder(BaseEstimator, ClassifierMixin):
        def __init__(self, regressor):
            self.regressor = regressor
            # threshold_ does not get initialized in __init__ ??
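
Since the asker's class is cut off, here is one possible way to complete it, offered as a sketch under stated assumptions: the problem is binary, the threshold is chosen to maximize training accuracy (a hypothetical rule, not the asker's), and learned attributes such as threshold_ are set in fit rather than in __init__, following the scikit-learn convention that trailing-underscore attributes are created during fitting. The mixin is listed before BaseEstimator, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted


class Thresholder(ClassifierMixin, BaseEstimator):
    """Wrap a regressor and learn a cut-off that turns its output into a binary label."""

    def __init__(self, regressor):
        # Only store constructor arguments here, unmodified.
        self.regressor = regressor

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        if len(self.classes_) != 2:
            raise ValueError("This sketch only handles binary targets.")
        is_positive = y == self.classes_[1]
        # Fit a fresh copy so the estimator passed to __init__ stays untouched.
        self.regressor_ = clone(self.regressor).fit(X, is_positive.astype(float))
        scores = self.regressor_.predict(X)
        # Assumed rule: try every observed score as a cut-off, keep the most accurate one.
        candidates = np.unique(scores)
        accuracies = [np.mean((scores >= t) == is_positive) for t in candidates]
        self.threshold_ = candidates[int(np.argmax(accuracies))]
        return self

    def predict(self, X):
        check_is_fitted(self, ["regressor_", "threshold_"])
        scores = self.regressor_.predict(X)
        return self.classes_[(scores >= self.threshold_).astype(int)]


X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
clf = Thresholder(LinearRegression()).fit(X_demo, y_demo)
print(clf.threshold_, clf.predict(X_demo))
```

Because the wrapped estimator is a plain constructor argument, get_params, set_params, and clone work without extra code, which is what tools such as GridSearchCV rely on.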

Spark.ml regressions do not calculate same models as scikit-learn

允我心安 submitted on 2021-02-18 22:09:54
Question: I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, and I can't figure out why (the data is the same, the model type is the same, the regularization is the same...). No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit-learn or spark.ml to find the same model as its counterpart? I give the sklearn code and spark.ml code below. Both should be ready to cut
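
The asker's code is not included in this excerpt, so the following only illustrates one candidate explanation: the two libraries may parameterize L2 regularization differently. scikit-learn's LogisticRegression weights the summed loss by C, so an objective of the form "mean loss + (lambda / 2) * ||w||^2" corresponds to C = 1 / (n_samples * lambda). That spark.ml's regParam plays the role of lambda in such an averaged objective is an assumption here and should be checked against the Spark documentation, as should its standardization parameter. The sketch verifies the correspondence on the scikit-learn side only, using synthetic data and a direct SciPy minimization.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic binary classification data.
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
logits = X @ np.array([1.5, -2.0, 0.5]) + 0.3
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

reg_param = 0.1  # assumed lambda of the averaged objective


def averaged_objective(params):
    """Mean logistic loss plus (lambda / 2) * ||w||^2; the intercept is not penalized."""
    w, b = params[:-1], params[-1]
    signs = 2 * y - 1
    loss = np.logaddexp(0.0, -signs * (X @ w + b)).mean()
    return loss + 0.5 * reg_param * (w @ w)


# Minimize the averaged objective directly.
direct = minimize(averaged_objective, np.zeros(4), method="BFGS").x

# scikit-learn with the converted regularization strength.
clf = LogisticRegression(C=1.0 / (n_samples * reg_param), tol=1e-10, max_iter=10_000)
clf.fit(X, y)
from_sklearn = np.append(clf.coef_.ravel(), clf.intercept_)

print(np.round(direct, 4))
print(np.round(from_sklearn, 4))
```

If the two printed vectors agree, a mismatch against another library is more likely to come from how that library scales the penalty or preprocesses the features than from the optimizer.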

Make better machine learning prediction thanks to negative feedback

冷暖自知 submitted on 2021-02-18 18:19:07
Question: I'm currently using the sklearn library in Python for supervised machine learning. I have a list of records like this: [x1, x2, x3] -> [y1], and I'm using the bag-of-words technique. It all works. Sometimes the user says that a prediction is not right, which gives something like a negative record: [x1, x2, x3] != [y1]. When this happens, I would like the same prediction not to appear the next time (or after many negative feedbacks).
Source: https://stackoverflow.com/questions/45545178/make
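
One simple way to get the behavior described in the question, without retraining, is to remember rejected (input, label) pairs and skip them at prediction time. The sketch below is a hypothetical design, not an established scikit-learn feature: the FeedbackFilter class, the toy data, and the choice of MultinomialNB are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


class FeedbackFilter:
    """Wrap a fitted probabilistic classifier and skip labels the user has rejected."""

    def __init__(self, model):
        self.model = model
        self.rejected = {}  # maps an input text to the set of labels rejected for it

    def reject(self, text, label):
        self.rejected.setdefault(text, set()).add(label)

    def predict(self, text):
        probs = self.model.predict_proba([text])[0]
        banned = self.rejected.get(text, set())
        # Walk the classes from most to least probable, skipping rejected ones.
        for idx in np.argsort(probs)[::-1]:
            label = self.model.classes_[idx]
            if label not in banned:
                return label
        return None  # every label has been rejected for this input


# Hypothetical bag-of-words training data.
texts = [
    "goal match team", "team wins match",
    "stock market falls", "market shares rise",
    "new phone released", "phone battery test",
]
labels = ["sports", "sports", "finance", "finance", "tech", "tech"]
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

predictor = FeedbackFilter(model)
query = "match market phone"
first = predictor.predict(query)
predictor.reject(query, first)
second = predictor.predict(query)
print(first, second)
```

This only affects inputs that were explicitly rejected. Making the feedback generalize to similar inputs would require feeding it back into training, which this sketch does not attempt.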