scikit-learn

sklearn increasing number of jobs leads to slow training

Submitted by 雨燕双飞 on 2021-01-27 08:23:31

Question: I've been trying to get sklearn to use more CPU cores during grid search (doing this on a Windows machine). The code is:

```python
parameters = {'n_estimators': numpy.arange(1, 10), 'max_depth': numpy.arange(1, 10)}
estimator = RandomForestClassifier(verbose=1)
clf = grid_search.GridSearchCV(estimator, parameters, n_jobs=-1)
clf.fit(features_train, labels_train)
```

I'm testing this on a small dataset of only 100 samples. When n_jobs is set to 1 (the default), everything proceeds as normal and finishes quickly.
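Two things usually explain this. First, with only 100 samples, the cost of spawning worker processes and pickling data to them can dwarf the training itself, so n_jobs=-1 ends up slower than n_jobs=1. Second, on Windows joblib starts workers by spawning fresh interpreters, so parallel code must sit behind an `if __name__ == '__main__'` guard or the script re-executes itself on import. A minimal sketch, assuming a modern scikit-learn (the old grid_search module is now sklearn.model_selection) and stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification   # stand-in for the asker's data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def main():
    X, y = make_classification(n_samples=100, random_state=0)
    parameters = {'n_estimators': np.arange(1, 10),
                  'max_depth': np.arange(1, 10)}
    clf = GridSearchCV(RandomForestClassifier(verbose=1), parameters, n_jobs=-1)
    clf.fit(X, y)
    print(clf.best_params_)

# On Windows, joblib spawns worker processes that re-import this file,
# so the parallel call must be guarded:
if __name__ == '__main__':
    main()
```

For a dataset this small, expect n_jobs=1 to stay the fastest option; parallelism pays off once each fit takes seconds rather than milliseconds.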

ValueError: Axes instance argument was not found in a figure

Submitted by 梦想的初衷 on 2021-01-27 07:22:37

Question: I am studying scikit-learn with 'Learning scikit-learn: Machine Learning in Python' by Raúl Garreta. In a Jupyter notebook, the code from In[1] to In[7] works, but the In[8] code does not. What is wrong?

```python
# In[1]:
from sklearn import datasets
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
print X_iris.shape, y_iris.shape

# In[2]:
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y
```
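The In[8] cell is not quoted here, so the exact trigger is unknown, but this matplotlib error is raised when an Axes object is handed to pyplot (e.g. via plt.sca or plt.axes) and no open figure contains it — typically because the axes were created in an earlier cell whose figure has since been closed. A minimal sketch of the assumed failure mode and the usual fix; the exact exception text can vary across matplotlib versions:

```python
# Assumed reproduction: 'ax' outlives its figure, then pyplot is asked
# to make it current. On matplotlib versions contemporary with the book
# this raises "ValueError: Axes instance argument was not found in a
# figure"; newer releases may word the error differently.
import matplotlib
matplotlib.use('Agg')            # headless backend so the demo runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
plt.close(fig)                   # the figure is gone, but 'ax' survives
try:
    plt.sca(ax)                  # pyplot cannot find ax in any open figure
except ValueError as err:
    print(err)

# The usual fix: create the figure and its axes together in the same
# cell and draw on them directly, instead of reusing axes across cells.
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for ax in axes:
    ax.scatter([1, 2, 3], [3, 1, 2])
```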

Finding a corresponding leaf node for each data point in a decision tree (scikit-learn)

Submitted by 拜拜、爱过 on 2021-01-27 07:22:19

Question: I'm using the decision tree classifier from the scikit-learn package in Python 3.4, and I want to get the corresponding leaf node id for each of my input data points. For example, my input might look like this:

```python
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2]])
```

Let's suppose the corresponding leaf nodes are 16, 5 and 45 respectively. I want my output to be:

```python
leaf_node_id = array([16, 5, 45])
```

I have read through the scikit-learn mailing list and related questions on SF
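This is what the estimator's `apply` method does: it returns the index of the leaf each sample lands in (a public method since scikit-learn 0.17; on older releases the equivalent is `clf.tree_.apply` on a float32 array). A short sketch — the iris data here is illustrative, not the asker's setup:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# One leaf node id per input row; the concrete ids depend on the fitted tree.
leaf_node_id = clf.apply(iris.data[:3])
print(leaf_node_id)
```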

How to compare if two sklearn estimators are equal?

Submitted by 社会主义新天地 on 2021-01-27 06:35:31

Question: I have two sklearn estimators and want to compare them:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X, y = np.random.random((100, 2)), np.random.choice(2, 100)
dt1 = DecisionTreeClassifier()
dt1.fit(X, y)
dt2 = DecisionTreeClassifier()
dt3 = sklearn.base.copy.deepcopy(dt1)
```

How can I compare the classifiers so that dt1 != dt2 and dt1 == dt3?

Answer 1: You will want to compare the params assigned to the classifier instance and the .tree_.value of the trained classifiers:

```python
# the trees
```
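A sketch that completes the answer's idea under stated assumptions — the helper name eq_trees and the exact set of tree_ attributes compared are mine, not from the original answer:

```python
import numpy as np
from copy import deepcopy
from sklearn.tree import DecisionTreeClassifier

def eq_trees(a, b):
    """Equal hyper-parameters and, if fitted, identical learned trees."""
    if a.get_params() != b.get_params():
        return False
    a_fitted, b_fitted = hasattr(a, 'tree_'), hasattr(b, 'tree_')
    if a_fitted != b_fitted:        # one fitted, one not
        return False
    if not a_fitted:                # both unfitted: params already matched
        return True
    # Compare the learned structure: split features, thresholds, leaf values.
    return (np.array_equal(a.tree_.value, b.tree_.value)
            and np.array_equal(a.tree_.feature, b.tree_.feature)
            and np.array_equal(a.tree_.threshold, b.tree_.threshold))

X, y = np.random.random((100, 2)), np.random.choice(2, 100)
dt1 = DecisionTreeClassifier().fit(X, y)
dt2 = DecisionTreeClassifier()               # unfitted
dt3 = deepcopy(dt1)                          # exact copy of the fitted tree
print(eq_trees(dt1, dt2), eq_trees(dt1, dt3))   # False True
```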

Scikit Learn HMM training with set of observation sequences

Submitted by 强颜欢笑 on 2021-01-27 06:13:46

Question: I have a question about how I can use GaussianHMM in the scikit-learn package to train on several different observation sequences all at once. The example visualizing the stock market structure shows EM converging on one long observation sequence. But in many scenarios we want to break up the observations (like training on a set of sentences), with each observation sequence having a START and an END state. That is, I would like to train globally on multiple observation sequences. How can one
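The HMM code has since been split out of scikit-learn into the separate hmmlearn package, which supports this directly: concatenate all sequences into one array and pass a `lengths` argument to `fit`, marking where each sequence ends, so EM runs over the whole set jointly. A minimal sketch with toy data:

```python
import numpy as np
from hmmlearn import hmm

seq1 = np.random.randn(100, 2)   # toy 2-D observation sequences,
seq2 = np.random.randn(80, 2)    # stand-ins for real data
seq3 = np.random.randn(120, 2)

# hmmlearn expects all sequences stacked, plus the length of each one.
X = np.concatenate([seq1, seq2, seq3])
lengths = [len(seq1), len(seq2), len(seq3)]

model = hmm.GaussianHMM(n_components=4, n_iter=50)
model.fit(X, lengths)            # trains on all three sequences at once
```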

Pandas 'Passing list-likes to .loc or [] with any missing labels is no longer supported' on train_test_split returned data

Submitted by 只愿长相守 on 2021-01-27 02:26:15

Question: For some reason train_test_split triggers this error, despite the lengths being identical and the indexes looking the same.

```python
from sklearn.model_selection import KFold

data = {'col1': [30.5, 45, 1, 99, 6, 5, 4, 2, 5, 7, 7, 3],
        'col2': [99.5, 98, 95, 90, 1, 5, 6, 7, 4, 4, 3, 3],
        'col3': [23, 23.6, 3, 90, 1, 9, 60, 9, 7, 2, 2, 1]}
df = pd.DataFrame(data)
train, test = train_test_split(df, test_size=0.10)
X = train[['col1', 'col2']]
y2 = train['col3']
X = np.array(X)
kf = KFold(n_splits=3, shuffle=True)
for train_index, test_index in kf
```
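The loop body is cut off above, so the following is an assumption about the failure: train_test_split keeps the original DataFrame index on `train`, while KFold yields positional indices 0..n-1, so indexing y2 with `.loc[test_index]` asks for labels that no longer exist after the shuffle. A sketch of the two usual fixes, on stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

df = pd.DataFrame({'col1': range(12), 'col2': range(12), 'col3': range(12)})
train, test = train_test_split(df, test_size=0.10, random_state=0)

X = train[['col1', 'col2']].to_numpy()
y2 = train['col3'].reset_index(drop=True)   # fix 1: relabel rows 0..n-1

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    # fix 2: use positional indexing, which matches what KFold yields
    y_tr, y_te = y2.iloc[train_index], y2.iloc[test_index]
```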

Consistent ColumnTransformer for intersecting lists of columns

Submitted by ε祈祈猫儿з on 2021-01-24 08:17:31

Question: I want to use sklearn.compose.ColumnTransformer sequentially (not in parallel; the second transformer should be executed only after the first) for intersecting lists of columns, in this way:

```python
log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.NaN, 4],
                   'b': [1, np.NaN, 3, 4],
                   'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1, transformers=[
    ('num', impute.SimpleImputer(), ['a', 'b']),
    ('log', log_transformer, ['b', 'c']),
    ('scale', p
```
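A single ColumnTransformer always applies its transformers side by side to the *input*, so a column like 'b' cannot be imputed and then log-transformed within one. A hedged sketch of one way to get sequential behavior: chain two ColumnTransformers in a Pipeline. After the first stage the data is a plain numpy array, so the second stage selects columns by position; the resulting column order assumed below follows ColumnTransformer's rule of transformed columns first, passthrough remainder last:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log)
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [1, np.nan, 3, 4],
                   'c': [1, 2, 3, 4]})

# Stage 1 imputes ['a', 'b'] and passes 'c' through -> columns [a, b, c].
stage1 = ColumnTransformer([('num', SimpleImputer(), ['a', 'b'])],
                           remainder='passthrough')
# Stage 2 sees a bare array, so 'b' and 'c' are positions 1 and 2;
# output order becomes [log(b), log(c), a].
stage2 = ColumnTransformer([('log', log_transformer, [1, 2])],
                           remainder='passthrough')

pipe = Pipeline([('impute', stage1), ('log', stage2)])
print(pipe.fit_transform(df))
```

The cost of this design is that column names are lost between stages; tracking positions by hand (or using get_feature_names_out on newer scikit-learn versions) is what keeps the intersecting lists consistent.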