scikit-learn | 易学教程

A Walkthrough of Python's Little-Known "Retry Mechanism"

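To guard against functional problems caused by the network or other factors outside our control, we usually add retry logic to our code. A typical case is sending a request over an unstable network, where requests often time out. Retry logic is not hard to implement, but writing it so that it is elegant and easy to use takes some thought.

This post introduces a third-party library, Tenacity. (The "retry mechanism" in the title is not quite accurate: Tenacity is not a built-in Python module, so it cannot really be called a mechanism.) It covers almost every retry scenario we are likely to need, for example: Under what conditions should a retry happen? How many times should we retry? After how long should retrying stop? How long should we wait between retries? What callback runs after the retries fail?

Before using it, install it:

    $ pip install tenacity

1. The most basic retry

Unconditional retry, with no interval between retries:

    from tenacity import retry

    @retry
    def test_retry():
        print("Waiting to retry, retrying with no interval...")
        raise Exception

    test_retry()

Unconditional retry, but waiting 2 seconds before each retry:

    from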
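As a point of comparison, the retry questions listed above (how many attempts, how long to wait between them) can be sketched with a small hand-rolled decorator that uses only the standard library. This is not how Tenacity is implemented; `simple_retry`, `max_attempts`, `delay_seconds`, and `flaky_request` are hypothetical names chosen for illustration.

```python
import functools
import time


def simple_retry(max_attempts=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Hypothetical helper: retry the wrapped function a fixed number of times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_seconds)  # fixed wait between attempts
        return wrapper
    return decorator


calls = {"count": 0}


@simple_retry(max_attempts=5, delay_seconds=0.01)
def flaky_request():
    # Simulated request that fails twice and then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


result = flaky_request()
print(result, "after", calls["count"], "calls")
```

A library such as Tenacity packages these same decisions (stop condition, wait strategy, which errors to retry on) as reusable options, which is the convenience the post is pointing at.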

understanding max_feature in random forest

Question: I have a question about training a random forest. I used 5-fold cross-validation with RMSE as the criterion to find the best parameters for the model. I eventually found that max_features=1 gave the smallest RMSE. That seems strange to me, since max_features is the number of features considered at each split. Generally, if I want the "best" split, the one that lowers impurity the most, the tree should ideally consider all the features and pick the one that results in the lowest impurity after splitting.
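One way to check an observation like this is to score several `max_features` values under the same 5-fold cross-validation. The sketch below assumes a regression task (since the question uses RMSE) and uses synthetic data from `make_regression`, so it will not reproduce the asker's numbers; the grid values and forest size are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the asker's data (assumption).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=10.0, random_state=0
)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_features": [1, 2, 4, 8]},
    cv=5,
    scoring="neg_root_mean_squared_error",  # scores are negated RMSE
)
search.fit(X, y)

# Map each max_features value to its mean cross-validated RMSE.
rmse_by_setting = {
    params["max_features"]: -score
    for params, score in zip(
        search.cv_results_["params"], search.cv_results_["mean_test_score"]
    )
}
for value, rmse in sorted(rmse_by_setting.items()):
    print(f"max_features={value}: RMSE={rmse:.2f}")
print("best:", search.best_params_)
```

Considering fewer features per split makes the individual trees less correlated with each other, which can lower the error of the averaged ensemble even though each split is less greedy. Whether a small value wins depends on the data.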

Comparing the GLMNET output of R with Python using LogisticRegression()

Question: I am using logistic regression with the L1 norm (LASSO). I have opted to use the glmnet package in R and LogisticRegression() from sklearn.linear_model in Python. From my understanding these should give the same results; however, they do not. Note that I did not scale my data. For Python I used the link below as a reference: https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/ and for R I used the link below: http://www
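Two differences commonly explain a mismatch like this, and the sketch below illustrates them on the scikit-learn side only. First, glmnet standardizes predictors by default, while `LogisticRegression` does not. Second, scikit-learn's `C` multiplies the summed loss, whereas glmnet's `lambda` multiplies the penalty on a loss averaged over samples, which suggests the rough mapping C = 1 / (n_samples * lambda). The data is synthetic, `l1_logistic` is a hypothetical helper, and exact agreement with glmnet is not guaranteed because solvers, tolerances, and intercept handling differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)
X_std = StandardScaler().fit_transform(X)  # scale explicitly before fitting


def l1_logistic(lam):
    """Fit L1-penalized logistic regression for a glmnet-style lambda."""
    n_samples = X_std.shape[0]
    model = LogisticRegression(
        penalty="l1",
        solver="saga",               # a solver that supports the L1 penalty
        C=1.0 / (n_samples * lam),   # rough lambda-to-C mapping
        max_iter=5000,
    )
    return model.fit(X_std, y)


weak = l1_logistic(0.001)
strong = l1_logistic(0.1)

n_zero_weak = int(np.sum(weak.coef_ == 0))
n_zero_strong = int(np.sum(strong.coef_ == 0))
print("zero coefficients:", n_zero_weak, "(weak) vs", n_zero_strong, "(strong)")
```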

sklearn precision_recall_curve and threshold

Question: I was wondering how sklearn decides how many thresholds to use in precision_recall_curve. There is another post on this here: How does sklearn select threshold steps in precision recall curve?. It mentions the source code, where I found this example:

    import numpy as np
    from sklearn.metrics import precision_recall_curve
    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

which then gives

    >>>precision
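The thresholds are not a fixed grid: they are taken from the distinct values in `y_scores`. The sketch below reuses the arrays from the question and checks only properties that the documentation states, because the exact number of thresholds returned for this input may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Every threshold is one of the observed scores, in increasing order.
thresholds_are_scores = all(t in set(y_scores.tolist()) for t in thresholds.tolist())
thresholds_increasing = bool(np.all(np.diff(thresholds) > 0))

print("thresholds:", thresholds)
print("precision has", len(precision), "points; thresholds has", len(thresholds))
```

`precision` and `recall` each have one more element than `thresholds`, because a final point (precision 1, recall 0) is appended without a corresponding threshold.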

ERROR: Create Version failed. Bad model detected with error: "Failed to load model: Could not load the model

Question:

    clf = svm.SVC()
    # Fit on the training data
    clf.fit(X_train, y_train)
    joblib.dump(clf, 'model.joblib')

    GCP_PROJECT = 'career-banao-project'
    BUCKET_NAME = "career_banao_bucket"
    MODEL_BUCKET = 'gs://career_banao_bucket'
    VERSION_NAME = 'v1'
    MODEL_NAME = 'career_banao_model'

    !gsutil mb $MODEL_BUCKET
    !gsutil cp ./model.joblib $MODEL_BUCKET
    !gcloud ai-platform models create $MODEL_NAME
    !gcloud ai-platform versions create $VERSION_NAME \
        --model=$MODEL_NAME \
        --framework='scikit-learn' \
        --runtime

Problem with “ValueError: Expected 2D array, got 1D array instead”

Question: I need to run an SVR (support vector regression). I have a CSV data frame. I had no problems running the OLS regression with one target variable and multiple regressors, but I have a problem with this part of the code. Here is my code:

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    sc_X = StandardScaler()
    sc_y = StandardScaler()
    X = sc_X.fit_transform(X)
    y = sc_y.fit_transform(y)
    y_pred = sc_y.inverse
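A likely source of the error in code like this is passing a 1D target to `StandardScaler`, which expects a 2D array of shape (n_samples, n_features). The sketch below assumes that is the cause and uses synthetic data in place of the asker's CSV. It reshapes `y` to a single column for scaling and flattens it again for `SVR`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the asker's data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)  # shape (50,)

sc_X = StandardScaler()
sc_y = StandardScaler()

X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1))  # scaler needs shape (50, 1)

model = SVR(kernel="rbf")
model.fit(X_scaled, y_scaled.ravel())  # SVR expects a 1D target

# Predict in scaled space, then map back to the original units.
pred_scaled = model.predict(sc_X.transform(X[:5]))
y_pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
print(y_pred)
```

The same reshape applies at prediction time: `inverse_transform` also expects a 2D array, so the 1D output of `predict` is reshaped to one column first.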

show overfitting with sklearn & random forest

Question: I followed this tutorial to create a simple image classification script: https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/

    train_data = scipy.io.loadmat('extra_32x32.mat')
    # extract the images and labels from the dictionary object
    X = train_data['X']
    y = train_data['y']
    X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2],X.shape[3]).T
    y = y.reshape(y.shape[0],)
    X, y = shuffle(X, y, random_state=42)
    ....
    clf = RandomForestClassifier()
    print(clf)
    start_time = time.time()
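A simple way to make overfitting visible is to compare accuracy on the training data with accuracy on held-out data. The sketch below assumes that is what the asker wants to show. It uses scikit-learn's bundled digits dataset instead of the `extra_32x32.mat` file from the question, and the split size and forest settings are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bundled dataset used as a stand-in for the asker's images (assumption).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
gap = train_acc - test_acc

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {gap:.3f}")
```

A large gap between the two scores is the usual sign of overfitting. Repeating the comparison while varying a capacity parameter such as `max_depth` shows how the gap changes with model complexity.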

TypeError: only integer scalar arrays can be converted to a scalar index, while trying kfold cv

Question: I am trying to perform k-fold cross-validation on a dataset of 279 files. After k-means, the data has shape (279, 5, 90); I reshaped it to (279, 5*90) so that it can be fitted with an SVM. Trying the KFold approach gives me the error "TypeError: only integer scalar arrays can be converted to a scalar index".

    #input
    with open("dataset.pkl", "rb") as file:
        dataset = pkl.load(file)
    print(len(dataset))
    x = [i[0] for i in dataset] #k-means cc
    y = [i[1] for i in dataset] #label
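One common cause of this message is indexing a plain Python list with the NumPy index arrays that `KFold.split` yields. The rest of the asker's code is not shown, so this is an assumption. The sketch below builds a small synthetic `dataset` of (features, label) pairs and converts `x` and `y` to NumPy arrays before splitting.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic stand-in for the pickled dataset: 40 items of shape (5, 90) (assumption).
rng = np.random.default_rng(0)
dataset = [(rng.normal(size=(5, 90)), i % 2) for i in range(40)]

# Convert to arrays so that index arrays can select rows.
x = np.array([features for features, _ in dataset]).reshape(len(dataset), -1)
y = np.array([label for _, label in dataset])

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(x):
    clf = SVC().fit(x[train_idx], y[train_idx])
    scores.append(clf.score(x[test_idx], y[test_idx]))

print("fold accuracies:", scores)
```

With `x` and `y` left as lists, an expression such as `x[train_idx]` raises the TypeError from the question, because a list accepts only a single integer or a slice as its index.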

Performing one hot encoding on two columns of string data

Question: I am trying to predict 'Full_Time_Home_Goals'. My code is:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.ensemble import RandomForestRegressor
    import os
    import xlrd
    import datetime
    import numpy as np
    # Set option to display all the rows and columns in the dataset. If there are more rows, adjust number accordingly.
    pd.set_option('display.max_rows', 5000)
    pd.set
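One-hot encoding two string columns can be sketched with `pandas.get_dummies`. The excerpt does not show the column names, so `HomeTeam` and `AwayTeam` below are hypothetical, as are the three example rows; only `Full_Time_Home_Goals` comes from the question.

```python
import pandas as pd

# Tiny illustrative frame; column names other than the target are assumptions.
df = pd.DataFrame(
    {
        "HomeTeam": ["Arsenal", "Chelsea", "Arsenal"],
        "AwayTeam": ["Chelsea", "Arsenal", "Leeds"],
        "Full_Time_Home_Goals": [2, 1, 3],
    }
)

# Replace each string column with one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["HomeTeam", "AwayTeam"], dtype=int)

X = encoded.drop(columns=["Full_Time_Home_Goals"])  # numeric features
y = encoded["Full_Time_Home_Goals"]                 # regression target

print(X)
```

The resulting `X` is fully numeric, so it can be passed to estimators such as `RandomForestRegressor`. If the encoding needs to be reapplied to new data with the same columns, scikit-learn's `OneHotEncoder` fitted on the training set is the usual alternative.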