scikit-learn

How to assess the confidence score of a prediction with scikit-learn

杀马特。学长 韩版系。学妹 Submitted on 2021-02-05 06:15:08

Question: I have written a simple function that takes one argument, "query_seq"; further methods compute descriptors, and in the end a prediction of "0 (negative for the given case)" or "1 (positive for the given case)" can be made with "LogisticRegression" (or any other algorithm passed to the function): def main_process(query_Seq): LR = LogisticRegression() GNB = GaussianNB() KNB = KNeighborsClassifier() DT = DecisionTreeClassifier() SV = SVC(probability=True) train_x, train_y,train_l = data
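Most scikit-learn classifiers expose a predict_proba method that returns per-class probabilities alongside the hard 0/1 label from predict. A minimal sketch of using it as a confidence score (the toy data here is generated for illustration and stands in for the descriptor features in the question):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary classification data (stand-in for the computed descriptors)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

LR = LogisticRegression().fit(X, y)

pred = LR.predict(X[:1])         # hard label: 0 or 1
proba = LR.predict_proba(X[:1])  # per-class probabilities, shape (1, 2)

# The "confidence" of a prediction is the probability of the predicted class
confidence = proba[0, pred[0]]
print(pred, proba, confidence)
```

For SVC this only works when the model was created with probability=True, as in the snippet above; tree-based models and naive Bayes expose predict_proba out of the box.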

Specific number of test/train size for each class in sklearn

丶灬走出姿态 Submitted on 2021-02-04 21:17:10

Question: Data: import pandas as pd data = pd.DataFrame({'classes':[1,1,1,2,2,2,2],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]}) My code: import numpy as np from sklearn.cross_validation import train_test_split X = np.array(data[['b','c']]) y = np.array(data['classes']) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4) Question: train_test_split will randomly choose the test set from all the classes. Is there any way to get the same number of test samples for each class? (For example
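One approach is the stratify parameter of train_test_split, which preserves the class proportions of y in both splits (note that stratification keeps classes *proportional*, not exactly equal; for an exact per-class count you would split each class separately and concatenate). A sketch using the question's own data, with the modern sklearn.model_selection import that replaced sklearn.cross_validation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # replaces sklearn.cross_validation

data = pd.DataFrame({'classes': [1, 1, 1, 2, 2, 2, 2],
                     'b': [3, 4, 5, 6, 7, 8, 9],
                     'c': [10, 11, 12, 13, 14, 15, 16]})
X = np.array(data[['b', 'c']])
y = np.array(data['classes'])

# stratify=y keeps the 3:4 class ratio of y in both the train and test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, stratify=y, random_state=0)

print(np.bincount(y_test))  # per-class counts in the test set
```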

Computing scikit-learn multiclass ROC Curve with cross validation (CV)

百般思念 Submitted on 2021-02-04 16:32:18

Question: I want to evaluate my classification models with a ROC curve. I'm struggling to compute a multiclass ROC curve for a cross-validated data set. There is no division into train and test sets because of the cross-validation. Below is the code I have already tried. scaler = StandardScaler(with_mean=False) enc = LabelEncoder() y = enc.fit_transform(labels) vec = DictVectorizer() feat_sel = SelectKBest(mutual_info_classif, k=200) n_classes = 3 # Pipeline for computing of ROC curves clf
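One common pattern is cross_val_predict with method="predict_proba": every sample gets an out-of-fold probability estimate from a model that never saw it during fitting, so no explicit train/test split is needed. The scores can then be fed to roc_curve one-vs-rest per class. A sketch on invented 3-class toy data (standing in for the vectorized features in the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

# Toy 3-class problem (stand-in for the pipeline's features)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
n_classes = 3

# Out-of-fold probability estimates: each sample is scored by a fold
# that excluded it, so the whole data set acts as a "test" set
y_score = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                            cv=5, method="predict_proba")

# One ROC curve per class, one-vs-rest
y_bin = label_binarize(y, classes=[0, 1, 2])
aucs = {}
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    aucs[i] = auc(fpr, tpr)
print(aucs)
```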

How to normalize a confusion matrix?

拜拜、爱过 Submitted on 2021-02-04 09:45:49

Question: I calculated a confusion matrix for my classifier using the confusion_matrix() method from the sklearn package. The diagonal elements of the confusion matrix represent the number of points for which the predicted label equals the true label, while the off-diagonal elements are those mislabeled by the classifier. I would like to normalize my confusion matrix so that it contains only numbers between 0 and 1. I would like to read the percentage of correctly classified samples from the
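Row-normalizing does this: dividing each row by its sum turns entry (i, j) into the fraction of true-class-i samples predicted as class j, so the diagonal holds per-class accuracy. A sketch on invented labels; recent scikit-learn versions (0.22+) also accept normalize='true' directly:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 0, 2, 2]

cm = confusion_matrix(y_true, y_pred)

# Divide each row by its sum: entry (i, j) becomes the fraction of
# true-class-i samples that were predicted as class j
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
print(cm_norm)

# Equivalent built-in option in scikit-learn >= 0.22
cm_norm2 = confusion_matrix(y_true, y_pred, normalize='true')
```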

Fitting data vs. transforming data in scikit-learn

青春壹個敷衍的年華 Submitted on 2021-02-04 09:16:37

Question: In scikit-learn, all estimators have a fit() method and, depending on whether they are supervised or unsupervised, also a predict() or transform() method. I am in the process of writing a transformer for an unsupervised learning task and was wondering whether there is a rule of thumb for where to put which kind of learning logic. The official documentation is not very helpful in this regard: fit_transform(X, y=None, **fit_params) Fit to data, then transform it. In this context, what is meant
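The usual rule of thumb: anything *learned from the training data* (statistics, vocabularies, fitted parameters) belongs in fit(), which stores it on the estimator and returns self; transform() only *applies* the stored state to new data. A sketch with a hypothetical MeanCenterer transformer (not a real sklearn class, invented here for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fit() learns the column means,
    transform() subtracts them."""

    def fit(self, X, y=None):
        # Learning from data happens here; learned state gets a
        # trailing underscore by sklearn convention
        self.means_ = np.asarray(X).mean(axis=0)
        return self  # fit must return self so chaining works

    def transform(self, X):
        # No learning here: only apply what fit() stored
        return np.asarray(X) - self.means_

X_train = [[1.0, 10.0], [3.0, 30.0]]
t = MeanCenterer().fit(X_train)
print(t.transform(X_train))
```

TransformerMixin then supplies fit_transform() for free as fit(X).transform(X), which is exactly what the quoted docstring ("Fit to data, then transform it") describes.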

Machine Learning: Evaluating Classification Results (F1 Score)

牧云@^-^@ Submitted on 2021-02-02 13:57:00

I. Basics
Question 1: When applying an algorithm in practice, how do you judge its quality from precision and recall? It depends on the use case:
Example 1: stock prediction (will a given stock rise or fall?). The business needs to find rising stocks as precisely as possible; in this case, the higher the model's precision, the better.
Example 2: medical diagnosis (is the person being examined sick?). The business needs to find all sick patients as completely as possible, ideally missing none; even misclassifying a healthy person as sick is acceptable, as long as no sick patient is classified as healthy. In this case, the higher the model's recall, the better.
Question 2: In some situations both precision and recall must be considered, with equal weight. How do we judge the model then?
Method: use a new evaluation metric, the F1 Score.

II. F1 Score
F1 Score: balances precision and recall. When both need to be considered, look at the model's F1 Score and judge the model by its value.
F1 = 2 * precision * recall / (precision + recall), the harmonic mean of the two.
Harmonic mean: if 1/a = (1/b + 1/c) / 2, then a is called the harmonic mean of b and c.
Property of the harmonic mean: the larger |b - c| is, the smaller a is; when b - c = 0, then a = b = c and a reaches its maximum. Applied to precision and recall, only when the two are balanced is the F1
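The formula above can be checked numerically against scikit-learn's own metrics, which compute exactly this harmonic mean (the labels below are an invented toy example):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # TP=3, FP=1 -> 3/4
r = recall_score(y_true, y_pred)     # TP=3, FN=1 -> 3/4
f1 = f1_score(y_true, y_pred)

# F1 = 2 * p * r / (p + r), the harmonic mean of precision and recall
assert abs(f1 - 2 * p * r / (p + r)) < 1e-12
print(p, r, f1)
```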

Little-Known Python Data Science Libraries

非 Y 不嫁゛ Submitted on 2021-02-02 04:05:46

Python is a remarkable language. Tested by time and practice, it is praised by developers and data scientists alike, and it is now one of the most successful programming languages in the world. Easy to use, and backed by a complete and vast third-party ecosystem, Python is the first choice of beginners and senior engineers alike. In this article we share Python data science libraries beyond the usual ones (numpy, pandas, scikit-learn, matplotlib, and so on); as great as those are, there are other lesser-known but equally excellent libraries worth exploring and learning.

1. Wget
Fetching data from the web is considered an essential skill for data scientists, and Wget is a non-interactive, command-line-style file download library. It supports the HTTP, HTTPS and FTP protocols, as well as downloading through an IP proxy. Because it is non-interactive, it can keep running in the background even when the user is not logged in. So the next time you want to download a page from the web, Wget can help.

Installation: pip install wget

Example:
import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)

Run and output:
100% [................................................] 3841532 / 3841532
filename

04 Data Feature Preprocessing (Feature Engineering, Day 1)

僤鯓⒐⒋嵵緔 Submitted on 2021-02-02 02:51:42

0. Xmind
1. Feature preprocessing of data
  1) What is feature preprocessing? Statistical methods applied to the data as required by the algorithm.
  2) Ways of preprocessing features
  3) sklearn.preprocessing contains all the preprocessing methods.
2. Normalization
  1) What is normalization? Raw data is transformed/mapped into the range [0, 1].
  2) Formula and computation steps (shown as images in the original post)
  3) sklearn.preprocessing.MinMaxScaler ("scaler" as in scaling). Input: a 2-D array.
Code:

from sklearn.preprocessing import MinMaxScaler

# Normalization
def minmaxScaler():
    """Normalization (归一化处理)
    :return: None
    """
    # mm = MinMaxScaler()  # default feature_range is (0, 1)
    mm = MinMaxScaler(feature_range=(2, 3))
    data = mm.fit_transform([[90, 2, 10, 40], [60, 4, 15, 45], [75, 3, 13, 46]])
    print(data)

if __name__
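The min-max formula the post refers to (shown as images there) can be checked by hand: each column is mapped by X' = (X - col_min) / (col_max - col_min), then stretched and shifted into feature_range. A sketch reproducing MinMaxScaler's output on the post's own data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[90, 2, 10, 40],
              [60, 4, 15, 45],
              [75, 3, 13, 46]], dtype=float)

# Default range [0, 1]
scaled = MinMaxScaler().fit_transform(X)

# Same computation by hand, column-wise:
# X' = (X - col_min) / (col_max - col_min)
by_hand = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(scaled)

# With feature_range=(2, 3) as in the post, the [0, 1] result is
# stretched by (3 - 2) and shifted by 2, landing in [2, 3]
scaled23 = MinMaxScaler(feature_range=(2, 3)).fit_transform(X)
```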