auc

Implementing logistic regression in Python: a hands-on case study (full code and sample data included; see the Baidu Netdisk link at the bottom of the article)

Anonymous (unverified), submitted 2019-12-02 22:54:36
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the sample data (sheet 'logist_model') from the Excel file
data_lr = pd.read_excel('D:\python原始数据\logist_model.xlsx', 'logist_model')
print(data_lr.shape)
print(data_lr.head(10))

# First 200 rows for training, remaining rows for testing;
# columns 2-4 are the features, column 5 is the label
array = data_lr.values
X_train = array[0:200, 2:5]
Y_train = array[0:200, 5]
X_test = array[200:291, 2:5]
Y_test = array[200:291, 5]

model = LogisticRegression()
model.fit(X_train, Y_train)
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
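The preview cuts off before the model is evaluated on the 91 held-out rows. A minimal continuation sketch, under the assumption that one simply scores the X_test/Y_test split defined above (the use of model.score and roc_auc_score here is my addition, not necessarily what the original article does):

from sklearn.metrics import roc_auc_score

# Hypothetical evaluation on the held-out rows defined above
test_accuracy = model.score(X_test, Y_test)
test_auc = roc_auc_score(Y_test, model.predict_proba(X_test)[:, 1])
print("test accuracy:", test_accuracy)
print("test AUC:", test_auc)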

Python homework with sklearn

Anonymous (unverified), submitted 2019-12-02 22:54:36
Assignment:
1. Create a classification dataset (n_samples 1000, n_features 10)
2. Split the dataset using 10-fold cross-validation
3. Train the algorithms: GaussianNB; SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel); RandomForestClassifier (possible n_estimators values [10, 100, 1000])
4. Evaluate the cross-validated performance: Accuracy, F1-score, AUC ROC
5. Write a short report summarizing the methodology and the results

from sklearn import datasets, cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
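A sketch of one possible way to approach the assignment. The excerpt imports the old sklearn.cross_validation module; the sketch below assumes a current sklearn and uses model_selection instead, and it evaluates only one candidate value per hyperparameter rather than the full grids:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# 1) Synthetic classification dataset: 1000 samples, 10 features
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 3) Candidate models (one C / n_estimators value shown; loop over the grids for the full assignment)
models = {
    "GaussianNB": GaussianNB(),
    "SVC (RBF, C=1e00)": SVC(C=1e00, kernel="rbf"),
    "RandomForest (n_estimators=100)": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 2) + 4) 10-fold cross-validated accuracy, F1-score and ROC AUC
for name, model in models.items():
    for metric in ("accuracy", "f1", "roc_auc"):
        scores = cross_val_score(model, X, y, cv=10, scoring=metric)
        print(f"{name:35s} {metric:10s} {scores.mean():.3f}")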

A summary of form handling in Python web scraping

Anonymous (unverified), submitted 2019-12-02 22:51:30
Dynamic content: AJAX data shows up under XHR in the browser's network panel; refresh the page and check whether the newly captured requests contain the data you want.
IF-TargetVerb: POST
IF-TargetContent: [{"Lbl":"attachmentWrapper","Src":"div.InFlightAttachment:first","Data":"null","HWA":".","Children":[{"Lbl":"attachmentLink","Src":".","Data":"text:href","Children":[]}]},{"Lbl":"popupMessageContent","Src":"span.InFlightPopup","Data":"html","Children":[]},{"Lbl":"item2","Src":"[id=ZZ_VNDR_AD_WRK_DESCR2000]","Data":"value","Children":[]},{"Lbl":"acceptInvite","Src":"#RESP_INQ_DL0_WK_BID_INV_ACCPT_BTN","Data":"id name","HWA":".","Children":[]},{"Lbl":"if_error_items","Src":"div[id*=RESP_ERR_HTMLAREA]:eq(0)","Data":

AUC-ROC for a non-ranking classifier such as OSVM

烈酒焚心, submitted 2019-12-02 20:21:31
Question: I'm currently working with AUC-ROC curves. Let's say I have a non-ranking classifier, such as a one-class SVM, whose predictions are either 0 or 1 and cannot easily be converted to probabilities or scores. If I don't want to plot the ROC curve and would only like to calculate the AUC to see how well my model is doing, can I still do that? And would it still be called an AUC, especially since there are only two thresholds that can be used (0 and 1)?
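Not part of the original question, but one commonly cited answer: sklearn.metrics.roc_auc_score accepts hard 0/1 predictions, and with a single effective threshold the ROC "curve" is just (0,0) -> (FPR, TPR) -> (1,1), so the area equals (TPR + TNR) / 2, i.e. balanced accuracy. A minimal sketch with made-up labels:

import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])  # hard 0/1 output, e.g. from a one-class SVM

print(roc_auc_score(y_true, y_pred))           # area of the two-segment ROC, here 0.666...
print(balanced_accuracy_score(y_true, y_pred)) # same value: (TPR + TNR) / 2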

What is a threshold in a Precision-Recall curve?

天涯浪子, submitted 2019-12-02 17:15:36
I am aware of the concept of Precision as well as the concept of Recall, but I am finding it very hard to understand the idea of the 'threshold' that makes any P-R curve possible. Imagine I have to build a model that predicts the recurrence (yes or no) of cancer in patients, using some decent classification algorithm on relevant features. I split my data for training and testing. Let's say I trained the model on the training data and computed my Precision and Recall metrics on the test data. But HOW can I draw a P-R curve now? On what basis? I only have two values: one precision and one recall.
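Not part of the original question, but the usual answer in code form: the curve comes from the classifier's scores or predicted probabilities rather than its hard yes/no labels, and every candidate threshold on those scores yields one (precision, recall) point. A minimal sketch on synthetic data (the dataset and model here are placeholders, not the asker's):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # scores, not hard yes/no predictions

# One (precision, recall) pair per candidate threshold on the scores
precision, recall, thresholds = precision_recall_curve(y_test, probs)
for p, r, t in list(zip(precision, recall, thresholds))[:5]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")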

Evaluation methods in sklearn.metrics (accuracy_score, recall_score, roc_curve, roc_auc_score, confusion_matrix)

故事扮演, submitted 2019-12-02 03:42:56
1. accuracy_score: the classification accuracy is the percentage of samples that are classified correctly. It is the easiest classifier metric to understand, but it tells you nothing about the underlying distribution of the response values or about the kinds of mistakes the classifier makes, which often misleads beginners.

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

normalize: defaults to True and returns the fraction of correctly classified samples; if False, it returns the number of correctly classified samples.

import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)                   # output: 0.5
accuracy_score(y_true, y_pred, normalize=False)  # output: 2

2. recall_score: recall = (number of relevant items retrieved) / (number of relevant items in the sample). Put plainly, it measures how many of the truly relevant items were retrieved.

sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
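The preview cuts off at recall_score; a minimal sketch of recall_score alongside confusion_matrix and roc_auc_score from the same module, on toy data of my own (not the article's example):

import numpy as np
from sklearn.metrics import recall_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1])                 # hard predicted labels
y_score = np.array([0.1, 0.9, 0.4, 0.3, 0.8, 0.7])    # predicted probabilities of class 1

print(recall_score(y_true, y_pred))      # 3 of the 4 positives retrieved -> 0.75
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
print(roc_auc_score(y_true, y_score))    # ranking quality of the probability scores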

Computing AUC in Python

一世执手, submitted 2019-12-01 18:25:53
AUC (Area Under Curve) is a standard evaluation measure for binary classification in machine learning; literally, it is the area under the ROC curve. An equivalent interpretation: draw a random pair of samples (one positive, one negative), score both with the trained classifier, and the AUC is the probability that the positive sample receives a higher predicted probability than the negative one. For a dataset with M positive samples and N negative samples, it can be computed with the formula:

\[ AUC=\frac{\sum_{i \in positiveClass} rank_i-\frac{M(1+M)}{2}}{M \cdot N} \]

The Python implementation below is essentially a counting sort: since the predicted probabilities are decimals, we multiply them by 100 and take the integer part to bucket them (the number of bins can be adjusted for more precision). Once bucketed, we count the pairs in which the positive sample's probability exceeds the negative sample's, add half of the pairs in which the two are equal, and divide by the total number of pairs (M * N) to obtain the final AUC.

def AUC(labels, preds, n_bins=100):
    m = sum(labels)              # number of positive samples (M)
    n = len(labels) - m          # number of negative samples (N)
    total_case = m * n           # total number of positive/negative pairs
    pos = [0 for _ in range(n_bins)]
    neg = [0 for _ in range(n_bins)]
    bin_width = 1.0 / n_bins
    for i in range(len(labels)):
        nth_bin = int(preds[i] / bin_width)
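The preview cuts off inside the loop. A possible completion of the function, following the counting logic described above (the accumulation details and the sanity check against sklearn are my reconstruction, not necessarily identical to the original article's code):

def AUC(labels, preds, n_bins=100):
    m = sum(labels)                      # number of positive samples (M)
    n = len(labels) - m                  # number of negative samples (N)
    total_case = m * n                   # total number of positive/negative pairs
    pos = [0 for _ in range(n_bins)]
    neg = [0 for _ in range(n_bins)]
    bin_width = 1.0 / n_bins
    for i in range(len(labels)):
        nth_bin = min(int(preds[i] / bin_width), n_bins - 1)  # a prediction of exactly 1.0 goes to the last bin
        if labels[i] == 1:
            pos[nth_bin] += 1
        else:
            neg[nth_bin] += 1
    accumulated_neg = 0
    satisfied_pair = 0
    for i in range(n_bins):
        # positives in this bin beat every negative in lower bins and tie (counted as 0.5) with negatives in the same bin
        satisfied_pair += pos[i] * accumulated_neg + pos[i] * neg[i] * 0.5
        accumulated_neg += neg[i]
    return satisfied_pair / float(total_case)

# Sanity check against sklearn's exact implementation
import numpy as np
from sklearn.metrics import roc_auc_score
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=1000)
preds = 0.3 * labels + 0.7 * rng.rand(1000)   # noisy scores correlated with the labels
print(AUC(list(labels), list(preds)))         # bucketed approximation
print(roc_auc_score(labels, preds))           # exact value, should be close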

Titanic survival prediction analysis

半城伤御伤魂, submitted 2019-11-30 23:44:23
This article was originally published on Jianshu and is reproduced here; the link is below. https://www.jianshu.com/p/a09b4dc904c9

Titanic survival prediction

1. Background and objective

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. The sensational tragedy shocked the international community and led to better ship-safety regulations. One reason for the loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class.

Using the data (see the data source below), the goal of this challenge is to analyze what sorts of people were likely to survive and, in particular, to apply machine-learning tools to predict which passengers survived the tragedy.

2. Analysis method and process

Titanic survival prediction involves the following main steps:
1) Descriptive statistics of the data
2) Exploratory analysis (looking for useful features) and preprocessing of the data from step 1), including analysis of missing values, attribute reduction, cleaning, and transformation
3) Training a model on the preprocessed modeling data produced in step 2)
4) Using the model to predict survival for the passengers in the test set

2.1 Data source and meaning

The data comes from the well-known machine-learning competition site Kaggle: https://www.kaggle.com/c/titanic/data

Field descriptions: PassengerId => passenger ID; Pclass => ticket class

roc_auc_score - Only one class present in y_true

落花浮王杯, submitted 2019-11-30 20:15:26
I am doing k-fold cross-validation on an existing dataframe, and I need to get the AUC score. The problem is that sometimes the test data contains only 0s and no 1s! I tried using this example, but with different numbers:

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 0, 0])
y_scores = np.array([1, 0, 0, 0])
roc_auc_score(y_true, y_scores)

And I get this exception:

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Is there any workaround that can make it work in such cases?

You could use try-except to prevent the error:
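The preview cuts off at the start of the answer's code. A minimal sketch of the try-except workaround it describes (returning NaN for folds with a single class is my choice of fallback, not necessarily the original answer's):

import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc_score(y_true, y_scores, fallback=np.nan):
    """ROC AUC, or `fallback` when y_true contains only one class."""
    try:
        return roc_auc_score(y_true, y_scores)
    except ValueError:
        # Only one class present in y_true; AUC is undefined for this fold
        return fallback

print(safe_roc_auc_score(np.array([0, 0, 0, 0]), np.array([1, 0, 0, 0])))          # nan
print(safe_roc_auc_score(np.array([0, 1, 0, 1]), np.array([0.1, 0.8, 0.3, 0.6])))  # 1.0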

Hyperparameter tuning (LightGBM)

自作多情, submitted 2019-11-30 08:43:12
Reference: the original article, "Automated Hyperparameter Optimization".

Automating the hyperparameter optimization process. Goal: use informed search (heuristic search guided by a strategy) to find the best hyperparameters in less time, with no manual work needed beyond the initial setup.

Practical part. A Bayesian optimization problem has four components:
- Objective function: the quantity we want to minimize, here the validation error of a machine-learning model as a function of its hyperparameters
- Domain space: the hyperparameter values to search over
- Optimization algorithm: the method for building a surrogate model and choosing the next hyperparameter values to evaluate
- Result history: the stored objective-function evaluations, i.e. hyperparameter settings and their validation losses

With these four pieces we can optimize (find the minimum of) any real-valued function, a powerful abstraction that is useful well beyond machine-learning hyperparameter tuning.

Code example. Dataset: https://www.jiqizhixin.com/articles/2018-08-08-2. Goal: predict whether a customer will buy an insurance product, a supervised classification problem with 5800 observations and 4000 test points. The classes are imbalanced, so the performance metric used here is the area under the receiver operating characteristic curve (ROC AUC); higher is better, and a value of 1 represents a perfect model.

Related reading: What is an imbalanced classification problem? How to handle "class imbalance" in data? Classification under extreme class imbalance S01: difficulties and challenges.

hyperropt1125.py - importing the libraries
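A minimal hyperopt sketch of the four components listed above (the objective, search space, and data are placeholders of my own, not the contents of the article's hyperropt1125.py):

import numpy as np
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
import lightgbm as lgb

# Imbalanced toy data standing in for the insurance dataset
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)

# 1) Objective function: cross-validated ROC AUC, returned as a loss to minimize
def objective(params):
    clf = lgb.LGBMClassifier(num_leaves=int(params["num_leaves"]),
                             learning_rate=params["learning_rate"])
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    return {"loss": 1 - auc, "status": STATUS_OK}

# 2) Domain space: distributions over the hyperparameters to search
space = {
    "num_leaves": hp.quniform("num_leaves", 8, 128, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
}

# 3) Optimization algorithm: the Tree-structured Parzen Estimator (TPE)
# 4) Result history: the Trials object stores every evaluation
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25, trials=trials)
print(best)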