kNN

Using cosine distance with scikit-learn's KNeighborsClassifier

て烟熏妆下的殇ゞ submitted on 2019-11-30 19:14:28
Is it possible to use something like 1 - cosine similarity with scikit-learn's KNeighborsClassifier? This answer says no, but the documentation for KNeighborsClassifier says the metrics listed in DistanceMetric are available. That list doesn't include an explicit cosine distance, probably because it isn't a true metric, but supposedly it's possible to pass a function in as the metric. I tried passing the scikit-learn linear kernel into KNeighborsClassifier, but it gives me an error saying the function needs two arrays as arguments. Has anyone else tried this? The cosine…
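A minimal sketch of the usual workaround: pass a callable that takes two 1-D arrays and returns a scalar distance (recent scikit-learn versions also accept metric='cosine' directly with the brute-force algorithm). X_train and y_train are placeholders, not from the original question.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_distance(a, b):
    # 1 - cosine similarity between two 1-D vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A custom callable metric implies brute-force search, so state it explicitly.
clf = KNeighborsClassifier(n_neighbors=5, metric=cosine_distance, algorithm="brute")
# clf.fit(X_train, y_train)  # X_train, y_train: hypothetical training data
```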

kNN: training, testing, and validation

泪湿孤枕 submitted on 2019-11-30 14:43:10
I am extracting image features from 10 classes with 1000 images each. Since there are 50 features I can extract, I am trying to find the best feature combination to use here. The data are divided as follows: training set = 70%, validation set = 15%, test set = 15%. I use forward feature selection on the validation set to find the best feature combination, and finally use the test set to check the overall accuracy. Could someone please tell me whether I am doing this right? So kNN is an exception to the general workflow for building/testing supervised machine learning…
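A sketch of this kind of pipeline in scikit-learn. Note that SequentialFeatureSelector scores candidates by internal cross-validation rather than one fixed 15% validation split, so it only approximates the setup described above; X, y, and n_features_to_select=10 are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Hold out a final test set (X, y: placeholder features and labels).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
# Greedy forward selection over the candidate features.
selector = SequentialFeatureSelector(
    knn, direction="forward", n_features_to_select=10, cv=5)
selector.fit(X_trainval, y_trainval)

# Train on the selected features, then check accuracy once on the test set.
knn.fit(selector.transform(X_trainval), y_trainval)
print(knn.score(selector.transform(X_test), y_test))
```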

Overview of the k-NN Algorithm

梦想与她 submitted on 2019-11-30 10:28:59
1. The kNN (k-NearestNeighbor) algorithm: given a new data point, compute its distance to every existing data point, take the k points with the smallest distances, and assign the new point to whichever class is most common among them.
• Common pitfalls:
  1. If k is too small, overfitting is likely: accuracy is high on the training set but low on the test set.
  2. Unbalanced feature weights. When computing distances between sample points, dimensions that differ by orders of magnitude make some features contribute far too much (or too little) to the distance, so the data should be normalized to avoid this (a sketch follows below).
• Distance measures: Euclidean distance, Manhattan distance, maximum (Chebyshev) distance, and so on.
2. k-d (k-dimension) tree: partitions the space into regions so that a search can be confined to the relevant region.
Source: https://www.cnblogs.com/yyf2019/p/11578878.html
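A minimal sketch of the min-max normalization mentioned above, assuming dataSet is a NumPy array with one sample per row (the function name autoNorm is illustrative):

```python
import numpy as np

def autoNorm(dataSet):
    # Rescale every feature to [0, 1] so no single dimension dominates the distance.
    minVals = dataSet.min(axis=0)
    maxVals = dataSet.max(axis=0)
    ranges = maxVals - minVals
    return (dataSet - minVals) / ranges, ranges, minVals
```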

The kNN Algorithm

橙三吉。 submitted on 2019-11-30 10:19:25
As you can see, the k-nearest-neighbor algorithm solves classification problems through distance. Here we solve a binary classification problem, and the overall algorithm is structured as follows (a sketch follows after the list):
• Compute distances. Given a test object Item, compute its distance to every object in the training set: apply the formula to Item and D1, D2, …, Dn, obtaining Sim(Item, D1), Sim(Item, D2), …, Sim(Item, Dn).
• Find neighbors. Select the k closest training objects as the test object's neighbors: sort Sim(Item, D1), Sim(Item, D2), …, Sim(Item, Dn), and put those that exceed a similarity threshold t into the neighbor set NN.
• Classify. Classify the test object by the dominant class among these k neighbors: take the top k members of NN, look at their labels, tally them, and take the majority vote as Item's likely class.
Source: https://www.cnblogs.com/wangwendi----/p/11578365.html
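A small sketch of the three steps above, using cosine similarity as the Sim function; train, labels, item, k, and t are illustrative placeholders.

```python
import numpy as np
from collections import Counter

def knn_classify(item, train, labels, k, t):
    # 1. Compute the similarity between item and every training object.
    sims = train @ item / (np.linalg.norm(train, axis=1) * np.linalg.norm(item))
    # 2. Keep objects whose similarity exceeds the threshold t, sorted descending.
    nn = sorted(((s, y) for s, y in zip(sims, labels) if s > t), reverse=True)
    # 3. Majority vote over the labels of the top-k neighbors.
    votes = Counter(y for _, y in nn[:k])
    return votes.most_common(1)[0][0]
```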

KNN

雨燕双飞 submitted on 2019-11-30 10:08:29
The k-nearest-neighbor (kNN, k-NearestNeighbor) algorithm is a basic classification and regression method.
Pros and cons:
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity.
Applicable data types: numeric and nominal.
Pseudocode:
For every data point in the data set:
    compute the distance between the target point (the point to be classified) and that data point
Sort the distances in ascending order
Take the K smallest distances
Find the most frequent class among these K neighbors
Return that class as the prediction for the target point
Core code (as in Machine Learning in Action's classify0):

```python
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    # 1. Distance computation
    dataSetSize = dataSet.shape[0]
    # tile repeats inX into a matrix matching the training set, then subtract
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2             # square element-wise
    sqDistances = sqDiffMat.sum(axis=1)  # sum each row
    distances = sqDistances ** 0.5       # square root -> Euclidean distances
    # argsort() returns the indices that would sort the array ascending;
    # e.g. for x = array([3,0,2,1,4,5]), x.argsort() = array([1,3,2,0,4,5]),
    # since x[1] = 0 is the smallest element.
    sortedDistIndices = distances.argsort()
    # 2. Vote with the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # 3. Return the majority label
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
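A quick usage example for classify0, with a tiny illustrative data set (the same toy samples appear again in a later entry of this digest):

```python
from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.2], group, labels, 3))  # two of the three nearest are 'B' -> 'B'
```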

kNN algorithm in MATLAB

不问归期 submitted on 2019-11-30 07:46:06
I am working on a thumb recognition system and need to implement the kNN algorithm to classify my images. According to this, it has only 2 measurements, through which it calculates the distance to find the nearest neighbour, but in my case I have 400 images of 25 × 42 pixels, of which 200 are for training and 200 for testing. I have been searching for a few hours but cannot find a way to compute the distance between the points. EDIT: I have reshaped the first 200 images into 1 × 1050 vectors and stored them in a 200 × 1050 matrix trainingData; I built testingData the same way. Here is an illustration code for k-nearest…
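The question is about MATLAB (where pdist2 builds the same matrix), but for consistency with the other snippets in this digest, here is the core computation sketched in Python/NumPy; trainingData, testingData, and trainLabels are placeholders shaped as described above, with integer class labels assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist

# 200 x 200 matrix: Euclidean distance between every test and training image.
D = cdist(testingData, trainingData, metric="euclidean")

k = 3
# Indices of the k nearest training images for every test image.
nearest = np.argsort(D, axis=1)[:, :k]
# Majority vote over the neighbours' (integer) labels.
predictions = np.array([np.bincount(trainLabels[idx]).argmax() for idx in nearest])
```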

kNN (parsing data from a text file)

青春壹個敷衍的年華 submitted on 2019-11-30 05:53:00
```python
# Prepare the data: parse it from a text file.
# In kNN.py, create a function named file2matrix to handle the input format.
# Its input is a filename string; its outputs are the matrix of training
# samples and the vector of class labels: a parser that converts text
# records into NumPy arrays.
from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)       # number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # NumPy matrix to return
    classLabelVector = []
    index = 0
    for line in arrayOLines:                     # parse each record
        line = line.strip()                      # strip the trailing newline
        listFromLine = line.split('\t')          # split the row on tab characters
        returnMat[index, :] = listFromLine[0:3]  # first three fields -> feature matrix
        # -1 indexes the last element of the list: the class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```
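A hedged usage example; the filename and the three-feature, tab-separated format are assumptions carried over from the comments above:

```python
# 'datingTestSet2.txt' stands for any tab-separated file with three numeric
# feature columns followed by an integer class label.
datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
print(datingDataMat[:3], datingLabels[:3])
```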

Increasing n_jobs has no effect on GridSearchCV

落爺英雄遲暮 submitted on 2019-11-30 05:27:09
I have set up a simple experiment to check the importance of a multi-core CPU while running sklearn GridSearchCV with KNeighborsClassifier. The results I got surprised me, and I wonder whether I misunderstood the benefits of multiple cores or whether I didn't set it up right. There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for…
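A sketch of the kind of experiment described: timing GridSearchCV over several n_jobs values. The digits data set and parameter grid are illustrative choices, not from the original post.

```python
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
param_grid = {"n_neighbors": range(1, 30)}

for n_jobs in (1, 2, 4, 8):
    start = time.time()
    GridSearchCV(KNeighborsClassifier(), param_grid,
                 cv=5, n_jobs=n_jobs).fit(X, y)
    print(f"n_jobs={n_jobs}: {time.time() - start:.1f}s")
```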

The k-nearest-neighbor algorithm (kNN)

独自空忆成欢 submitted on 2019-11-30 04:25:51
1. The k-nearest-neighbor algorithm (kNN). How it works: there is a sample data set (the training set) in which every record carries a label, i.e. we know which class each sample in the set belongs to.
When a new, unlabeled record arrives, each of its features is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors).
In general we only look at the k most similar samples in the data set (this is where the k comes from), with k usually an integer ≤ 20; the class that appears most often among those k samples is chosen as the class of the new record.
General workflow: collect data, prepare data, analyze data, train, test, use the algorithm.
1. Importing data with Python

```python
from numpy import *  # scientific computing package
import operator      # operator module

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inX: the input vector to classify; dataSet: the training sample set;
# labels: the label vector; k: the number of nearest neighbors to use
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # ... the rest of the body matches the classify0 shown earlier in this digest
```
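A quick check of the toy data set above, assuming the full classify0 body from the earlier entry:

```python
group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # two of the three nearest are 'B' -> 'B'
```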

Computing sparse pairwise distance matrix in R

人走茶凉 submitted on 2019-11-30 03:08:46
I have an N×M matrix and I want to compute the N×N matrix of Euclidean distances between the N points. In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest-neighbor algorithm, I only need to keep the k smallest distances, so the resulting N×N matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would produce a dense matrix (and probably storage problems for my size of N). The kNN packages I've found so far (knnflex, kknn, etc.) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise…
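The question asks about R, but for consistency with the other snippets in this digest, here is the same idea sketched in Python: scikit-learn's kneighbors_graph builds exactly this k-sparse distance matrix without ever materializing the dense N×N one. X is a placeholder N×M array.

```python
from sklearn.neighbors import kneighbors_graph

# CSR sparse matrix holding, for each of the N rows of X, the Euclidean
# distances to its k nearest neighbors; everything else stays an implicit zero.
D_sparse = kneighbors_graph(X, n_neighbors=5, mode="distance")
print(D_sparse.nnz)  # about N * k stored entries instead of N**2
```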