knn

What's the difference between ANN, SVM and KNN classifiers?

牧云@^-^@ submitted on 2019-12-06 08:59:47
I know this is a very general question without specifics about my actual project, but here it is: I am doing remote sensing image classification using the object-oriented method: first I segment the image into different regions, then I extract features such as color, shape, and texture from each region. A region may have about 30 features, there are commonly about 2000 regions in total, and I will choose 5 classes with 15 samples per class. In summary: sample data 1530, test data 197530. How do I choose the proper classifier? If there are 3 classifiers (ANN, SVM, and KNN),
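
One way to make the comparison concrete is to cross-validate all three classifiers on the extracted region features. The sketch below is not from the question; it uses synthetic data of roughly the stated shape (5 classes, 15 samples each, ~30 features), and the model parameters are placeholder assumptions.

```python
# Sketch: comparing ANN (MLP), SVM and KNN with cross-validation.
# The data here is synthetic; in the question it would be the ~30
# region features (color, shape, texture) with 15 samples per class.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=75, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)

classifiers = {
    "ANN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
    "SVM": SVC(kernel="rbf", C=1.0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    # Feature scaling matters for all three distance/gradient-based models.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```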

[Machine Learning, Part 5] The KNN Algorithm with Examples

二次信任 submitted on 2019-12-06 04:31:52
I. Overview
[Definition] If the majority of the k most similar samples (i.e., the nearest neighbors in feature space) of a sample belong to a certain class, then that sample also belongs to that class.
II. Distance formula
The distance between two samples can be computed with the following formula, known as the [Euclidean distance]. Given feature vectors a(a1, a2, a3) and b(b1, b2, b3):
\[\sqrt{(a_1-b_1)^{2}+(a_2-b_2)^{2}+(a_3-b_3)^{2}}\]
III. The sklearn k-nearest-neighbor API
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
n_neighbors: int, optional (default = 5), the number of neighbors used by kneighbors queries.
algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional, the algorithm used to compute the nearest neighbors: 'ball_tree' uses a BallTree, 'kd_tree' uses a KDTree, and 'auto' tries to pick the most appropriate algorithm based on the values passed to fit. (Different implementations affect efficiency.)
IV. Hands-on
Data location: https://www.kaggle.com/c/facebook-v-predicting-check-ins/data
V. Data processing
1. Narrow down the dataset: DataFrame.query()
2. Process the date data: pd.to
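
A minimal sketch of the KNeighborsClassifier call described in section III, on toy data rather than the Kaggle check-in set; the numbers here are made up purely to show the API shape.

```python
# Minimal sketch of the API described above (toy data, not the Kaggle set).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [9, 9], [10, 10]]   # two features per sample
y_train = [0, 0, 1, 1]                          # class labels

knn = KNeighborsClassifier(n_neighbors=3, algorithm='auto')
knn.fit(X_train, y_train)
print(knn.predict([[8, 8]]))   # -> [1], the majority class of the 3 nearest neighbors
```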

User defined termvectors in ElasticSearch

亡梦爱人 submitted on 2019-12-06 04:18:31
How (if at all possible) can one insert an arbitrary term vector into an ElasticSearch index? ES computes term vectors behind the scenes in order to carry out its text mining tasks, but it would be useful to be able to enter any list of (term, weight) pairs instead. Why? Well, for instance, though ES enables kNN (k-nearest-neighbors) for k=2 in the context of geographic proximity, it doesn't have any explicit k>2 functionality. If we were able to insert our own term vectors, we could hack k>2 functionality by harnessing ES's built-in text-indexing methods. Any indications on this issue? As far as I

KNN with class weights in SKLearn [closed]

给你一囗甜甜゛ submitted on 2019-12-06 01:53:02
Is it possible to define class weights for a k-nearest-neighbour classifier in SKLearn? I have looked at the API but cannot work it out. I have a knn problem with very imbalanced class counts (10000 of some classes, 1 of others). The original knn in sklearn does not seem to offer that option. You can alter the source code, though, by adding coefficients (weights) to the distance equation such that the
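
KNeighborsClassifier has no class_weight parameter, so the idea in the question has to be hacked in by hand. The sketch below is one possible workaround (not sklearn's API): it finds the neighbors with NearestNeighbors and scales each neighbor's vote by a per-class coefficient; the helper name weighted_knn_predict and the class_weights argument are hypothetical.

```python
# Sketch of the idea in the question: weight each neighbor's vote by a
# per-class coefficient to counter class imbalance. KNeighborsClassifier
# itself has no class_weight parameter, so this uses NearestNeighbors directly.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_predict(X_train, y_train, X_test, k=5, class_weights=None):
    """class_weights: dict mapping class label -> vote weight (hypothetical helper)."""
    if class_weights is None:
        class_weights = {c: 1.0 for c in np.unique(y_train)}
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)          # indices of the k nearest training points
    preds = []
    for neighbors in idx:
        votes = {}
        for i in neighbors:                 # class-weighted vote over the neighbors
            label = y_train[i]
            votes[label] = votes.get(label, 0.0) + class_weights[label]
        preds.append(max(votes, key=votes.get))
    return np.array(preds)

# Example: up-weight the rare class 1 relative to the abundant class 0.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
y = np.array([0, 0, 0, 0, 1])
print(weighted_knn_predict(X, y, np.array([[4.0]]), k=3,
                           class_weights={0: 1.0, 1: 10.0}))   # -> [1]
```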

The kNN (k-Nearest Neighbor) Algorithm

放肆的年华 submitted on 2019-12-06 01:45:52
I. Algorithm overview
(1) Classification is performed by measuring the distance between feature values.
Advantages: high accuracy, insensitivity to outliers, no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity.
(2) The three elements of the KNN model
The kNN model is essentially a partition of the feature space. It has three basic elements: the distance metric, the choice of K, and the classification decision rule.
Distance metric. The distance is defined as
\[L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^{p} \right)^{\frac{1}{p}}\]
The Euclidean distance, i.e. the case p = 2, is generally used:
\[L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^{2} \right)^{\frac{1}{2}}\]
Choice of K. K is usually chosen from experience; several values need to be compared before a reasonably good one is found. If K is too small, the model becomes too complex, overfits easily, and is very sensitive to noise points. If K is too large, the model becomes too simple and ignores most of the useful information, which is also undesirable.
Classification decision rule. Majority voting is generally used: the class that appears most often among the K neighbors is the predicted class.
II. Implementing the kNN algorithm
2.1 Pseudocode
Compute the distance between every point in the dataset of known classes and the current point; sort in order of increasing distance; select the k points closest to the current point
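
A short Python sketch of that pseudocode, using plain numpy; the function and argument names are my own choices, not from the original post.

```python
# Sketch of the pseudocode above (argument names are my own choices).
import numpy as np
from collections import Counter

def knn_classify(x, dataset, labels, k):
    # 1. Compute the distance from every known point to the current point x.
    distances = np.sqrt(((dataset - x) ** 2).sum(axis=1))
    # 2. Sort by increasing distance and 3. take the k closest points.
    nearest = np.argsort(distances)[:k]
    # 4. Majority vote among the labels of those k points.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.2]), train, labels, k=3))  # -> 'B'
```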

The KNN (Nearest Neighbor) Classification Algorithm

*爱你&永不变心* submitted on 2019-12-06 00:52:35
I. Algorithm principle
KNN is one of the most basic algorithms in machine learning and a typical example of a lazy learner. "Lazy" means that the model makes predictions purely by memorizing the training set instead of learning a discriminant function.
The KNN algorithm itself is simple and can be summarized in a few steps:
1. Choose the number of neighbors k and the distance metric.
2. Find the k nearest neighbors of the sample to be classified.
3. Assign the class label by majority vote among those neighbors.
II. Hyperparameters (with sklearn.neighbors.KNeighborsClassifier)
2.1 n_neighbors (number of neighbors, default = 5)
The k of the k-nearest-neighbor algorithm. Empirically k = 5 often gives good results, but it needs to be validated during actual development. Note that if the best k found in the range 1 to 10 turns out to be 10, it is worth testing values above 10 as well, since an even better k may exist there.
2.2 weights (distance weighting, default = 'uniform')
Basic KNN simply finds the k samples nearest to the sample being classified and takes a majority vote, but the following situation can arise: by pure voting the sample would be assigned to the blue class, yet it lies much closer to the red class, so assigning it to red seems more reasonable. This is where distance weighting comes in. KNeighborsClassifier has a weights parameter; if it is not specified it defaults to 'uniform', i.e. plain majority voting, but it can be set to 'distance' so that neighbors vote with distance-based weights (see the sketch below).
2.3 algorithm
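
A small sketch of the weights behaviour described in 2.2: the same query point is classified differently under 'uniform' and 'distance' weighting. The toy coordinates are invented for illustration.

```python
# Sketch of the weights parameter discussed above: the same query point can be
# classified differently under plain majority voting vs. distance weighting.
from sklearn.neighbors import KNeighborsClassifier

# Two "blue" points fairly far away, one "red" point very close to the query.
X = [[0.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
y = ['red', 'blue', 'blue']
query = [[0.5, 0.5]]

for w in ('uniform', 'distance'):
    knn = KNeighborsClassifier(n_neighbors=3, weights=w).fit(X, y)
    print(w, knn.predict(query))
# uniform  -> ['blue']  (2 votes vs 1)
# distance -> ['red']   (the close red point outweighs the two distant blue ones)
```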

Creating a dataset from an image with Python for face recognition

半城伤御伤魂 submitted on 2019-12-05 22:58:31
I am trying to code a face-recognition program in Python (I am going to apply the k-nn algorithm to classify). First of all, I converted the images to greyscale, and then I created a long column vector (by using OpenCV's imagedata function) from the image's pixels (128x128 = 16384 features in total). So I got a dataset like the following (the last column is the class label, and I only show the first 7 of the 16384 features):
176, 176, 175, 175, 177, 173, 178, 1
162, 161, 167, 162, 167, 166, 166, 2
But when I apply k-nn to this dataset, I get awkward results. Do I need to apply additional
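
For reference, a sketch of how such a dataset could be built with the modern OpenCV Python API (cv2); the file names and label assignments below are placeholders, not the asker's data.

```python
# Sketch of building such a dataset with the modern OpenCV Python API
# (file names and label mapping here are placeholders).
import cv2
import numpy as np

def image_to_row(path, size=(128, 128)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # greyscale, shape (H, W)
    img = cv2.resize(img, size)                    # force 128x128 = 16384 pixels
    return img.flatten().astype(np.float32)        # one long row vector

paths  = ['person1_a.png', 'person1_b.png', 'person2_a.png']  # placeholder files
labels = [1, 1, 2]

X = np.vstack([image_to_row(p) for p in paths])
y = np.array(labels)
# X can now be fed to a k-NN classifier; scaling or PCA usually helps a lot,
# because raw pixel intensities are a weak distance metric for faces.
```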

Key Points of the KNN Algorithm and Its Implementation

一世执手 submitted on 2019-12-05 20:14:56
Rambling preamble: This semester I am taking Professor 岳晓冬's Fundamentals of Machine Learning. The professor is excellent, and I am not, so no more on that. The clumsy bird flies first, so let's study hard! I took a Python data analysis elective in the fall semester, so I have a bit of a foundation; back then I wrote Python implementations of knn, PCA dimensionality reduction, and kmeans clustering, but there was still plenty I did not understand, and this is a good chance to clear it all up.
Knowledge points
knn is a basic method for classification and regression.
knn has no explicit training process.
Given a training set in which the class of each instance is known, compute the distance from the instance to be predicted to every instance in the training set, select the k nearest instances, and predict the class by majority vote.
knn has three basic elements:
1. the choice of k (this one is important!!)
2. the distance metric
3. the classification decision rule
Choice of k
1. Too small: overfitting is likely and the model becomes too complex. The prediction is extremely sensitive to the nearby instances; a change in the neighboring points changes the prediction.
2. Too large: underfitting is likely and the model becomes too simple. Points far from the input instance also influence the result, causing prediction errors.
Code implementation (Python)
Almost every line of code is commented and easy to follow. That is it for today; I will come back and add more when I have new insights.

import numpy as np
import operator

# test_data is the test sample, train_dataset the training set, train_label the labels
def knn_classify(test_data, train_dataset,
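
The snippet above is cut off by the excerpt. Below is a guess at how the function might continue, in the style suggested by the operator import; the remaining parameters (train_label, k) and the body are my assumptions, not the original author's code.

```python
# Hypothetical continuation of the truncated snippet above (my guess, not the
# original author's code): the remaining parameters are assumed to be
# train_label and k.
import numpy as np
import operator

def knn_classify(test_data, train_dataset, train_label, k):
    diff = train_dataset - test_data                 # broadcast subtraction
    distances = np.sqrt((diff ** 2).sum(axis=1))     # Euclidean distance to every training point
    order = distances.argsort()                      # indices sorted by increasing distance
    class_count = {}
    for i in range(k):                               # vote with the k nearest labels
        label = train_label[order[i]]
        class_count[label] = class_count.get(label, 0) + 1
    # sort the vote counts in descending order and return the winning label
    votes = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return votes[0][0]
```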

Unique assignment of closest points between two tables

随声附和 submitted on 2019-12-05 12:22:04
In my Postgres 9.5 database with PostGIS 2.2.0 installed, I have two tables with geometric data (points), and I want to assign points from one table to points from the other table, but I don't want a buildings.gid to be assigned twice. As soon as one buildings.gid is assigned, it should not be assigned to another pvanlagen.buildid . Table definitions buildings : CREATE TABLE public.buildings ( gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass), osm_id character varying(11), name character varying(48), type character varying(16), geom geometry(MultiPolygon,4326), centroid

Value of k in k nearest neighbor algorithm

被刻印的时光 ゝ submitted on 2019-12-05 11:35:55
I have 7 classes that need to be classified and I have 10 features. Is there an optimal value of k that I should use in this case, or do I have to run KNN for values of k between 1 and 10 (around 10) and determine the best value with the help of the algorithm itself? In addition to the article I posted in the comments, there is this one as well, which suggests: Choice of k is very critical – a small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and kinda defeats the basic philosophy behind KNN (that points that are
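
Rather than fixing k by hand, one common approach is a cross-validated grid search over k. Below is a sketch on synthetic data with 7 classes and 10 features, matching the shape described in the question; the sample count and search range are assumptions.

```python
# Sketch: pick k empirically with cross-validated grid search rather than
# guessing (data here is synthetic; 7 classes, 10 features as in the question).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=700, n_features=10, n_informative=8,
                           n_classes=7, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': range(1, 21)},
                      cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```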