knn

Learning Machine Learning from Negative Infinity (Part 1)

我的梦境 submitted on 2019-12-14 11:09:19
It happened to be the Singles' Day sale, so I bought a book called《深入浅出Python机器学习》; the author explains the principles of machine learning vividly. Love it! ヾ(◍°∇°◍)ノ゙
I. Essential basic libraries
(1) numpy: the fundamental scientific computing library
import numpy  # fundamental scientific computing library
i = numpy.array([[1, 2, 3], [4, 5, 6]])  # assign a 2x3 array to i
print("i:\n{}".format(i))  # print the array i
(2) scipy: a collection of scientific computing tools
import numpy as np
from scipy import sparse
matrix = np.eye(3)  # create a 3x3 identity matrix
sparse_matrix = sparse.csr_matrix(matrix)  # convert the NumPy array to a SciPy sparse matrix in CSR format
# the sparse format only stores the non-zero elements
print("identity matrix:\n {}".format(matrix))  # print the dense array
print("\n sparse matrix:\n{}".format(sparse_matrix))  # compare the two representations
(3) pandas: data analysis
# import the data-analysis library
import pandas
data = {"Name": [
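The pandas excerpt above is cut off; as a minimal sketch of what such a DataFrame example typically looks like (the column names and values here are assumptions, not the book's actual data):

```python
import pandas as pd

# hypothetical data: the book's dictionary is truncated above,
# so these names and ages are made up purely for illustration
data = {
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 32, 47],
}
frame = pd.DataFrame(data)  # build a tabular DataFrame from the dictionary
print(frame)                # pandas prints it as a neatly aligned table
```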

Efficient KNN implementation which allows inserts

假装没事ソ submitted on 2019-12-13 18:24:20
Question: Suppose I have a multi-dimensional dataset containing many vectors. I am writing an algorithm which needs to do k-nearest-neighbour searches for all those vectors - classical KNN. However, during my algorithm I add new vectors to the overall dataset and need to include those new vectors in my KNN search. I want to do that efficiently. I looked into the KD tree and ball tree of scikit-learn, but they don't allow inserts (by the nature of the concepts). I am not sure whether SR tree or R
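A minimal sketch of one workaround for the situation above: keep the vectors in a growable array and answer queries by brute force with NumPy. The class name GrowableKNN is mine, not from any library, and for large datasets a real dynamic index would be needed:

```python
import numpy as np

class GrowableKNN:
    """Brute-force KNN over a growable set of vectors (illustrative only)."""

    def __init__(self, dim):
        self.data = np.empty((0, dim))

    def insert(self, vector):
        # appending copies the array; fine for a sketch, slow for huge data
        self.data = np.vstack([self.data, np.asarray(vector, dtype=float)])

    def query(self, vector, k):
        # Euclidean distance from the query to every stored vector
        dists = np.linalg.norm(self.data - np.asarray(vector, dtype=float), axis=1)
        idx = np.argsort(dists)[:k]  # indices of the k closest stored vectors
        return idx, dists[idx]

knn = GrowableKNN(dim=3)
for v in [[0, 0, 0], [1, 1, 1], [5, 5, 5]]:
    knn.insert(v)
print(knn.query([0.9, 1.0, 1.1], k=2))  # the two nearest stored vectors
```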

mlpack nearest neighbor with cosine distance?

随声附和 submitted on 2019-12-13 18:23:11
Question: I'd like to use the NeighborSearch class in mlpack to perform KNN classification on some vectors representing documents. I'd like to use cosine distance, but I'm having trouble. I think the way to do this is to use the inner-product metric "IPMetric" and specify the CosineDistance kernel. This is what I have:
NeighborSearch<NearestNeighborSort, IPMetric<CosineDistance>> nn(X_train);
But I get the following compile errors: /usr/include/mlpack/core/tree/hrectbound_impl.hpp:211:15: error:
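Not an mlpack-specific fix, but one common workaround, sketched here in Python, relies on the fact that for L2-normalized vectors the squared Euclidean distance equals 2·(1 − cosine similarity), so Euclidean nearest neighbours on normalized data are exactly the cosine nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))  # toy document vectors

# L2-normalize every row; after this, Euclidean and cosine rankings agree
X_norm = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(X_norm)

query = rng.normal(size=(1, 20))
query = query / np.linalg.norm(query)
dist, idx = nn.kneighbors(query)      # cosine-equivalent nearest neighbours
print(idx)
```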

A discrepancy in computing nearest neighbours between R and Java + WEKA

此生再无相见时 submitted on 2019-12-13 16:09:18
Question: I am in the process of debugging a library against another implementation, which involves computing k-nearest neighbours. I am framing the question with an example that I am having difficulty understanding. First I will demonstrate the setup with a toy example, then show the output, which will lead to the question. Task: the demo here reads a CSV file containing ten 2-dimensional datapoints. The task is to find the distance of every datapoint from the first datapoint, and list all
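A minimal sketch of the task described above, written in Python rather than R or Java + WEKA; the datapoint values are made up, since the original CSV is not shown:

```python
import numpy as np

# hypothetical stand-in for the ten 2-D datapoints read from the CSV
points = np.array([[1.0, 2.0], [2.0, 1.5], [0.5, 0.5], [3.0, 4.0], [2.5, 2.5],
                   [1.2, 0.8], [4.0, 1.0], [0.0, 3.0], [3.5, 3.5], [1.8, 2.2]])

first = points[0]
dists = np.linalg.norm(points - first, axis=1)  # distance of every point from the first one

for i in np.argsort(dists):                     # list all points ordered by that distance
    print(i, dists[i])
```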

K-Nearest Neighbor Implementation for Strings (Unstructured data) in Java

孤者浪人 submitted on 2019-12-13 07:53:37
Question: I'm looking for an implementation of the K-Nearest Neighbor algorithm in Java for unstructured data. I found many implementations for numeric data, but how can I implement it and compute the Euclidean distance for text (Strings)? Here is one example for double:
public static double EuclideanDistance(double[] X, double[] Y) {
    int count = 0;
    double distance = 0.0;
    double sum = 0.0;
    if (X.length != Y.length) {
        try {
            throw new Exception("the number of elements" + " in X must match the number of
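The question asks for Java, but as a language-agnostic sketch of one common approach, here is a Python version that replaces Euclidean distance with edit (Levenshtein) distance between strings and takes a majority vote over the k nearest training strings; the toy training data are made up:

```python
from collections import Counter

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # delete from a
                            curr[j - 1] + 1,    # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def knn_predict(train, query, k=3):
    """train is a list of (string, label) pairs; returns the majority label of the k nearest strings."""
    neighbours = sorted(train, key=lambda item: edit_distance(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [("spam offer now", "spam"), ("cheap pills", "spam"),
         ("meeting at noon", "ham"), ("lunch tomorrow?", "ham"), ("project update", "ham")]
print(knn_predict(train, "cheap offer now", k=3))  # prints the majority label of the 3 nearest strings
```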

How to find 'feature importance' or variable importance graph for KNNClassifier()

拈花ヽ惹草 submitted on 2019-12-13 03:49:48
Question: I am working on a numerical dataset using the KNN classifier of the sklearn package. Once the prediction is complete, the top 4 most important variables should be displayed in a bar graph. Here is the solution I tried, but it throws an error saying that feature_importances is not an attribute of KNNClassifier:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred = neigh.predict(X_test)
(pd.Series(neigh.feature_importances_, index=X_test.columns)
   .nlargest(4)
   .plot(kind='barh'))
Now
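KNeighborsClassifier indeed has no feature_importances_ attribute. One common substitute, sketched here on a toy dataset rather than the asker's data and not necessarily what the original answer proposed, is scikit-learn's permutation importance, which measures how much shuffling each feature degrades the score:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# toy stand-in for the question's numerical dataset
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

neigh = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# shuffle each column of X_test several times and record the resulting drop in accuracy
result = permutation_importance(neigh, X_test, y_test, n_repeats=10, random_state=0)

(pd.Series(result.importances_mean, index=X_test.columns)
   .nlargest(4)
   .plot(kind="barh"))
```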

Q: KNN in R — strange behavior

≯℡__Kan透↙ submitted on 2019-12-12 18:15:01
Question: Does anyone know why the KNN R code below gives different predictions for different seeds? This is strange, as K <- 5, so the majority vote is well defined. In addition, the floating-point numbers are large, so no data-precision problem arises (as in this post).
library(class)
set.seed(642002713)
m = 20
n = 1000
from = -(2^30)
to = -(from)
train = matrix(runif(m*n, from, to), nrow=m, ncol=n)
trainLabels = sample.int(2, size = m, replace=T)-1
test = matrix(runif(n, from, to), nrow=1)
K <- 5
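For contrast, a small Python sketch, which is not an explanation of the R behaviour but only a baseline: a plain brute-force KNN vote with odd k is fully deterministic once the training data are fixed, whatever seed is set afterwards. The data here mirror the R snippet's shapes but are otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(642002713)
m, n = 20, 1000
train = rng.uniform(-2**30, 2**30, size=(m, n))   # 20 training rows, 1000 features
labels = rng.integers(0, 2, size=m)               # binary 0/1 labels
test = rng.uniform(-2**30, 2**30, size=(1, n))    # a single test row

def knn_vote(train, labels, test, k=5):
    dists = np.linalg.norm(train - test, axis=1)  # Euclidean distance to every training row
    nearest = labels[np.argsort(dists)[:k]]       # labels of the k closest rows
    return 1 if nearest.sum() * 2 > k else 0      # majority vote

# changing the seed after the data are fixed cannot change this prediction
for seed in (1, 2, 3):
    np.random.seed(seed)
    print(knn_vote(train, labels, test))
```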

Machine Learning Algorithms (1): The KNN Algorithm

微笑、不失礼 submitted on 2019-12-12 15:33:04
I. Overview of the algorithm
The k-nearest neighbours algorithm (k-nearest neighbors), i.e. KNN, is a form of instance-based learning, or lazy learning, in which all computation is deferred until classification time. KNN is arguably one of the simplest classification algorithms and, at the same time, one of the most commonly used. Note that KNN is a classification algorithm from supervised learning.
II. How the algorithm works
The idea behind KNN is this: if the majority of the k samples most similar to a given sample in feature space (i.e. its nearest neighbours in feature space) belong to a certain class, then that sample belongs to that class as well. K is usually an odd number no larger than 20. In the words of an old Chinese saying, "he who stays near vermilion turns red, he who stays near ink turns black": you become like the company you keep.
An example: how does a newly hatched eaglet decide whether it is an eagle or a chicken? In the figure below, suppose green represents the eaglet, red represents chickens, and blue represents eagles. It opens its eyes and looks at the 3 animals around it: two are chickens and one is an eagle, so it classifies itself as a chicken. When it looks further out, it sees the 5 animals around it: three are eagles and two are chickens, so it classifies itself as an eagle.
From the example above we can see that the algorithm involves three main factors: the training set, the distance (or similarity) measure, and the size of k.
The basic steps of the algorithm (sketched in code below):
Compute distances: given a test object, compute its distance to every object in the training set.
Find neighbours: select the k training objects with the smallest distances as the test object's nearest neighbours.
Classify
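A minimal sketch of the three steps above in plain Python/NumPy; the toy "chicken"/"eagle" points are made up for illustration:

```python
from collections import Counter
import numpy as np

def knn_classify(test_point, train_points, train_labels, k=3):
    # Step 1, compute distances: Euclidean distance from the test object to every training object
    dists = np.linalg.norm(train_points - test_point, axis=1)
    # Step 2, find neighbours: the k training objects with the smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 3, classify: majority vote among the k neighbours' labels
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# toy data: "chicken" vs "eagle" points in a 2-D feature space
train_points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                         [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
train_labels = ["chicken", "chicken", "chicken", "eagle", "eagle", "eagle"]

print(knn_classify(np.array([1.1, 1.0]), train_points, train_labels, k=3))  # chicken
print(knn_classify(np.array([4.8, 5.0]), train_points, train_labels, k=3))  # eagle
```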

Creating a dataset from an image with Python for face recognition

泪湿孤枕 submitted on 2019-12-12 09:53:36
Question: I am trying to code a face-recognition program in Python (I am going to apply the k-NN algorithm to classify). First of all, I converted the images to greyscale and then created a long column vector (using OpenCV's imagedata function) from the image's pixels (128x128 = 16384 features in total). So I got a dataset like the following (the last column is the class label, and I only show the first 7 of the 16384 features):
176, 176, 175, 175, 177, 173, 178, 1
162, 161, 167, 162, 167,
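A sketch of the same preprocessing with the modern cv2 API; the file name and label are placeholders, and the original question used an older OpenCV interface:

```python
import cv2
import numpy as np

# hypothetical input file; any face image would do
img = cv2.imread("face_001.png")              # load the image (BGR)
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to greyscale
grey = cv2.resize(grey, (128, 128))           # force a fixed 128x128 size
features = grey.flatten().astype(np.float32)  # 16384 pixel values as one row vector

label = 1                                     # class label of this image (hypothetical)
row = np.append(features, label)              # one dataset row: 16384 features + label
print(row.shape)                              # (16385,)
```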

Why does k=1 in KNN give the best accuracy?

痞子三分冷 submitted on 2019-12-12 01:36:26
Question: I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained? Answer 1: If you are querying your learner with the same dataset you trained on, then with k=1 the output values should be perfect, barring data points with identical features but different outcome values. Do some reading on overfitting as it applies to KNN learners. In
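A small sketch, using scikit-learn rather than Weka and purely to illustrate the answer's point: with k=1, evaluating on the training set itself gives essentially perfect accuracy, because every query finds itself as its own nearest neighbour, so only held-out accuracy is meaningful:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # training accuracy is inflated for k=1 (each point is its own nearest neighbour);
    # the held-out test accuracy is the number that actually matters
    print(k, clf.score(X_train, y_train), clf.score(X_test, y_test))
```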