distance | 易学教程

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

Question: My goal is to compute the KL distance between the following text documents:

1) The boy is having a lad relationship
2) The boy is having a boy relationship
3) It is a lovely day in NY

I first vectorised the documents so that numpy could be applied easily:

1) [1,1,1,1,1,1,1]
2) [1,2,1,1,1,2,1]
3) [1,1,1,1,1,1,1]

I then applied the following code to compute the KL distance between the texts:

import numpy as np
import math
from math import log
v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
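
The computation the question is aiming at can be sketched as follows. This is a minimal illustration, not the asker's code: the helper name `kl_divergence` is hypothetical, the count vectors are normalised to probability distributions first, and the second distribution is assumed to have no zero entries where the first is positive (true for the vectors above). Note that KL divergence is not symmetric, so it is not a distance in the strict sense.

```python
import numpy as np

def kl_divergence(p_counts, q_counts):
    """KL(P || Q) in nats for two count vectors of equal length."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    if p.shape != q.shape:
        raise ValueError("vectors must have the same length")
    # Turn raw counts into probability distributions.
    p = p / p.sum()
    q = q / q.sum()
    # Terms with p == 0 contribute nothing to the sum.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# The three vectors from the question.
v = [[1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 2, 1], [1, 1, 1, 1, 1, 1, 1]]
d_01 = kl_divergence(v[0], v[1])  # documents 1 and 2: small positive value
d_02 = kl_divergence(v[0], v[2])  # identical vectors: zero
```

Because vectors 1 and 3 are identical, their divergence is zero even though the underlying sentences differ, which suggests the vectorisation step deserves a second look.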

calculate distance between each pair of coordinates in wide dataframe

I want to calculate the distance between two linked sets of spatial coordinates (program and admin in my fake dataset). The data are in a wide format, so both pairs of coordinates are in the same row.

library(sp)
set.seed(1)
n <- 100
program.id <- seq(1, n)
c1 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
c2 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
dat <- data.frame(cbind(program.id, c1, c2))
names(dat) <- c("program.id", "program.lat", "program.long", "admin.lat", "admin.long")
head(dat)
# program.id program.lat program.long admin.lat admin.long
# 1 1 -42.20844 55.70061 -41.848523
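
For comparison, a row-wise great-circle distance over wide-format columns can be sketched in Python with numpy. This is not the R solution the question asks for: the function name `haversine_km`, the mean Earth radius of 6371 km, and the sample coordinates are all assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between paired points (degrees in, km out)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical wide-format columns: one program/admin pair per row.
program_lat = np.array([0.0, 10.0])
program_long = np.array([0.0, 20.0])
admin_lat = np.array([0.0, 10.0])
admin_long = np.array([180.0, 20.0])
dist_km = haversine_km(program_lat, program_long, admin_lat, admin_long)
```

The function is vectorised, so each element of `dist_km` is the distance for one row of the table.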

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) doesn't fit. I'm looking for a kNN implementation for R where it is possible to select different distance methods, like Hamming distance. Is there a way to use common kNN implementations like the one in {class} with different distance metric functions? I'm using R 2.15.

As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special
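
The distance-matrix idea can be sketched in Python as follows. This is an illustration rather than the R code of the answer: `hamming_matrix` and `knn_predict` are hypothetical helper names, the toy data is invented, and ties are broken by the order of the training rows.

```python
import numpy as np
from collections import Counter

def hamming_matrix(test, train):
    """Number of mismatching categorical features for every (test, train) pair."""
    test = np.asarray(test)
    train = np.asarray(train)
    return (test[:, None, :] != train[None, :, :]).sum(axis=2)

def knn_predict(test, train, labels, k=3):
    """Majority vote among the k training rows with the smallest Hamming distance."""
    dist = hamming_matrix(test, train)
    predictions = []
    for row in dist:
        nearest = np.argsort(row, kind="stable")[:k]
        votes = Counter(labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Invented toy data with purely categorical features.
train = [["red", "small", "round"], ["red", "small", "oval"],
         ["blue", "large", "round"], ["blue", "large", "square"]]
labels = ["a", "a", "b", "b"]
test = [["red", "small", "square"], ["blue", "large", "oval"]]
predicted = knn_predict(test, train, labels, k=3)
```

Any other dissimilarity can be swapped in by replacing `hamming_matrix`, since the voting step only needs the matrix.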

Cutting dendrogram into n trees with minimum cluster size in R

I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100 member groups, which can be very time-consuming. I've experimented with the dynamicTreeCut package, but can't figure out how to enter these

How to find the point most distant from a given set and its bounding box

Question: I have a bounding box, and a number of points inside of it. I'd like to add another point whose location is farthest away from any previously-added points, as well as far away from the edges of the box. Is there a common solution for this sort of thing? Thanks!

Answer 1: Here is a little Mathematica program. Although it is only two lines of code (!), you'll probably need more in a conventional language, as well as a math library able to find the maximum of a function. I assume you are not fluent in
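
In a conventional language, one rough way to approximate such a point is a grid search: score every candidate by the smaller of its distance to the nearest existing point and its distance to the nearest edge, then keep the best candidate. The sketch below is not the Mathematica program from the answer; the function name `farthest_point`, the grid resolution, and the sample points are assumptions.

```python
import numpy as np

def farthest_point(points, xmin, xmax, ymin, ymax, steps=201):
    """Grid-search the box for the candidate maximising
    min(distance to nearest point, distance to nearest edge)."""
    pts = np.asarray(points, dtype=float)
    xs = np.linspace(xmin, xmax, steps)
    ys = np.linspace(ymin, ymax, steps)
    gx, gy = np.meshgrid(xs, ys)
    # Distance from every grid cell to its nearest existing point.
    d_pts = np.sqrt(
        (gx[..., None] - pts[:, 0]) ** 2 + (gy[..., None] - pts[:, 1]) ** 2
    ).min(axis=2)
    # Distance from every grid cell to its nearest box edge.
    d_edge = np.minimum.reduce([gx - xmin, xmax - gx, gy - ymin, ymax - gy])
    score = np.minimum(d_pts, d_edge)
    iy, ix = np.unravel_index(np.argmax(score), score.shape)
    return (float(gx[iy, ix]), float(gy[iy, ix])), float(score[iy, ix])

# Hypothetical points inside a 10 x 10 box.
existing = [(2, 2), (8, 8), (2, 8)]
best, clearance = farthest_point(existing, 0, 10, 0, 10)
```

The result is only as accurate as the grid spacing; a local optimiser started from `best` could refine it.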

Finding a point on a Bézier curve when given the distance from the start point?

I created a 4 point Bézier curve, and a distance. Starting at the start point, how do I find the x,y coordinates of a point which is that distance away from the start point? I've looked at the other examples, and from what I can tell, they approximate the values by dividing the curve into several thousand points, then finding the nearest point. This will not work for me. For what I'm doing, I'd like to be accurate to only two decimal places. Below is a simple form of what I have to create my Bézier curve. (The y values are arbitrary, the x values are always 352 pixels apart). If it matters, I
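
One numerical approach, assuming "distance" means arc length measured along the curve, is to tabulate cumulative length over a fine sampling of the parameter and interpolate. The sketch below is an approximation under that assumption, not an exact solution; `bezier_point`, `point_at_arc_length`, the sample count, and the control points are all hypothetical.

```python
import numpy as np

def bezier_point(ctrl, t):
    """Evaluate a cubic Bezier curve with 4 control points at parameter(s) t."""
    p0, p1, p2, p3 = np.asarray(ctrl, dtype=float)
    t = np.asarray(t, dtype=float)[..., None]
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
        + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3

def point_at_arc_length(ctrl, distance, samples=2001):
    """Approximate the point lying `distance` along the curve from its start."""
    ts = np.linspace(0.0, 1.0, samples)
    pts = bezier_point(ctrl, ts)
    # Cumulative polyline length approximates the arc length at each t.
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    cum = np.concatenate(([0.0], np.cumsum(seg)))
    if not 0.0 <= distance <= cum[-1]:
        raise ValueError("distance exceeds the curve length")
    t = np.interp(distance, cum, ts)
    return bezier_point(ctrl, t)

# Hypothetical control points: evenly spaced on a line, so arc length is exact.
ctrl = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
p = point_at_arc_length(ctrl, 1.5)
```

Increasing `samples` tightens the approximation; a couple of thousand samples is usually far finer than two decimal places on curves a few hundred pixels long, but that should be checked against the actual curve.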

Find color names for colors close to colorBrewer palette

I want to use the R package SNA to do social network analysis. SNA colors elements only using R color names (text names). I'd like to find near matches from a ColorBrewer palette (Set3) to the color names in R. There aren't many exact matches in the RGB space.

require(RColorBrewer)
brew10 <- brewer.pal(10, "Set3")
rcol <- colors()
brew10rgb <- col2rgb(brew10)
allrgb <- col2rgb(rcol)
apply(t(brew10rgb), 1, paste, collapse="$$") %in% apply(t(allrgb), 1, paste, collapse="$$")
brew10rgb[,1]
fltr <- allrgb[1,]==141
allrgb[,fltr]
fltr <- allrgb[2,]==211
allrgb[,fltr]

Is there a way to pick good color
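
A nearest-match search can be sketched in Python as a minimum Euclidean distance in RGB space. This is an illustration only: `NAMED_COLORS` is a handful of X11 color names with their standard RGB values, standing in for the full list that R's `colors()` returns, and the helper names are hypothetical. Plain RGB distance does not track perceived similarity closely, so a perceptual color space may give better matches.

```python
import numpy as np

# Small illustrative subset of X11 color names (assumption: stands in for
# the full table of R color names).
NAMED_COLORS = {
    "aquamarine": (127, 255, 212),
    "mediumaquamarine": (102, 205, 170),
    "lightblue": (173, 216, 230),
    "skyblue": (135, 206, 235),
    "paleturquoise": (175, 238, 238),
}

def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (R, G, B) tuple of ints."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def nearest_color_name(rgb, named=NAMED_COLORS):
    """Name of the table entry with the smallest squared RGB distance."""
    names = list(named)
    table = np.array([named[n] for n in names], dtype=float)
    d2 = ((table - np.asarray(rgb, dtype=float)) ** 2).sum(axis=1)
    return names[int(d2.argmin())]

# First Set3 color, (141, 211, 199), matching the values filtered on above.
first_set3 = hex_to_rgb("#8DD3C7")
match = nearest_color_name(first_set3)
```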

Optimized method for calculating cosine distance in Python

I wrote a method to calculate the cosine distance between two arrays:

from math import sqrt

def cosine_distance(a, b):
    if len(a) != len(b):
        return False
    numerator = 0
    denoma = 0
    denomb = 0
    for i in range(len(a)):
        numerator += a[i]*b[i]
        denoma += abs(a[i])**2
        denomb += abs(b[i])**2
    result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
    return result

Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?

Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:

def cosine_distance
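
A vectorised variant of the loop in the question can be sketched with numpy. This is not the "version to beat" referred to above, which is cut off here; the name `cosine_distance_np` is hypothetical, and raising an error on mismatched lengths (instead of returning False) is a design choice of this sketch. Both inputs are assumed to be non-zero vectors.

```python
import numpy as np

def cosine_distance_np(a, b):
    """1 - cosine similarity, computed with vectorised numpy operations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("arrays must have the same shape")
    # The dot product and norms replace the explicit Python loop.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

orthogonal = cosine_distance_np([1, 0], [0, 1])      # perpendicular vectors
parallel = cosine_distance_np([1, 2, 3], [2, 4, 6])  # same direction
```

Most of the speed-up comes from converting the inputs to arrays once and letting numpy do the arithmetic in compiled code.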

Fastest way to find the closest point to a given point in 3D, in Python

So let's say I have 10,000 points in A and 10,000 points in B and want to find the closest point in A for every point in B. Currently, I simply loop through every point in B and A to find which one is closest, i.e.:

B = [(.5, 1, 1), (1, .1, 1), (1, 1, .2)]
A = [(1, 1, .3), (1, 0, 1), (.4, 1, 1)]
C = {}
for bp in B:
    closestDist = -1
    for ap in A:
        dist = sum(((bp[0]-ap[0])**2, (bp[1]-ap[1])**2, (bp[2]-ap[2])**2))
        if(closestDist > dist or closestDist == -1):
            C[bp] = ap
            closestDist = dist
print(C)

However, I am sure there is a faster way to do this... any ideas?

I typically use a kd-tree
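
As a first step before reaching for a kd-tree, the nested loop can be replaced by a vectorised brute-force search in numpy. The sketch below is that simpler technique, not a kd-tree; the function name `nearest_in_a` and the chunk size are assumptions, and chunking is there only to keep the intermediate distance array small.

```python
import numpy as np

def nearest_in_a(A, B, chunk=256):
    """For each point in B, return the index of the closest point in A."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    idx = np.empty(len(B), dtype=np.intp)
    for start in range(0, len(B), chunk):
        block = B[start:start + chunk]
        # Squared distances between every point in the block and every point in A.
        d2 = ((block[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        idx[start:start + chunk] = d2.argmin(axis=1)
    return idx

# The sample points from the question.
B = [(.5, 1, 1), (1, .1, 1), (1, 1, .2)]
A = [(1, 1, .3), (1, 0, 1), (.4, 1, 1)]
closest = nearest_in_a(A, B)
C = {bp: A[i] for bp, i in zip(B, closest)}
```

This is still quadratic in the number of points, so for much larger inputs a spatial index such as a kd-tree scales better.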

Cosine distance as vector distance function for k-means

I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each with N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g. the vector v1 = {100, 50, 0, 30, 0} would mean that we spent 100 secs at vertex 1, 50 secs at vertex 2 and 30 secs at vertex 4 (vertices 3 & 5 were not visited, thus the 0s). I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine
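
The assignment step of such a clustering needs the cosine distance between every user vector and every centroid, which can be sketched as below. The helper names and every vector except v1 are hypothetical, and each vector is assumed to have at least one non-zero entry. For unit-length vectors the squared Euclidean distance equals twice the cosine distance, so normalising the rows first lets ordinary Euclidean k-means produce the same assignments.

```python
import numpy as np

def unit_rows(X):
    """Scale every row to unit Euclidean length (rows must be non-zero)."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine_distance_matrix(X, C):
    """1 - cosine similarity between every row of X and every row of C."""
    return 1.0 - unit_rows(X) @ unit_rows(C).T

# v1 from the text plus two invented user vectors.
users = np.array([
    [100, 50, 0, 30, 0],   # v1
    [200, 100, 0, 60, 0],  # same proportions as v1, twice the time
    [0, 0, 40, 0, 10],     # visits only the places v1 skipped
])
D = cosine_distance_matrix(users, users)

# Squared Euclidean distance between unit vectors is 2 * cosine distance.
U = unit_rows(users)
sq_euclid = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
```

Because cosine distance ignores vector length, a user who spends twice as long everywhere is treated as identical to v1, which may or may not be the desired behaviour.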
