distance

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

萝らか妹 submitted on 2019-12-03 16:58:20
Question: My goal is to compute the KL distance between the following text documents:

1) The boy is having a lad relationship
2) The boy is having a boy relationship
3) It is a lovely day in NY

I first vectorised the documents so that numpy could be applied easily:

1) [1,1,1,1,1,1,1]
2) [1,2,1,1,1,2,1]
3) [1,1,1,1,1,1,1]

I then applied the following code to compute the KL distance between the texts:

    import numpy as np
    import math
    from math import log
    v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
    c=v[0]
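For orientation, here is a minimal sketch of a KL computation over the three count vectors from the question. It is not the asker's code, which is cut off above. It uses only the standard library, and the helper names `normalize` and `kl_divergence` are illustrative. KL divergence is defined on probability distributions, so the counts are normalised first. It is also asymmetric, so strictly speaking it is a divergence rather than a distance.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """D(P || Q) in nats. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Count vectors copied from the question.
v = [[1, 1, 1, 1, 1, 1, 1], [1, 2, 1, 1, 1, 2, 1], [1, 1, 1, 1, 1, 1, 1]]
dists = [normalize(x) for x in v]

d01 = kl_divergence(dists[0], dists[1])  # document 1 vs document 2
d10 = kl_divergence(dists[1], dists[0])  # reversed order gives a different value
d02 = kl_divergence(dists[0], dists[2])  # identical vectors give zero
```

Vectors 1 and 3 are identical, so their divergence is zero even though the sentences differ. This suggests the vectorisation step, rather than the formula, decides how informative the result is.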

calculate distance between each pair of coordinates in wide dataframe

浪子不回头ぞ submitted on 2019-12-03 16:16:57
I want to calculate the distance between two linked sets of spatial coordinates (program and admin in my fake dataset). The data are in a wide format, so both pairs of coordinates are in the same row.

    library(sp)
    set.seed(1)
    n <- 100
    program.id <- seq(1, n)
    c1 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
    c2 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
    dat <- data.frame(cbind(program.id, c1, c2))
    names(dat) <- c("program.id", "program.lat", "program.long", "admin.lat", "admin.long")
    head(dat)
    #   program.id program.lat program.long admin.lat admin.long
    # 1          1   -42.20844     55.70061 -41.848523
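The row-wise idea can be sketched in Python (not the R used in the question). The sketch assumes great-circle distance on a sphere of radius 6371 km is accurate enough. The function name `haversine_km` and the sample rows are illustrative, not taken from the asker's data. Only the column names mirror the question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Illustrative wide-format rows: both coordinate pairs live in the same record.
rows = [
    {"program.id": 1, "program.lat": 0.0, "program.long": 0.0,
     "admin.lat": 0.0, "admin.long": 90.0},
    {"program.id": 2, "program.lat": 10.0, "program.long": 20.0,
     "admin.lat": 10.0, "admin.long": 20.0},
]

# One distance per row, computed from that row's two coordinate pairs.
for row in rows:
    row["dist_km"] = haversine_km(row["program.lat"], row["program.long"],
                                  row["admin.lat"], row["admin.long"])
```

Because each distance depends only on values in its own row, the same function could be applied column-wise in a data-frame library without reshaping the data to long format.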

using k-NN in R with categorical values

元气小坏坏 submitted on 2019-12-03 15:45:22
I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numeric features) doesn't fit. I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance. Is there a way to use common kNN implementations like the one in {class} with different distance metric functions? I'm using R 2.15.

As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special
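The point that any dissimilarity can drive kNN can be sketched as follows. The sketch is in Python rather than R, and the training data and the names `hamming` and `knn_predict` are made up for illustration. It is a brute-force majority vote, not the implementation from the {class} package.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training records closest to `query`."""
    # sorted() is stable, so ties in distance keep their original order.
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Illustrative categorical data (not from the question).
train_X = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("green", "large", "round"),
    ("green", "large", "oval"),
    ("red", "large", "round"),
]
train_y = ["A", "A", "B", "B", "A"]

pred = knn_predict(train_X, train_y, ("red", "small", "round"), k=3)
```

Swapping `hamming` for another dissimilarity function changes the metric without touching the voting logic.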

Cutting dendrogram into n trees with minimum cluster size in R

泄露秘密 submitted on 2019-12-03 15:04:55
I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time consuming. I've experimented with the dynamicTreeCut package, but can't figure out how to enter these

How to find the point most distant from a given set and its bounding box

廉价感情. submitted on 2019-12-03 13:19:25
Question: I have a bounding box, and a number of points inside of it. I'd like to add another point whose location is farthest away from any previously-added points, as well as far away from the edges of the box. Is there a common solution for this sort of thing? Thanks!

Answer 1: Here is a little Mathematica program. Although it is only two lines of code (!) you'll probably need more in a conventional language, as well as a math library able to find maximum of functions. I assume you are not fluent in
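A rough way to see the problem in a conventional language is a coarse grid search. The sketch below is not the Mathematica answer, which is cut off above. It assumes the objective is to maximise the smaller of two values: the distance to the nearest existing point and the distance to the nearest box edge. The names `clearance` and `farthest_point`, the grid resolution, and the sample points are all illustrative.

```python
import math

def clearance(x, y, points, box):
    """Smaller of: distance to nearest box edge, distance to nearest existing point."""
    xmin, ymin, xmax, ymax = box
    edge = min(x - xmin, xmax - x, y - ymin, ymax - y)
    nearest = min(math.hypot(x - px, y - py) for px, py in points)
    return min(edge, nearest)

def farthest_point(points, box, steps=100):
    """Grid search for the location with the largest clearance (approximate)."""
    xmin, ymin, xmax, ymax = box
    best, best_score = None, -1.0
    for i in range(steps + 1):
        x = xmin + (xmax - xmin) * i / steps
        for j in range(steps + 1):
            y = ymin + (ymax - ymin) * j / steps
            score = clearance(x, y, points, box)
            if score > best_score:
                best, best_score = (x, y), score
    return best, best_score

# Illustrative input: a 10 x 10 box containing three points.
box = (0.0, 0.0, 10.0, 10.0)
points = [(2.0, 2.0), (8.0, 3.0), (5.0, 8.0)]
location, score = farthest_point(points, box)
```

The result is only as precise as the grid spacing. A numerical optimiser could refine the best grid cell if more precision is needed.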

Finding a point on a Bézier curve when given the distance from the start point?

梦想与她 submitted on 2019-12-03 13:02:22
I created a 4 point Bézier curve, and a distance. Starting at the start point, how do I find the x,y coordinates of a point which is that distance away from the start point? I've looked at the other examples, and from what I can tell, they approximate the values by dividing the curve into several thousand points, then finding the nearest point. This will not work for me. For what I'm doing, I'd like to be accurate to only two decimal places. Below is a simple form of what I have to create my Bézier curve. (The y values are arbitrary, the x values are always 352 pixels apart). If it matters, I
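One possible approach is sketched below. It assumes "distance" means arc length measured along the curve, not straight-line distance from the start point. It integrates the curve's speed |B'(t)| with Simpson's rule and bisects on t until the arc length matches. The control points are hypothetical: only the 352-pixel x spacing comes from the question. All function names are illustrative.

```python
import math

def bezier_point(p, t):
    """Point on the cubic Bézier curve with control points p[0..3] at parameter t."""
    u = 1.0 - t
    return tuple(u**3 * p[0][k] + 3 * u * u * t * p[1][k]
                 + 3 * u * t * t * p[2][k] + t**3 * p[3][k] for k in range(2))

def bezier_speed(p, t):
    """Length of the derivative vector B'(t)."""
    u = 1.0 - t
    d = [3 * u * u * (p[1][k] - p[0][k]) + 6 * u * t * (p[2][k] - p[1][k])
         + 3 * t * t * (p[3][k] - p[2][k]) for k in range(2)]
    return math.hypot(d[0], d[1])

def arc_length(p, t, n=64):
    """Arc length from 0 to t via composite Simpson's rule (n must be even)."""
    if t <= 0.0:
        return 0.0
    h = t / n
    total = bezier_speed(p, 0.0) + bezier_speed(p, t)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * bezier_speed(p, i * h)
    return total * h / 3.0

def point_at_distance(p, dist):
    """Point whose arc length from the start is approximately `dist`."""
    if dist >= arc_length(p, 1.0):
        return bezier_point(p, 1.0)
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-9:  # bisection works because arc length grows with t
        mid = (lo + hi) / 2.0
        if arc_length(p, mid) < dist:
            lo = mid
        else:
            hi = mid
    return bezier_point(p, (lo + hi) / 2.0)

# Hypothetical control points: x values 352 px apart, y values arbitrary.
curve = [(0.0, 100.0), (352.0, 300.0), (704.0, 50.0), (1056.0, 200.0)]
x, y = point_at_distance(curve, 500.0)
```

The accuracy depends on the Simpson step count `n` and the bisection tolerance. Both are assumptions that may need tuning for strongly curved segments.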

Find color names for colors close to colorBrewer palette

こ雲淡風輕ζ submitted on 2019-12-03 12:59:53
I want to use the R package SNA to do social network analysis. SNA colors elements only using R color names (text names). I'd like to find near matches from a ColorBrewer palette (set3) to the color names in R. There aren't many exact matches in the RGB space.

    require(RColorBrewer)
    brew10 <- brewer.pal(10, "Set3")
    rcol <- colors()
    brew10rgb <- col2rgb(brew10)
    allrgb <- col2rgb(rcol)
    apply(t(brew10rgb), 1, paste, collapse="$$") %in% apply(t(allrgb), 1, paste,collapse="$$")
    brew10rgb[,1]
    fltr <- allrgb[1,]==141
    allrgb[,fltr]
    fltr <- allrgb[2,]==211
    allrgb[,fltr]

Is there a way to pick good color

Optimized method for calculating cosine distance in Python

泄露秘密 submitted on 2019-12-03 12:39:19
I wrote a method to calculate the cosine distance between two arrays:

    from math import sqrt

    def cosine_distance(a, b):
        if len(a) != len(b):
            return False
        numerator = 0
        denoma = 0
        denomb = 0
        for i in range(len(a)):
            numerator += a[i]*b[i]
            denoma += abs(a[i])**2
            denomb += abs(b[i])**2
        result = 1 - numerator / (sqrt(denoma)*sqrt(denomb))
        return result

Running it can be very slow on a large array. Is there an optimized version of this method that would run faster?

Update: I've tried all the suggestions to date, including scipy. Here's the version to beat, incorporating suggestions from Mike and Steve:

    def cosine_distance
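A tidier pure-Python variant is sketched below. It is not the "version to beat" from the update, which is cut off above. It assumes real-valued inputs, iterates with `zip` instead of indexing, and takes a single square root. It also raises an exception on a length mismatch: `False == 0` in Python, so returning `False` could be mistaken for a distance of zero.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two equal-length sequences of real numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = norm_a = norm_b = 0.0
    for x, y in zip(a, b):  # single pass, no repeated indexing
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / math.sqrt(norm_a * norm_b)

d_orthogonal = cosine_distance([1, 0], [0, 1])      # perpendicular vectors
d_parallel = cosine_distance([1, 2, 3], [2, 4, 6])  # same direction
```

For very large arrays, a vectorised array library would likely be faster still. That is outside what this standard-library sketch shows.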

Fastest way to find the closest point to a given point in 3D, in Python

痞子三分冷 submitted on 2019-12-03 11:53:38
So let's say I have 10,000 points in A and 10,000 points in B and want to find out the closest point in A for every B point. Currently, I simply loop through every point in B and A to find which one is closest in distance, i.e.:

    B = [(.5, 1, 1), (1, .1, 1), (1, 1, .2)]
    A = [(1, 1, .3), (1, 0, 1), (.4, 1, 1)]
    C = {}
    for bp in B:
        closestDist = -1
        for ap in A:
            dist = sum(((bp[0]-ap[0])**2, (bp[1]-ap[1])**2, (bp[2]-ap[2])**2))
            if(closestDist > dist or closestDist == -1):
                C[bp] = ap
                closestDist = dist
    print(C)

However, I am sure there is a faster way to do this... any ideas?

I typically use a kd-tree
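To illustrate the kd-tree idea the answer mentions, here is a minimal pure-Python sketch for 3D points. It is not the answerer's code, which is cut off above. The names `build_kdtree`, `sq_dist`, and `nearest` are illustrative, and a library kd-tree would normally be preferable for real workloads.

```python
def build_kdtree(points, depth=0):
    """Recursively split points on x, y, z in turn; a node is (point, left, right)."""
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(p, q):
    """Squared Euclidean distance (enough for comparing which point is closer)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(node, target, depth=0, best=None):
    """Return the tree point closest to `target`."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(point, target) < sq_dist(best, target):
        best = point
    axis = depth % 3
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

# Sample points copied from the question.
A = [(1, 1, .3), (1, 0, 1), (.4, 1, 1)]
B = [(.5, 1, 1), (1, .1, 1), (1, 1, .2)]

tree = build_kdtree(A)                   # built once
C = {bp: nearest(tree, bp) for bp in B}  # one tree query per point in B
```

The tree is built once and each query prunes subtrees that cannot contain a closer point. On typical data this avoids comparing every B point against every A point.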

Cosine distance as vector distance function for k-means

我只是一个虾纸丫 submitted on 2019-12-03 11:52:16
I have a graph of N vertices where each vertex represents a place. I also have vectors, one per user, each of N coefficients, where a coefficient's value is the duration in seconds spent at the corresponding place, or 0 if that place was not visited. E.g. for the graph: the vector: v1 = {100, 50, 0, 30, 0} would mean that we spent 100 secs at vertex 1, 50 secs at vertex 2, and 30 secs at vertex 4 (vertices 3 & 5 were not visited, thus the 0s). I want to run a k-means clustering and I've chosen cosine_distance = 1 - cosine_similarity as the metric for the distances, where the formula for cosine
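One relationship that may help here: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance, i.e. ||a - b||^2 = 2(1 - cos(a, b)). The sketch below checks this numerically. `v1` is the vector from the question, `v2` is a hypothetical second user, and the helper names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

v1 = [100, 50, 0, 30, 0]  # seconds per vertex, from the question
v2 = [10, 0, 60, 0, 20]   # hypothetical second user

u1, u2 = l2_normalize(v1), l2_normalize(v2)
lhs = squared_euclidean(u1, u2)     # Euclidean view on unit vectors
rhs = 2.0 * cosine_distance(v1, v2) # cosine view on raw vectors
```

Because of this identity, normalising each user vector to unit length before running ordinary Euclidean k-means orders pairs of points the same way cosine distance does. The centroids are then means of unit vectors and are generally not unit length themselves.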