similarity

How to detect similar images in PHP?

旧巷老猫 submitted on 2019-12-03 09:39:21
I have many files of the same picture at various resolutions, suitable for different devices like mobile, PC, PSP, etc. Now I am trying to display only the unique pictures on the page, but I don't know how. I could have avoided this if I had maintained a database in the first place, but I didn't, and I need your help detecting the largest unique pictures. Well, even though there are quite a few algorithms to do that, I believe it would still be faster to do it manually: download all the images and feed them into something like Windows Live Photo Gallery or any other software that can match similar images.
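If it does have to happen in code, a common automated approach is perceptual hashing: shrink every file to a tiny grayscale thumbnail, turn it into a bit string, group files whose hashes match, and keep the highest-resolution file per group. A minimal sketch of that average-hash idea, written in Python with Pillow for brevity (the same logic can be ported to PHP's GD functions); the folder name and *.jpg pattern are hypothetical:

    from pathlib import Path
    from PIL import Image

    def average_hash(path, size=8):
        """Shrink to a size x size grayscale thumbnail and build a 64-bit average hash."""
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        # one bit per pixel: 1 if brighter than the average, else 0
        return "".join("1" if p > avg else "0" for p in pixels)

    def largest_unique(folder):
        """Group images by hash and keep the largest (by pixel count) in each group."""
        best = {}
        for path in Path(folder).glob("*.jpg"):
            h = average_hash(path)
            w, hgt = Image.open(path).size
            if h not in best or w * hgt > best[h][1]:
                best[h] = (path, w * hgt)
        return [p for p, _ in best.values()]

    print(largest_unique("photos/"))  # hypothetical folder of mixed-resolution copies

Exact hash equality usually suffices for plain resizes of the same picture; for cropped or edited copies, comparing hashes by Hamming distance is more forgiving.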

Creating a Gin Index with Trigram (gin_trgm_ops) in Django model

点点圈 submitted on 2019-12-03 09:27:04
Question: The new TrigramSimilarity feature of django.contrib.postgres was great for a problem I had. I use it for a search bar to find hard-to-spell Latin names. The problem is that there are over 2 million names, and the search takes longer than I want. I'd like to create an index on the trigrams as described in the PostgreSQL documentation https://www.postgresql.org/docs/9.6/static/pgtrgm.html, but I am not sure how to do this in a way that the Django API would make use of it. For the postgres text …
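For reference, recent Django versions (2.2 and later) can declare the trigram operator class directly on a GinIndex in the model's Meta, once the pg_trgm extension has been enabled via a migration. A rough sketch; the Species model and name field are made up for illustration:

    # in a migration: enable the pg_trgm extension once per database
    from django.contrib.postgres.operations import TrigramExtension
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = []
        operations = [TrigramExtension()]

    # models.py
    from django.contrib.postgres.indexes import GinIndex
    from django.db import models

    class Species(models.Model):
        name = models.CharField(max_length=255)

        class Meta:
            indexes = [
                GinIndex(
                    name="species_name_trgm_gin",
                    fields=["name"],
                    opclasses=["gin_trgm_ops"],  # trigram operator class for this column
                )
            ]

Note that a GIN trigram index accelerates trigram lookups (for example the trigram_similar lookup or % filters); ordering the whole table by TrigramSimilarity without any filter will typically still scan.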

Using numba for cosine similarity between a vector and rows in a matrix

Anonymous (unverified) submitted on 2019-12-03 09:14:57
Question: Found this gist using numba for fast computation of cosine similarity.

    import numba
    import numpy as np  # the gist relies on np below

    @numba.jit(target='cpu', nopython=True)
    def fast_cosine(u, v):
        m = u.shape[0]
        udotv = 0
        u_norm = 0
        v_norm = 0
        for i in range(m):
            if np.isnan(u[i]) or np.isnan(v[i]):
                continue
            udotv += u[i] * v[i]
            u_norm += u[i] * u[i]
            v_norm += v[i] * v[i]
        u_norm = np.sqrt(u_norm)
        v_norm = np.sqrt(v_norm)
        if (u_norm == 0) or (v_norm == 0):
            ratio = 1.0
        else:
            ratio = udotv / (u_norm * v_norm)
        return ratio

Results look promising (500 ns vs. only 200 µs without the jit decorator in …
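To address the title question, the usual pattern is to keep the scalar kernel and add a second jitted function that loops over the matrix rows, optionally in parallel with prange. A self-contained sketch, assuming numba and numpy are installed; the array shapes are illustrative:

    import numpy as np
    from numba import njit, prange

    @njit
    def fast_cosine(u, v):
        # cosine similarity of two 1-D arrays, skipping NaN positions
        udotv = 0.0
        u_norm = 0.0
        v_norm = 0.0
        for i in range(u.shape[0]):
            if np.isnan(u[i]) or np.isnan(v[i]):
                continue
            udotv += u[i] * v[i]
            u_norm += u[i] * u[i]
            v_norm += v[i] * v[i]
        u_norm = np.sqrt(u_norm)
        v_norm = np.sqrt(v_norm)
        if u_norm == 0.0 or v_norm == 0.0:
            return 1.0
        return udotv / (u_norm * v_norm)

    @njit(parallel=True)
    def fast_cosine_matrix(vector, matrix):
        """Cosine similarity between one vector and every row of a matrix."""
        out = np.empty(matrix.shape[0])
        for r in prange(matrix.shape[0]):
            out[r] = fast_cosine(vector, matrix[r])
        return out

    scores = fast_cosine_matrix(np.random.rand(64), np.random.rand(1000, 64))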

How do you measure similarity between 2 series of data?

二次信任 submitted on 2019-12-03 08:48:33
I need to find a similarity measurement between two arrays of data. Call the measurement whatever you want: difference, correlation, or anything else. For example:

    1, 2, 3, 4, 5  < Series 1
    2, 3, 4, 5, 6  < Series 2

should be far more similar to each other than these two series:

    1, 2, 3, 4, 5  < Series 1
    1, 1, 5, 8, 7  < Series 2

Any suggestions? Is there source code available for it? You can calculate the sample Pearson product-moment correlation coefficient: "The above formula suggests a convenient single-pass algorithm for calculating sample correlations." Write a loop to calculate …
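For instance, a direct translation of that single-pass formulation (plain Python, no dependencies), applied to the two example pairs above:

    import math

    def pearson(x, y):
        """Sample Pearson correlation via the single-pass sums formulation."""
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxx = sum(v * v for v in x)
        syy = sum(v * v for v in y)
        sxy = sum(a * b for a, b in zip(x, y))
        num = n * sxy - sx * sy
        den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
        return num / den if den else 0.0

    print(pearson([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))  # 1.0: perfectly correlated
    print(pearson([1, 2, 3, 4, 5], [1, 1, 5, 8, 7]))  # roughly 0.91: similar but noisier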

strategies for finding duplicate mailing addresses

落花浮王杯 submitted on 2019-12-03 08:17:45
I'm trying to come up with a method of finding duplicate addresses based on a similarity score. Consider these duplicate addresses:

    addr_1 = '# 3 FAIRMONT LINK SOUTH'
    addr_2 = '3 FAIRMONT LINK S'
    addr_3 = '5703 - 48TH AVE'
    addr_4 = '5703- 48 AVENUE'

I'm planning on applying some string transformations to abbreviate long words (like NORTH -> N) and to remove all spaces, commas, dashes, and pound symbols. Now, having this output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide simple Python code for this?
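One lightweight way to get such a percentage is difflib from the standard library, applied after the normalization described above. A sketch under those assumptions; the abbreviation table is only illustrative and any threshold (say 0.9) is a starting point to tune against known duplicates, not a guaranteed safe cutoff:

    import re
    from difflib import SequenceMatcher

    # illustrative abbreviation table; extend with the rules you actually need
    ABBREV = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
              "AVENUE": "AVE", "STREET": "ST"}

    def normalize(addr):
        addr = re.sub(r"[#,\-]", " ", addr.upper())        # drop pound signs, commas, dashes
        words = [ABBREV.get(w, w) for w in addr.split()]   # abbreviate long words
        return "".join(words)                              # remove all spaces

    def similarity(a, b):
        """Similarity ratio in [0, 1] between two normalized addresses."""
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

    pairs = [('# 3 FAIRMONT LINK SOUTH', '3 FAIRMONT LINK S'),
             ('5703 - 48TH AVE', '5703- 48 AVENUE')]
    for a, b in pairs:
        print(a, '|', b, '->', round(similarity(a, b), 2))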

Effective clustering of a similarity matrix

喜欢而已 submitted on 2019-12-03 07:53:25
My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster the collected texts, and they should end up in meaningful clusters. My approach so far is as follows; my problem is in the clustering. The current software is written in PHP. 1) Similarity: I treat every document as a "bag of words" and convert words into vectors. I use filtering (keep only "real" words), tokenization (split sentences into words), stemming (reduce words to their base form; Porter's stemmer), and pruning (cut off words with too high or too low frequency) as methods for …
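Once the pairwise similarities are computed, one standard clustering route (sketched here in Python with SciPy rather than the PHP codebase described above) is to turn the similarity matrix into a distance matrix and run agglomerative clustering; the toy matrix and the 0.5 cut threshold are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # toy symmetric similarity matrix for 4 documents, values in [0, 1]
    S = np.array([[1.0, 0.9, 0.2, 0.1],
                  [0.9, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.8],
                  [0.1, 0.2, 0.8, 1.0]])

    D = 1.0 - S                                  # turn similarity into a distance
    condensed = squareform(D, checks=False)      # condensed form expected by linkage
    Z = linkage(condensed, method="average")     # average-link agglomerative clustering
    labels = fcluster(Z, t=0.5, criterion="distance")
    print(labels)                                # e.g. [1 1 2 2]: docs 0,1 and 2,3 group together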

How to normalize similarity measures from WordNet

一笑奈何 submitted on 2019-12-03 07:44:29
I am trying to calculate the semantic similarity between two words. I am using WordNet-based similarity measures, i.e. the Resnik measure (RES), Lin measure (LIN), Jiang and Conrath measure (JNC), and Banerjee and Pedersen measure (BNP). To do that, I am using NLTK and WordNet 3.0. Next, I want to combine the similarity values obtained from the different measures. To do that I need to normalize the similarity values, as some measures give values between 0 and 1, while others give values greater than 1. So, my question is: how do I normalize the similarity values obtained from the different measures? Extra detail on what …
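One pragmatic approach for the unbounded measures (e.g. Resnik, whose score is an information-content value; Lin is already in [0, 1]) is to rescale each measure's scores over the set of word pairs being compared, for instance with min-max scaling, so that every measure lands in [0, 1] before combining. A rough sketch with NLTK, assuming the wordnet and wordnet_ic corpora are downloaded and using only the first noun synset of each word; the word pairs are just examples:

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')

    pairs = [('car', 'automobile'), ('car', 'bicycle'), ('car', 'banana')]

    def res(a, b):
        # Resnik similarity between the first noun synsets of a and b
        s1, s2 = wn.synsets(a, 'n')[0], wn.synsets(b, 'n')[0]
        return s1.res_similarity(s2, brown_ic)

    raw = [res(a, b) for a, b in pairs]

    # min-max rescale the unbounded scores over the candidate set to [0, 1]
    lo, hi = min(raw), max(raw)
    normalized = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in raw]
    print(list(zip(pairs, normalized)))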

Detecting image equality at different resolutions

左心房为你撑大大i submitted on 2019-12-03 07:34:40
Question: I'm trying to build a script to go through my original, high-res photos and replace the old, low-res ones I uploaded to Flickr before I had a pro account. For many of them I can just use EXIF info such as the date taken to determine a match. But some are really old, and either the original file didn't have EXIF info or it got clobbered by whatever stupid resizing software I used at the time. So, unable to rely on metadata, I'm forced to resort to the content itself. The problem is that the …
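A simple content-based test is to downscale both files to the same small grayscale thumbnail and compare pixels, which sidesteps the resolution difference. A sketch with Pillow and NumPy, where the file names are placeholders and the tolerance is something to calibrate on pairs known to match:

    import numpy as np
    from PIL import Image

    def same_picture(path_a, path_b, size=(32, 32), tolerance=10.0):
        """Treat two files as the same photo if their small grayscale
        thumbnails differ by less than `tolerance` gray levels on average."""
        a = np.asarray(Image.open(path_a).convert("L").resize(size), dtype=np.float64)
        b = np.asarray(Image.open(path_b).convert("L").resize(size), dtype=np.float64)
        return np.abs(a - b).mean() < tolerance

    # hypothetical files: a full-resolution original and its old low-res upload
    print(same_picture("original.jpg", "flickr_lowres.jpg"))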

Cosine Similarity of Vectors of different lengths?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-03 06:28:47
I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf-idf for some documents, but now when I try to calculate the cosine similarity between two of these documents I get a traceback saying:

    #len(u)==201, len(v)==246
    cosine_distance(u, v)
    ValueError: objects are not aligned

    #this works though:
    cosine_distance(u[:200], v[:200])
    >> 0.52230249969265641

Is slicing the vectors so that len(u) == len(v) the right approach? I would think that cosine similarity would work with vectors of different lengths. I'm using this function:

    def cosine_distance(u, v):
        """
        Returns the …
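The underlying issue is usually that the two vectors were built over different vocabularies, so their dimensions don't correspond and slicing simply discards terms. One common fix is to represent both documents over the union vocabulary so the vectors align. A small sketch using sparse term-to-weight dicts (the weights shown are made up, standing in for tf-idf scores):

    import math

    def cosine_similarity(doc_a, doc_b):
        """Cosine similarity of two sparse term->weight dicts over a shared vocabulary."""
        vocab = set(doc_a) | set(doc_b)            # union vocabulary; missing terms weigh 0
        dot = sum(doc_a.get(t, 0.0) * doc_b.get(t, 0.0) for t in vocab)
        norm_a = math.sqrt(sum(w * w for w in doc_a.values()))
        norm_b = math.sqrt(sum(w * w for w in doc_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    u = {"cat": 0.8, "pet": 0.3, "whiskers": 0.5}  # hypothetical tf-idf weights
    v = {"cat": 0.6, "pet": 0.4, "dog": 0.7}
    print(cosine_similarity(u, v))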

What FFT descriptors should be used as features to implement a classification or clustering algorithm?

我的未来我决定 submitted on 2019-12-03 04:36:30
Question: I have some sampled geographical trajectories to analyze, and I calculated the histogram of the data in the spatial and temporal dimensions, which yielded a time-domain feature for each spatial element. I want to perform a discrete FFT to transform the time-domain feature into a frequency-domain feature (which I think may be more robust), and then run some classification or clustering algorithms. But I'm not sure what descriptors to use as frequency-domain features, since there are …
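One common descriptor is the magnitude of the first few FFT coefficients: discarding phase makes the feature insensitive to where the pattern falls in the time window, and the low-frequency magnitudes capture the dominant periodicities. A sketch with NumPy, where k and the sample counts are illustrative:

    import numpy as np

    def fft_descriptor(series, k=8):
        """Return the magnitudes of the first k rFFT coefficients,
        normalized so the descriptor is independent of overall scale."""
        spectrum = np.abs(np.fft.rfft(np.asarray(series, dtype=np.float64)))
        spectrum = spectrum[:k]
        norm = np.linalg.norm(spectrum)
        return spectrum / norm if norm else spectrum

    # hypothetical hourly counts for one spatial cell over two days
    counts = [3, 1, 0, 0, 2, 8, 15, 20, 12, 9, 7, 6] * 4
    features = fft_descriptor(counts)
    print(features)        # fixed-length vectors like this can feed k-means, DBSCAN, etc.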