similarity

Similarity between two data sets or arrays

Submitted by 自闭症网瘾萝莉.ら on 2019-12-05 08:01:14
Let's say I have a dataset that looks like this: {A:1, B:3, C:6, D:6}. I also have a list of other sets to compare my specific set against: {A:1, B:3, C:6, D:6}, {A:2, B:3, C:6, D:6}, {A:99, B:3, C:6, D:6}, {A:5, B:1, C:6, D:9}, {A:4, B:2, C:2, D:6}. My entries could be visualized as a table with four columns: A, B, C, and D. How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away. I am thinking of calculating a simple delta, for example Abs(a1 - a2) + Abs(b1 - b2) + etc., and perhaps get a correlation
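The "simple delta" the question describes is the L1 (Manhattan) distance; a minimal sketch, using the example data from the question, that ranks the candidate rows by it:

```python
# Rank candidate rows by L1 (Manhattan) distance to a target set.
# Smaller distance = more similar; data is taken from the question.

target = {"A": 1, "B": 3, "C": 6, "D": 6}

candidates = [
    {"A": 1,  "B": 3, "C": 6, "D": 6},
    {"A": 2,  "B": 3, "C": 6, "D": 6},
    {"A": 99, "B": 3, "C": 6, "D": 6},
    {"A": 5,  "B": 1, "C": 6, "D": 9},
    {"A": 4,  "B": 2, "C": 2, "D": 6},
]

def l1_distance(row, other):
    """Sum of absolute per-column differences (the 'simple delta')."""
    return sum(abs(row[k] - other[k]) for k in row)

ranked = sorted(candidates, key=lambda row: l1_distance(target, row))
print(ranked[0])  # row 1: a perfect match (distance 0)
```

Rows 1, 2 and 3 come out in the order the question expects; note L1 treats every column as equally important, so columns on different scales would need normalizing first.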

Similarity matrix -> feature vectors algorithm?

Submitted by 烂漫一生 on 2019-12-05 07:42:16
If we have a set of M words, and know the similarity of the meaning of each pair of words in advance (we have an M x M matrix of similarities), which algorithm can we use to make one k-dimensional bit vector for each word, so that each pair of words can be compared just by comparing their vectors (e.g. taking the absolute difference of the vectors)? I don't know what this particular problem is called. If I knew, it would be much easier to find it among a bunch of algorithms with similar descriptions that do something else. Additional observation: I think this algorithm would have to produce one, in
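The standard name for this problem is multidimensional scaling (MDS), which embeds items in a k-dimensional space so that vector distances approximate the given dissimilarities. A much cruder stdlib-only sketch of the same idea, using k "landmark" words as reference points (the words and similarity values below are invented for illustration):

```python
# Landmark (pivot) embedding sketch: turn an M x M similarity matrix into
# k-dimensional vectors by using k words as reference points. The principled
# technique here is multidimensional scaling (MDS); this is an approximation.

words = ["cat", "dog", "car", "truck"]
# sim[i][j] in [0, 1]; symmetric, sim[i][i] == 1 (made-up values)
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

def embed(sim, landmark_idx):
    """Each word's vector = its similarities to the chosen landmark words."""
    return [[row[j] for j in landmark_idx] for row in sim]

vectors = embed(sim, landmark_idx=[0, 2])  # k = 2 landmarks: "cat", "car"

def l1(u, v):
    """Absolute difference of vectors, as the question suggests."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(vectors)
print(l1(vectors[0], vectors[1]), l1(vectors[0], vectors[2]))
```

With this embedding, "cat" and "dog" end up with nearby vectors while "cat" and "car" do not, which is the property the question asks for; proper MDS would additionally guarantee the best k-dimensional fit in a least-squares sense.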

Algorithm for finding similar images using an index

Submitted by 浪子不回头ぞ on 2019-12-05 03:25:51
There are some surprisingly good image-comparison tools which find similar images even if they are not exactly the same (e.g. changes in size, wallpaper, brightness/contrast). I have some example applications here: Unique Filer 1.4 (shareware): https://web.archive.org/web/20010309014927/http://uniquefiler.com/ Fast Duplicate File Finder (freeware): http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm Visual Similarity Duplicate Image Finder (payware): http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm Duplicate Checker (payware): http:/
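One common technique behind such tools is a perceptual "average hash" (aHash): shrink the image, threshold each pixel against the mean brightness, and compare the resulting bit strings by Hamming distance. A toy sketch with images as plain 2D lists of grayscale values (real tools would load pixels via PIL or OpenCV and shrink to something like 8x8):

```python
# Average-hash sketch: threshold pixels against the mean, compare by
# Hamming distance. Small brightness changes leave the hash unchanged.

def average_hash(pixels):
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 200], [220, 30]]
img_b = [[12, 190], [210, 40]]   # slightly different brightness: same hash
img_c = [[200, 10], [30, 220]]   # inverted layout: very different hash

print(hamming(average_hash(img_a), average_hash(img_b)))
print(hamming(average_hash(img_a), average_hash(img_c)))
```

Because each bit only records "brighter or darker than average", the hash is invariant to uniform brightness/contrast shifts and, after downscaling, to resizing, which is exactly the robustness these duplicate finders advertise.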

Detect duplicated/similar text among large datasets?

Submitted by 断了今生、忘了曾经 on 2019-12-05 02:37:55
Question: I have a large database with thousands of records. Every time a user posts his information I need to know whether there is already the same or a similar record. Are there any algorithms or open-source implementations to solve this problem? We're using Chinese, and 'similar' means the records have mostly identical content, perhaps 80%-100% the same. Each record is not too big, about 2k-6k bytes. Answer 1: http://d3s.mff.cuni.cz/~holub/sw/shash/ http://matpalm.com/resemblance/simhash/ Answer 2: This
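The second link in Answer 1 is SimHash, which fits this use case well: similar documents get fingerprints that differ in only a few bits, so near-duplicates can be found by Hamming distance on small integers instead of comparing full texts. A minimal sketch (the tokenization by whitespace is illustrative; for Chinese you would segment into words or character n-grams first):

```python
import hashlib

# SimHash sketch: each token votes on every bit of the fingerprint, so
# documents sharing most tokens end up with nearly identical fingerprints.

def token_hash(token):
    # Stable 32-bit hash per token (Python's built-in hash() is salted per run).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")

def simhash(tokens, bits=32):
    counts = [0] * bits
    for tok in tokens:
        h = token_hash(tok)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc1 = "we present a method for detecting near duplicate records in a large database".split()
doc2 = "we present a method for detecting near duplicate records in a huge database".split()
doc3 = "the quick brown fox jumps over the lazy dog".split()

print(hamming(simhash(doc1), simhash(doc2)))  # usually small: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))  # usually much larger: unrelated
```

In production you would store the fingerprints and index them so that records within a small Hamming radius can be retrieved without scanning the whole table.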

hash function to index similar text

Submitted by 独自空忆成欢 on 2019-12-05 02:21:56
Question: I'm searching for a kind of hash function to index similar text. So for example, if we have two very long texts called "A" and "B", where A and B differ only slightly, then the hash function (called H) applied to A and B should return the same number; that is, H(A) = H(B) where A and B are similar texts. I tried "DoubleMetaphone" (I use Italian-language text), but I saw that it depends very strongly on the string prefixes. For example: A = "This is the very long text that I want to hash" B = "This
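The family of techniques being asked for is locality-sensitive hashing. A minimal sketch of one member, MinHash over character shingles: similar texts share most of their shingles, so their minimum hash values frequently coincide, and the fraction of matching signature positions estimates the Jaccard similarity. The function names and parameters below are illustrative, not a standard API:

```python
import hashlib

# MinHash sketch: estimate text similarity from compact signatures.

def shingles(text, n=4):
    """Overlapping character n-grams, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=64):
    """One min value per seeded hash function; similar texts agree often."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(a, b):
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

A = "This is the very long text that I want to hash"
B = "This is the very long text that I want to hash!"
C = "Completely different content with nothing shared"
print(estimated_similarity(A, B), estimated_similarity(A, C))
```

Unlike DoubleMetaphone, this is insensitive to where in the string the differences occur, since shingles from every position contribute equally. A strict H(A) = H(B) for all similar pairs is impossible in general; LSH instead makes equal hashes highly probable for similar inputs.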

Percentage Similarity Analysis (Java)

Submitted by 拟墨画扇 on 2019-12-05 01:55:26
Question: I have the following situation: String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide Web"; Is there any idea or standard algorithm to calculate the percentage of similarity? For instance, in the case above, the similarity estimated by eye should be 90%+. My idea is to tokenize both Strings and compare the number of tokens matched, something like (7 tokens / 10 tokens) * 100.
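The token-matching idea works, but the percentage depends heavily on what you divide by. Two standard choices, sketched in Python for brevity (the question is in Java, but the logic transfers directly): the overlap coefficient (shared tokens over the smaller set) and Jaccard similarity (shared tokens over the union):

```python
# Token-overlap sketch using the two strings from the question.

def overlap_percent(a, b):
    """Shared tokens / smaller token set: high when one string contains the other."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 100.0 * len(sa & sb) / min(len(sa), len(sb))

def jaccard_percent(a, b):
    """Shared tokens / union of tokens: penalizes extra words in either string."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 100.0 * len(sa & sb) / len(sa | sb)

a = ("A Web crawler is a computer program that browses "
     "the World Wide Web internet automatically")
b = "Web Crawler computer program browses the World Wide Web"

print(overlap_percent(a, b), jaccard_percent(a, b))
```

For this pair the overlap coefficient gives 100% (every token of b appears in a, matching the "90%++" intuition), while Jaccard gives about 62% because a contains extra words; pick the measure that matches what "similar" should mean for the application.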

n-gram name analysis in non-english languages (CJK, etc)

Submitted by て烟熏妆下的殇ゞ on 2019-12-05 01:24:37
Question: I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not
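The blocking step described above can be sketched in a few lines: bin record indices by the character bigrams of the name, so only records that share at least one bin need the expensive Jaro-Winkler comparison. The names here are illustrative; for CJK input the same code works because Python string slicing operates on code points, not bytes:

```python
from collections import defaultdict

# Blocking sketch: bigram bins cut the comparison space well below O(n^2).

def bigrams(name):
    name = name.lower().replace(" ", "")
    return {name[i:i + 2] for i in range(len(name) - 1)}

def block(records):
    """Map each bigram to the set of record indices containing it."""
    bins = defaultdict(set)
    for idx, name in enumerate(records):
        for bg in bigrams(name):
            bins[bg].add(idx)
    return bins

records = ["Jon Smith", "John Smith", "Akira Yamada"]
bins = block(records)

# "Jon Smith" and "John Smith" share many bigram bins, so they will be
# compared in step two; "Akira Yamada" shares none with them.
shared = {bg for bg in bins if {0, 1} <= bins[bg]}
print(sorted(shared))
```

Candidate pairs are then every pair of indices co-occurring in some bin, deduplicated; only those pairs go on to the Jaro-Winkler step.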

OpenCV || contour similarity

Submitted by 旧时模样 on 2019-12-05 00:11:29
Question: As you can see in the image, I would like to compare these contours. I need my OpenCV program to return TRUE when these contours are compared to each other. They all kind of look the same, but as you can see they are not exactly the same. The result you see here is what the function findContours returned. So I am looking for the right approach to measuring the similarity of these contours. Any help would be amazing. Thank you very much in advance. Answer 1: Take a look at cvMatchShapes() (which

Calculate similarity between list of words

Submitted by 不羁岁月 on 2019-12-04 22:05:20
I want to calculate the similarity between two lists of words. For example: ['email','user','this','email','address','customer'] is similar to this list: ['email','mail','address','netmail']. I want it to have a higher percentage of similarity than another list, for example ['address','ip','network'], even though 'address' exists in that list. Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot: list_A = ['email','user','this','email','address','customer'] list_B = ['email','mail','address','netmail'] For the two lists above, we will find the cosine similarity between
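The cosine-similarity approach the answer starts to describe can be sketched as follows: build term-count vectors over the union vocabulary and take the cosine of the angle between them. Note this only rewards exact-token overlap; recognizing that 'mail' and 'netmail' are related to 'email' would additionally need a lexical resource such as WordNet:

```python
import math
from collections import Counter

# Cosine similarity over bag-of-words count vectors, using the question's lists.

def cosine(list_a, list_b):
    ca, cb = Counter(list_a), Counter(list_b)
    vocab = set(ca) | set(cb)
    dot = sum(ca[w] * cb[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(cosine(list_A, list_B), cosine(list_A, list_C))
```

Even on exact tokens alone, list_B scores noticeably higher against list_A than list_C does (it shares 'email' twice-weighted plus 'address', versus 'address' alone), which is the ordering the question asks for.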

Best way to rank sentences based on similarity from a set of Documents

Submitted by 二次信任 on 2019-12-04 21:44:13
I want to know the best way to rank sentences based on similarity from a set of documents. For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the sentence with the FIRST rank is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance. I'll cover the basics of textual document matching... Most document similarity measures work on a word basis, rather than sentence structure. The first
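The word-level approach the answer begins to outline can be sketched as: score each sentence of the primary document by its average cosine similarity (over bag-of-words vectors) against the other documents, then sort descending. The documents and sentences below are toy examples; a fuller version would add TF-IDF weighting so common words count less:

```python
import math
from collections import Counter

# Rank the primary document's sentences by average similarity to other docs.

def bow(text):
    return Counter(text.lower().split())

def cosine(ca, cb):
    dot = sum(ca[w] * cb[w] for w in set(ca) | set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

primary = ["the cat sat on the mat", "stock prices fell sharply today"]
others = [
    "a cat sat on a mat yesterday",
    "my cat likes the mat",
    "weather was sunny",
]

def rank(primary_sentences, other_docs):
    def score(sentence):
        sv = bow(sentence)
        return sum(cosine(sv, bow(d)) for d in other_docs) / len(other_docs)
    return sorted(primary_sentences, key=score, reverse=True)

print(rank(primary, others))
```

Here the cat sentence outranks the stock sentence because it shares vocabulary with two of the three other documents; averaging over all other documents implements the "most similar across all 5 documents" requirement from the question.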