similarity

Similarity between two data sets or arrays

Submitted by 自闭症网瘾萝莉.ら on 2019-12-05 08:01:14
Let's say I have a dataset that looks like this: {A:1, B:3, C:6, D:6}. I also have a list of other sets to compare my specific set against: {A:1, B:3, C:6, D:6}, {A:2, B:3, C:6, D:6}, {A:99, B:3, C:6, D:6}, {A:5, B:1, C:6, D:9}, {A:4, B:2, C:2, D:6}. My entries could be visualized as a table with four columns: A, B, C, and D. How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away. I am thinking of calculating a simple delta, for example Abs(a1 - a2) + Abs(b1 - b2) + etc., and perhaps get a correlation
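The "simple delta" the question describes is the L1 (Manhattan) distance; a minimal sketch, using the example data from the question, that ranks the candidate rows by it:

```python
# Rank candidate rows by L1 (Manhattan) distance to a target set.
# Smaller distance = more similar; data is taken from the question.

target = {"A": 1, "B": 3, "C": 6, "D": 6}

candidates = [
    {"A": 1,  "B": 3, "C": 6, "D": 6},
    {"A": 2,  "B": 3, "C": 6, "D": 6},
    {"A": 99, "B": 3, "C": 6, "D": 6},
    {"A": 5,  "B": 1, "C": 6, "D": 9},
    {"A": 4,  "B": 2, "C": 2, "D": 6},
]

def l1_distance(row, other):
    """Sum of absolute per-column differences (the 'simple delta')."""
    return sum(abs(row[k] - other[k]) for k in row)

ranked = sorted(candidates, key=lambda row: l1_distance(target, row))
print(ranked[0])  # row 1: a perfect match (distance 0)
```

Rows 1, 2 and 3 come out in the order the question expects; note L1 treats every column as equally important, so columns on different scales would need normalizing first.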

Similarity matrix -> feature vectors algorithm?

Submitted by 烂漫一生 on 2019-12-05 07:42:16
If we have a set of M words, and know the similarity of the meaning of each pair of words in advance (we have an M x M matrix of similarities), which algorithm can we use to make one k-dimensional bit vector for each word, so that each pair of words can be compared just by comparing their vectors (e.g. taking the absolute difference of the vectors)? I don't know what this particular problem is called. If I knew, it would be much easier to find it among a bunch of algorithms with similar descriptions that do something else. Additional observation: I think this algorithm would have to produce one, in
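The standard name for this problem is multidimensional scaling (MDS), which embeds items in a k-dimensional space so that vector distances approximate the given dissimilarities. A much cruder stdlib-only sketch of the same idea, using k "landmark" words as reference points (the words and similarity values below are invented for illustration):

```python
# Landmark (pivot) embedding sketch: turn an M x M similarity matrix into
# k-dimensional vectors by using k words as reference points. The principled
# technique here is multidimensional scaling (MDS); this is an approximation.

words = ["cat", "dog", "car", "truck"]
# sim[i][j] in [0, 1]; symmetric, sim[i][i] == 1 (made-up values)
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

def embed(sim, landmark_idx):
    """Each word's vector = its similarities to the chosen landmark words."""
    return [[row[j] for j in landmark_idx] for row in sim]

vectors = embed(sim, landmark_idx=[0, 2])  # k = 2 landmarks: "cat", "car"

def l1(u, v):
    """Absolute difference of vectors, as the question suggests."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(vectors)
print(l1(vectors[0], vectors[1]), l1(vectors[0], vectors[2]))
```

With this embedding, "cat" and "dog" end up with nearby vectors while "cat" and "car" do not, which is the property the question asks for; proper MDS would additionally guarantee the best k-dimensional fit in a least-squares sense.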

Algorithm for finding similar images using an index

Submitted by 浪子不回头ぞ on 2019-12-05 03:25:51
There are some surprisingly good image-comparison tools which find similar images even if they are not exactly the same (e.g. changes in size, wallpaper, brightness/contrast). I have some example applications here: Unique Filer 1.4 (shareware): https://web.archive.org/web/20010309014927/http://uniquefiler.com/ Fast Duplicate File Finder (freeware): http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm Visual Similarity Duplicate Image Finder (payware): http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm Duplicate Checker (payware): http:/
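One common technique behind such tools is a perceptual "average hash" (aHash): shrink the image, threshold each pixel against the mean brightness, and compare the resulting bit strings by Hamming distance. A toy sketch with images as plain 2D lists of grayscale values (real tools would load pixels via PIL or OpenCV and shrink to something like 8x8):

```python
# Average-hash sketch: threshold pixels against the mean, compare by
# Hamming distance. Small brightness changes leave the hash unchanged.

def average_hash(pixels):
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

img_a = [[10, 200], [220, 30]]
img_b = [[12, 190], [210, 40]]   # slightly different brightness: same hash
img_c = [[200, 10], [30, 220]]   # inverted layout: very different hash

print(hamming(average_hash(img_a), average_hash(img_b)))
print(hamming(average_hash(img_a), average_hash(img_c)))
```

Because each bit only records "brighter or darker than average", the hash is invariant to uniform brightness/contrast shifts and, after downscaling, to resizing, which is exactly the robustness these duplicate finders advertise.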

Detect duplicated/similar text among large datasets?

Submitted by 断了今生、忘了曾经 on 2019-12-05 02:37:55
Question: I have a large database with thousands of records. Every time a user posts his information I need to know whether there is already the same or a similar record. Are there any algorithms or open-source implementations to solve this problem? We're using Chinese, and 'similar' means the records have mostly identical content, perhaps 80%-100% the same. Each record is not too big, about 2k-6k bytes. Answer 1: http://d3s.mff.cuni.cz/~holub/sw/shash/ http://matpalm.com/resemblance/simhash/ Answer 2: This
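The second link in Answer 1 is SimHash, which fits this use case well: similar documents get fingerprints that differ in only a few bits, so near-duplicates can be found by Hamming distance on small integers instead of comparing full texts. A minimal sketch (the tokenization by whitespace is illustrative; for Chinese you would segment into words or character n-grams first):

```python
import hashlib

# SimHash sketch: each token votes on every bit of the fingerprint, so
# documents sharing most tokens end up with nearly identical fingerprints.

def token_hash(token):
    # Stable 32-bit hash per token (Python's built-in hash() is salted per run).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")

def simhash(tokens, bits=32):
    counts = [0] * bits
    for tok in tokens:
        h = token_hash(tok)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc1 = "we present a method for detecting near duplicate records in a large database".split()
doc2 = "we present a method for detecting near duplicate records in a huge database".split()
doc3 = "the quick brown fox jumps over the lazy dog".split()

print(hamming(simhash(doc1), simhash(doc2)))  # usually small: near-duplicates
print(hamming(simhash(doc1), simhash(doc3)))  # usually much larger: unrelated
```

In production you would store the fingerprints and index them so that records within a small Hamming radius can be retrieved without scanning the whole table.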

hash function to index similar text

Submitted by 独自空忆成欢 on 2019-12-05 02:21:56
Question: I'm searching for a kind of hash function to index similar text. So for example, if we have two very long texts called "A" and "B", where A and B differ only slightly, then the hash function (called H) applied to A and B should return the same number; that is, H(A) = H(B) where A and B are similar texts. I tried "DoubleMetaphone" (I use Italian-language text), but I saw that it depends very strongly on the string prefixes. For example: A = "This is the very long text that I want to hash" B = "This
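The family of techniques being asked for is locality-sensitive hashing. A minimal sketch of one member, MinHash over character shingles: similar texts share most of their shingles, so their minimum hash values frequently coincide, and the fraction of matching signature positions estimates the Jaccard similarity. The function names and parameters below are illustrative, not a standard API:

```python
import hashlib

# MinHash sketch: estimate text similarity from compact signatures.

def shingles(text, n=4):
    """Overlapping character n-grams, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=64):
    """One min value per seeded hash function; similar texts agree often."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_similarity(a, b):
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

A = "This is the very long text that I want to hash"
B = "This is the very long text that I want to hash!"
C = "Completely different content with nothing shared"
print(estimated_similarity(A, B), estimated_similarity(A, C))
```

Unlike DoubleMetaphone, this is insensitive to where in the string the differences occur, since shingles from every position contribute equally. A strict H(A) = H(B) for all similar pairs is impossible in general; LSH instead makes equal hashes highly probable for similar inputs.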

Percentage Similarity Analysis (Java)

Submitted by 拟墨画扇 on 2019-12-05 01:55:26
Question: I have the following situation: String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide Web"; Is there any idea or standard algorithm to calculate the percentage of similarity? For instance, in the case above, the similarity estimated by eye should be 90%+. My idea is to tokenize both Strings and compare the number of tokens matched, something like (7 tokens / 10 tokens) * 100.
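The token-matching idea works, but the percentage depends heavily on what you divide by. Two standard choices, sketched in Python for brevity (the question is in Java, but the logic transfers directly): the overlap coefficient (shared tokens over the smaller set) and Jaccard similarity (shared tokens over the union):

```python
# Token-overlap sketch using the two strings from the question.

def overlap_percent(a, b):
    """Shared tokens / smaller token set: high when one string contains the other."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 100.0 * len(sa & sb) / min(len(sa), len(sb))

def jaccard_percent(a, b):
    """Shared tokens / union of tokens: penalizes extra words in either string."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 100.0 * len(sa & sb) / len(sa | sb)

a = ("A Web crawler is a computer program that browses "
     "the World Wide Web internet automatically")
b = "Web Crawler computer program browses the World Wide Web"

print(overlap_percent(a, b), jaccard_percent(a, b))
```

For this pair the overlap coefficient gives 100% (every token of b appears in a, matching the "90%++" intuition), while Jaccard gives about 62% because a contains extra words; pick the measure that matches what "similar" should mean for the application.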

n-gram name analysis in non-english languages (CJK, etc)

Submitted by て烟熏妆下的殇ゞ on 2019-12-05 01:24:37
Question: I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not
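The blocking step described above can be sketched in a few lines: bin record indices by the character bigrams of the name, so only records that share at least one bin need the expensive Jaro-Winkler comparison. The names here are illustrative; for CJK input the same code works because Python string slicing operates on code points, not bytes:

```python
from collections import defaultdict

# Blocking sketch: bigram bins cut the comparison space well below O(n^2).

def bigrams(name):
    name = name.lower().replace(" ", "")
    return {name[i:i + 2] for i in range(len(name) - 1)}

def block(records):
    """Map each bigram to the set of record indices containing it."""
    bins = defaultdict(set)
    for idx, name in enumerate(records):
        for bg in bigrams(name):
            bins[bg].add(idx)
    return bins

records = ["Jon Smith", "John Smith", "Akira Yamada"]
bins = block(records)

# "Jon Smith" and "John Smith" share many bigram bins, so they will be
# compared in step two; "Akira Yamada" shares none with them.
shared = {bg for bg in bins if {0, 1} <= bins[bg]}
print(sorted(shared))
```

Candidate pairs are then every pair of indices co-occurring in some bin, deduplicated; only those pairs go on to the Jaro-Winkler step.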

OpenCV || contour similarity

Submitted by 旧时模样 on 2019-12-05 00:11:29
Question: As you can see in the image, I would like to compare these contours. I need my OpenCV program to return TRUE when these contours are compared to each other. They all kind of look the same, but as you can see they are not exactly the same. The result you see here is what the function findContours returned. So I am looking for the right approach to measuring the similarity of these contours. Any help would be amazing. Thank you very much in advance. Answer 1: Take a look at cvMatchShapes() (which

Calculate similarity between list of words

Submitted by 不羁岁月 on 2019-12-04 22:05:20
I want to calculate the similarity between two lists of words. For example: ['email','user','this','email','address','customer'] is similar to this list: ['email','mail','address','netmail']. I want it to have a higher percentage of similarity than another list, for example ['address','ip','network'], even though 'address' exists in that list. Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot: list_A = ['email','user','this','email','address','customer'] list_B = ['email','mail','address','netmail'] For the two lists above, we will find the cosine similarity between
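The cosine-similarity approach the answer starts to describe can be sketched as follows: build term-count vectors over the union vocabulary and take the cosine of the angle between them. Note this only rewards exact-token overlap; recognizing that 'mail' and 'netmail' are related to 'email' would additionally need a lexical resource such as WordNet:

```python
import math
from collections import Counter

# Cosine similarity over bag-of-words count vectors, using the question's lists.

def cosine(list_a, list_b):
    ca, cb = Counter(list_a), Counter(list_b)
    vocab = set(ca) | set(cb)
    dot = sum(ca[w] * cb[w] for w in vocab)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(cosine(list_A, list_B), cosine(list_A, list_C))
```

Even on exact tokens alone, list_B scores noticeably higher against list_A than list_C does (it shares 'email' twice-weighted plus 'address', versus 'address' alone), which is the ordering the question asks for.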

Best way to rank sentences based on similarity from a set of Documents

Submitted by 二次信任 on 2019-12-04 21:44:13
I want to know the best way to rank sentences based on similarity from a set of documents. For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked such that the sentence with the FIRST rank is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance. I'll cover the basics of textual document matching... Most document similarity measures work on a word basis, rather than sentence structure. The first
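The word-level approach the answer begins to outline can be sketched as: score each sentence of the primary document by its average cosine similarity (over bag-of-words vectors) against the other documents, then sort descending. The documents and sentences below are toy examples; a fuller version would add TF-IDF weighting so common words count less:

```python
import math
from collections import Counter

# Rank the primary document's sentences by average similarity to other docs.

def bow(text):
    return Counter(text.lower().split())

def cosine(ca, cb):
    dot = sum(ca[w] * cb[w] for w in set(ca) | set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

primary = ["the cat sat on the mat", "stock prices fell sharply today"]
others = [
    "a cat sat on a mat yesterday",
    "my cat likes the mat",
    "weather was sunny",
]

def rank(primary_sentences, other_docs):
    def score(sentence):
        sv = bow(sentence)
        return sum(cosine(sv, bow(d)) for d in other_docs) / len(other_docs)
    return sorted(primary_sentences, key=score, reverse=True)

print(rank(primary, others))
```

Here the cat sentence outranks the stock sentence because it shares vocabulary with two of the three other documents; averaging over all other documents implements the "most similar across all 5 documents" requirement from the question.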