similarity | 易学教程

Cosine Similarity

Question: I was reading and came across a formula for cosine similarity. I thought this looked interesting, so I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix: M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]] Here each entry is the rating that user u has given to item i, found at row u and column i. I want to calculate this cosine similarity for this matrix between items (columns). This should yield a 5 x 5
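A minimal sketch of item-to-item cosine similarity for the matrix in the question. With users as rows and items as columns, comparing items means comparing columns, which gives a 5 x 5 result. The function name `item_cosine_similarity` is made up for illustration, and this is plain cosine similarity; the formula the asker saw is not preserved in the excerpt and may differ (for example, by mean-centering ratings).

```python
import numpy as np

# Ratings matrix from the question: rows are users, columns are items.
M = np.array([[2, 3, 4, 1, 0],
              [0, 0, 0, 0, 5],
              [5, 4, 3, 0, 0],
              [1, 1, 1, 1, 1]], dtype=float)

def item_cosine_similarity(ratings):
    """Cosine similarity between every pair of columns (items)."""
    dots = ratings.T @ ratings               # pairwise dot products of columns
    norms = np.linalg.norm(ratings, axis=0)  # Euclidean norm of each column
    denom = np.outer(norms, norms)
    sim = np.zeros_like(dots)
    # An item with no ratings has zero norm; leave its similarities at 0.
    np.divide(dots, denom, out=sim, where=denom > 0)
    return sim

S = item_cosine_similarity(M)
print(S.shape)  # (5, 5)
```

Passing `M.T` instead would compare users rather than items and give a 4 x 4 matrix.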

Detect duplicated/similar text among large datasets?

I have a large database with thousands of records. Every time a user posts their information, I need to know whether the same or a similar record already exists. Are there any algorithms or open source implementations that solve this problem? We're using Chinese, and "similar" means the records have mostly identical content, roughly 80%-100% the same. Each record will not be too big, about 2k-6k bytes. http://d3s.mff.cuni.cz/~holub/sw/shash/ http://matpalm.com/resemblance/simhash/ This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your
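The second link above points at simhash, so here is a rough sketch of that idea for Chinese text. It uses character trigrams as features because Chinese has no spaces between words. The helper names, the use of MD5 as the feature hash, and the 64-bit width are all assumptions for illustration, not taken from either linked project.

```python
import hashlib

def char_ngrams(text, n=3):
    """Overlapping character n-grams; no word segmentation is needed."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def simhash(text, n=3, bits=64):
    """Fingerprint in which similar texts tend to differ in only a few bits."""
    counts = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

fingerprint = simhash("今天天气很好，我们去公园散步吧。")
```

Two records would be treated as near-duplicates when the Hamming distance between their fingerprints is below some threshold; the threshold has to be tuned on real data to match the 80%-100% notion of similarity.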

Cosine Similarity of Vectors of different lengths?

Question: I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the cosine similarity between two of these documents I get a traceback saying:
    # len(u) == 201, len(v) == 246
    cosine_distance(u, v)
    ValueError: objects are not aligned
    # this works though:
    cosine_distance(u[:200], v[:200])
    >> 0.52230249969265641
Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with
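Cosine similarity needs both vectors to live in the same space, so position k must refer to the same term in both documents. Slicing breaks that correspondence. A small sketch of one way around it, assuming each document's TF-IDF weights are stored as a `{term: weight}` dict (the asker's actual data layout is not shown, and the function names here are hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Terms missing from one document contribute 0 to the dot product.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def to_dense(doc, vocabulary):
    """Project a document onto a shared, ordered vocabulary."""
    return [doc.get(term, 0.0) for term in vocabulary]

doc_a = {"cat": 1.0, "dog": 2.0}
doc_b = {"dog": 2.0, "fish": 1.0, "bird": 3.0}
vocabulary = sorted(set(doc_a) | set(doc_b))

similarity = cosine_similarity(doc_a, doc_b)
dense_a = to_dense(doc_a, vocabulary)  # same length as the dense form of doc_b
```

If dense arrays are required, `to_dense` with the union vocabulary gives equal-length vectors whose positions line up term by term.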

n-gram name analysis in non-english languages (CJK, etc)

I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on n-grams AND initials present in the name. Second, all the records per bin are compared using Jaro-Winkler to get a measure of the likelihood of their representing the same person. My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word
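One way to sidestep word boundaries is to block on character n-grams, which work the same for CJK and Latin names. The sketch below is an illustration under that assumption, not the asker's pipeline: the function names are invented, whitespace is simply dropped, and the Jaro-Winkler comparison step is left out.

```python
from collections import defaultdict

def char_ngrams(name, n=2):
    """Set of character n-grams of a name, ignoring whitespace."""
    s = "".join(name.split())
    if len(s) <= n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_blocks(names, n=2):
    """Map each n-gram to the indices of the records that contain it."""
    blocks = defaultdict(set)
    for idx, name in enumerate(names):
        for gram in char_ngrams(name, n):
            blocks[gram].add(idx)
    return blocks

def candidate_pairs(blocks):
    """Pairs of records that share at least one block."""
    pairs = set()
    for members in blocks.values():
        ordered = sorted(members)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs

names = ["王小明", "王小名", "John Smith", "Jon Smith"]
pairs = candidate_pairs(build_blocks(names))
```

Only the pairs returned here would go on to the more expensive string comparison in the second step.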

OpenCV || contour similarity

As you can see in the image, I would like to compare these contours. I need my OpenCV program to return TRUE when any of these contours are compared to each other. They all kind of look the same, but as you can see they are not exactly the same. The result you see here is what findContours returned. So I am looking for the right approach to measuring similarity between these contours. Any help would be amazing. Thank you very much in advance. Take a look at cvMatchShapes() (which used to be called cvMatchContours()). krzych: To use the matchShapes() function you should pass vector<Point>

Visual similarity search algorithm

I'm trying to build a utility like this http://labs.ideeinc.com/multicolr , but I don't know which algorithm they are using. Does anyone know? johnnycrash: All they are doing is matching histograms. So build a histogram for your images. Normalize the histograms by size of image. A histogram is a vector with as many elements as colors. You don't need 32, 24, and maybe not even 16 bits of accuracy, and this will just slow you down. For performance reasons, I would map the histograms down to 4, 8, and 10-12 bits. Do a fuzzy least distance compare between all the 4-bit histograms and your sample
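The histogram idea in that answer can be sketched as follows. This is a simplified illustration: pixels are plain `(r, g, b)` tuples with 8-bit channels, colors are reduced to 2 bits per channel (6 bits in total, a choice made here rather than one of the bit depths named in the answer), and plain L1 distance stands in for the "fuzzy least distance" comparison, whose details are not given in the excerpt.

```python
from collections import Counter

def quantize(pixel, bits_per_channel=2):
    """Reduce an 8-bit (r, g, b) pixel to a coarse color bucket."""
    shift = 8 - bits_per_channel
    r, g, b = pixel
    return (r >> shift, g >> shift, b >> shift)

def color_histogram(pixels, bits_per_channel=2):
    """Fraction of pixels per bucket, so image size does not matter."""
    total = len(pixels)
    if total == 0:
        return {}
    counts = Counter(quantize(p, bits_per_channel) for p in pixels)
    return {bucket: c / total for bucket, c in counts.items()}

def histogram_distance(h1, h2):
    """L1 distance: 0.0 for identical histograms, 2.0 for disjoint ones."""
    buckets = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in buckets)

small_red = [(255, 0, 0)] * 4
large_red = [(250, 5, 5)] * 16
blue = [(0, 0, 255)] * 4

same = histogram_distance(color_histogram(small_red), color_histogram(large_red))
different = histogram_distance(color_histogram(small_red), color_histogram(blue))
```

Here `same` comes out as 0.0 because both red images land in one bucket after quantization, while `different` reaches the maximum of 2.0.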

Tips to show similarities in files

In a project, I found some CSS files that "smell" like they contain copy-pasted rules. I wonder what your strategies are for detecting copy-pasted content in files. Just out of curiosity, I'd like to hear your tips and tricks for showing file similarities! The Chairman: Try Simian. It is used for copy-paste detection in source code (Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy), but you can run it on plain text files too. Ed Guiness: There is a Copy-Paste Detection (CPD) project on SourceForge: http://pmd.sourceforge.net/cpd.html But even in large projects I find my

How to compare image similarity using php regardless of scale, rotation?

I want to compare the similarity of the images below. According to my requirements, I want to identify all of these images as similar, since they use the same color and the same clip art. The only differences between these images are the rotation, scale, and placement of the clip art. Since all 3 t-shirts use the same color and clip art, I want to identify all 3 images as similar. I tried the method described on hackerfactor.com, but it doesn't give me correct results according to my requirements. How can I identify all these images as similar? Do you have any suggestions? Please help me. The below images

What is the best algorithm for matching two strings containing less than 10 words in Latin script

Question: I'm comparing song titles, using Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles seem to be the same title and a very low score if they have nothing in common. I already had code (Java) that did this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to using https://github.com/nickmancol/simmetrics which has many nice algorithms
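As a lightweight baseline for short titles, a normalized edit distance is easy to write by hand. The sketch below is in Python rather than the asker's Java and does not use simmetrics; the function names and the normalization (case folding plus whitespace collapsing) are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalize(title):
    """Case-fold and collapse runs of whitespace."""
    return " ".join(title.casefold().split())

def title_similarity(t1, t2):
    """Score in [0, 1]: 1.0 for equal titles, 0.0 for nothing in common."""
    a, b = normalize(t1), normalize(t2)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

score = title_similarity("Hey Jude", "hey  jude")
```

Character-level edit distance is sensitive to word order, so titles such as "Jude, Hey" would score low; a token-based measure may suit those cases better.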

Algorithm to find related words in a text

Question: I would like to take a word (e.g. "Apple") and process a text (or maybe more than one). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone, Mac are terms related to "Apple". Any idea on how to solve this? Answer 1: As a starting point: your question relates to text mining. There are two ways: a statistical approach, and one from natural language processing (NLP). I do not know much about NLP, but can say something about the statistical approach: You
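The answer is cut off before it describes its statistical approach, so the following is only one common possibility, not necessarily what the answerer had in mind: count which words appear within a few tokens of the target word. The function name, the window size, and the regex tokenizer are all assumptions.

```python
import re
from collections import Counter

def related_terms(text, target, window=5, top_n=5):
    """Words that most often occur within `window` tokens of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    target = target.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the target itself, including other occurrences of it.
            if j != i and tokens[j] != target:
                counts[tokens[j]] += 1
    return counts.most_common(top_n)

text = "Apple sells the iPhone. Apple sells the Mac."
neighbours = related_terms(text, "Apple", window=2)
```

Raw counts are dominated by frequent words such as "the", so in practice a stop-word list or an association score that corrects for overall word frequency would be needed before terms like iPod or Mac rise to the top.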