similarity

Cosine Similarity

走远了吗. submitted on 2019-12-04 01:54:41
Question: I was reading and came across a formula for cosine similarity. I thought it looked interesting, so I created a numpy array that has user_id as rows and item_id as columns. For instance, let M be this matrix: M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]] Here each entry is the rating that user u has given to item i, at row u and column i. I want to calculate the cosine similarity between items (the columns) of this matrix. This should yield a 5 x 5
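A minimal sketch of item-to-item cosine similarity for the matrix M in the question, assuming items are the columns and that an all-zero column should score 0 against everything. The function name `item_cosine_similarity` is made up for illustration and does not come from the question.

```python
import numpy as np

# Ratings matrix from the question: rows are users, columns are items.
M = np.array([[2, 3, 4, 1, 0],
              [0, 0, 0, 0, 5],
              [5, 4, 3, 0, 0],
              [1, 1, 1, 1, 1]], dtype=float)

def item_cosine_similarity(ratings):
    """Return the (n_items x n_items) cosine similarity between columns."""
    norms = np.linalg.norm(ratings, axis=0)  # Euclidean length of each item column
    norms[norms == 0] = 1.0                  # assumption: avoid dividing by zero
    normalized = ratings / norms             # scale every column to unit length
    return normalized.T @ normalized         # dot products of unit columns

S = item_cosine_similarity(M)
print(S.shape)  # (5, 5)
```

Normalizing the columns first turns the whole computation into a single matrix product, which avoids an explicit double loop over item pairs.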

Detect duplicated/similar text among large datasets?

旧街凉风 submitted on 2019-12-03 20:59:23
I have a large database with thousands of records. Every time a user posts their information, I need to know whether the same or a similar record already exists. Are there any algorithms or open source implementations that solve this problem? We're using Chinese, and 'similar' means the records have mostly identical content, perhaps 80%-100% the same. Each record is not too big, about 2k-6k bytes. http://d3s.mff.cuni.cz/~holub/sw/shash/ http://matpalm.com/resemblance/simhash/ This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your
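The links above point at simhash-style fingerprinting. The following is a rough sketch of that idea, not the code behind either link. It assumes character 3-grams as features (which sidesteps word segmentation for Chinese text) and MD5 truncated to 64 bits as the feature hash. The names `shingles`, `simhash` and `hamming_distance` are illustrative.

```python
import hashlib

def shingles(text, n=3):
    """Overlapping character n-grams; works on any Unicode text."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def simhash(text, n=3, bits=64):
    """Fingerprint whose bits are a majority vote over the feature hashes."""
    counts = [0] * bits
    for s in shingles(text, n):
        digest = hashlib.md5(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

record_a = "今天天气很好，我们一起去公园散步吧。"
record_b = "今天天气很好，我们一起去公园跑步吧。"
print(hamming_distance(simhash(record_a), simhash(record_b)))
```

Records whose fingerprints differ in only a few bits are candidates for being near-duplicates. The cutoff that corresponds to "80%-100% the same" would have to be tuned on real data.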

Cosine Similarity of Vectors of different lengths?

柔情痞子 submitted on 2019-12-03 17:30:56
Question: I'm trying to use TF-IDF to sort documents into categories. I've calculated the tf_idf for some documents, but now when I try to calculate the cosine similarity between two of these documents I get a traceback saying: #len(u)==201, len(v)==246 cosine_distance(u, v) ValueError: objects are not aligned #this works though: cosine_distance(u[:200], v[:200]) >> 0.52230249969265641 Is slicing the vector so that len(u)==len(v) the right approach? I would think that cosine similarity would work with
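Cosine similarity is only meaningful when position i in both vectors refers to the same term. Slicing makes the lengths match, but it does not fix that alignment. One way to avoid the problem is to key the weights by term. The sketch below assumes each document's TF-IDF weights are stored in a dict mapping term to weight. That layout is an assumption for illustration, not the asker's actual data structure.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse term -> weight mappings."""
    shared_terms = set(u) & set(v)  # only shared terms contribute to the dot product
    dot = sum(u[t] * v[t] for t in shared_terms)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                  # assumption: an empty document matches nothing
    return dot / (norm_u * norm_v)

doc_u = {"cat": 0.5, "dog": 0.5, "fish": 0.1}
doc_v = {"cat": 0.4, "bird": 0.9}
print(cosine_similarity(doc_u, doc_v))
```

A term missing from one document is treated as weight 0, so documents with different vocabularies can be compared without truncating either one.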

n-gram name analysis in non-english languages (CJK, etc)

不羁岁月 submitted on 2019-12-03 16:33:34
I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word
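The two-step process described above can be sketched roughly as follows. This is a simplified illustration, not the asker's code. It blocks on character bigrams only (no initials). It also swaps in `difflib.SequenceMatcher` from the standard library for Jaro-Winkler, which is a different metric. Character n-grams are taken over Unicode code points, so no word boundaries are needed for CJK names. All function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def char_ngrams(name, n=2):
    """Set of character n-grams of a name with whitespace removed."""
    compact = "".join(name.lower().split())
    if len(compact) <= n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}

def build_blocks(names, n=2):
    """Step 1: bin record indices by every n-gram present in the name."""
    blocks = defaultdict(set)
    for idx, name in enumerate(names):
        for gram in char_ngrams(name, n):
            blocks[gram].add(idx)
    return blocks

def candidate_pairs(blocks):
    """Only records sharing at least one bin are ever compared."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def score(a, b):
    """Step 2 stand-in: SequenceMatcher ratio instead of Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()

names = ["John Smith", "Jon Smith", "王小明", "王小明明", "Alice Jones"]
pairs = candidate_pairs(build_blocks(names))
for i, j in sorted(pairs):
    print(names[i], names[j], round(score(names[i], names[j]), 2))
```

Short n-grams produce large bins, so real data would likely need a longer n or an extra key such as initials to keep the number of candidate pairs down.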

OpenCV || contour similarity

杀马特。学长 韩版系。学妹 submitted on 2019-12-03 14:11:18
As you can see in the image, I would like to compare these contours. I need my OpenCV program to return TRUE when any of these contours are compared to each other. They all kind of look the same, but as you can see they are not exactly the same. The result you see here is what the function findContours returned. So I am looking for the right approach to measuring similarity between these contours. Any help would be amazing. Thank you very much in advance. Take a look at cvMatchShapes() (which used to be called cvMatchContours()). krzych To use the matchShapes() function you should pass vector<Point>

Visual similarity search algorithm

徘徊边缘 submitted on 2019-12-03 13:12:35
I'm trying to build a utility like this: http://labs.ideeinc.com/multicolr but I don't know which algorithm they are using. Does anyone know? johnnycrash All they are doing is matching histograms. So build a histogram for your images. Normalize the histograms by the size of the image. A histogram is a vector with as many elements as colors. You don't need 32, 24, and maybe not even 16 bits of accuracy; that will just slow you down. For performance reasons, I would map the histograms down to 4, 8, and 10-12 bits. Do a fuzzy least distance compare between all the 4 bit histograms and your sample
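The histogram matching described in the answer could look roughly like the sketch below. It assumes images arrive as NumPy arrays of shape (height, width, 3) with 8-bit RGB values. It quantizes each channel to 4 bits, giving 12-bit color codes, and compares histograms with an L1 distance. The answer's "fuzzy least distance compare" is not specified in the excerpt, so L1 is a stand-in. The function names are illustrative.

```python
import numpy as np

def color_histogram(image, bits_per_channel=4):
    """Normalized histogram over quantized RGB colors."""
    shift = 8 - bits_per_channel
    q = image.astype(np.intp) >> shift  # keep only the top bits of each channel
    codes = ((q[..., 0] << (2 * bits_per_channel))
             | (q[..., 1] << bits_per_channel)
             | q[..., 2])               # pack R, G, B into one color code
    hist = np.bincount(codes.ravel(), minlength=1 << (3 * bits_per_channel))
    return hist / hist.sum()            # normalize by the number of pixels

def histogram_distance(h1, h2):
    """L1 distance: 0.0 for identical color mixes, 2.0 for disjoint ones."""
    return float(np.abs(h1 - h2).sum())

red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 255
blue = np.zeros((8, 2, 3), dtype=np.uint8)
blue[..., 2] = 255
print(histogram_distance(color_histogram(red), color_histogram(blue)))  # 2.0
```

Because each histogram is divided by its pixel count, images of different sizes remain directly comparable.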

Tips to show similarities in files

这一生的挚爱 submitted on 2019-12-03 12:44:39
In a project, I found some CSS files that "smell" like they contain copy-pasted rules. I wonder what your strategies are for detecting copy-pasted content in files. Just out of curiosity, I'd like to hear your tips and tricks for showing file similarities! The Chairman Try Simian. It is used for copy-paste detection in source code (Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy), but you can run it on plain text files too. Ed Guiness There is a Copy-Paste Detection (CPD) project on sourceforge: http://pmd.sourceforge.net/cpd.html But even in large projects I find my

How to compare image similarity using php regardless of scale, rotation?

那年仲夏 submitted on 2019-12-03 10:52:22
I want to compare the similarity of the images below. According to my requirements, I want to identify all of these images as similar, since they use the same color and the same clip art. The only differences between these images are the rotation, scale, and placement of the clip art. Since all 3 t-shirts use the same color and clip art, I want to identify all 3 images as similar. I tried the method described at hackerfactor.com, but it doesn't give me the correct result for my requirements. How can I identify all these images as similar? Do you have any suggestions? Please help me. The below images

What is the best algorithm for matching two string containing less than 10 words in latin script

你说的曾经没有我的故事 submitted on 2019-12-03 10:12:04
Question: I'm comparing song titles, using Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles seem to be the same title and a very low score if they have nothing in common. I already wrote code (Java) for this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to using https://github.com/nickmancol/simmetrics which has many nice algorithms
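One lightweight way to get a score between 0 and 1 for two short titles is a normalized edit distance. The following Python sketch shows the idea only. It is not the simmetrics API, and the normalization (lowercasing and collapsing whitespace) is an assumption about what "seem to be the same title" should tolerate.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def title_similarity(t1, t2):
    """1.0 for identical titles after normalization, 0.0 for nothing in common."""
    a = " ".join(t1.lower().split())
    b = " ".join(t2.lower().split())
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

print(title_similarity("Hey Jude", "hey  jude"))  # 1.0
print(title_similarity("Let It Be", "Let It Bee"))
```

For titles under ten words the quadratic cost of the edit distance is negligible, which is the main appeal over indexing the strings in a search engine first.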

Algorithm to find related words in a text

こ雲淡風輕ζ submitted on 2019-12-03 10:03:59
Question: I would like to take a word (e.g. "Apple") and process a text (or maybe more than one). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone, Mac are terms related to "Apple". Any idea on how to solve this? Answer 1: As a starting point: your question relates to text mining. There are two ways: a statistical approach, and one from natural language processing (NLP). I do not know much about NLP, but can say something about the statistical approach: You