similarity

SQL - Similarity between two strings of varying length

微笑、不失礼 submitted on 2019-12-21 04:26:34
Question: I have a SQL Server table of products, and each product has a description that is publicly available on our website. I want to prevent, or at least warn our users when, a description is too similar to another product's description. The length of each product's description can vary greatly. I'd like to query for products whose descriptions share duplicate or similar paragraphs or blocks of text. I.e., String A has a bunch of unique content, but shares a similar/identical paragraph w/
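
One way to flag shared paragraphs between two descriptions is to compare paragraphs by their overlapping word sequences (shingles) with Jaccard similarity. The following is a minimal sketch in Python rather than T-SQL; the function names, the blank-line paragraph split, and the 0.6 threshold are assumptions chosen for illustration, not part of the original question.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def shared_paragraphs(desc_a, desc_b, threshold=0.6):
    """Return (paragraph_a, paragraph_b, score) for similar paragraph pairs."""
    paras_a = [p.strip() for p in desc_a.split("\n\n") if p.strip()]
    paras_b = [p.strip() for p in desc_b.split("\n\n") if p.strip()]
    hits = []
    for pa in paras_a:
        for pb in paras_b:
            score = jaccard(shingles(pa), shingles(pb))
            if score >= threshold:
                hits.append((pa, pb, score))
    return hits
```

Because the comparison is per paragraph, two descriptions of very different lengths can still be flagged when only one block of text is shared.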

Tips to show similarities in files

我是研究僧i submitted on 2019-12-21 04:13:04
Question: In a project, I found some CSS files that "smell" like they contain copy-pasted rules. I wonder what your strategies are for detecting copy-pasted content in files. Just out of curiosity, I'd like to hear your tips and tricks for showing file similarities! Answer 1: Try Simian. It is used for copy-paste detection in source code (Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy), but you can run it on plain text files too. Answer 2: There is a Copy-Paste Detection (CPD) project on
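
For a rough idea of what such tools look for, the sketch below reports blocks of consecutive lines that occur more than once after whitespace is normalized. It is a simplified illustration in Python and makes no claim about how Simian or CPD work internally; `duplicate_blocks` and the window size of 3 are assumptions.

```python
from collections import defaultdict


def duplicate_blocks(lines, window=3):
    """Map each repeated block of `window` normalized lines to its start indices."""
    # Collapse runs of whitespace so formatting differences do not hide copies.
    normalized = [" ".join(line.split()) for line in lines]
    seen = defaultdict(list)
    for i in range(len(normalized) - window + 1):
        block = tuple(normalized[i:i + window])
        if all(block):  # skip windows that contain a blank line
            seen[block].append(i)
    return {block: starts for block, starts in seen.items() if len(starts) > 1}
```

Calling `duplicate_blocks(open(path).read().splitlines())` on a stylesheet would list the zero-based line indices where each repeated block begins.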

How to detect similar Images in PHP?

无人久伴 submitted on 2019-12-21 02:50:06
Question: I have many files of the same picture in various resolutions, suitable for every device such as mobile, PC, PSP, etc. Now I am trying to display only unique pictures on the page, but I don't know how. I could have avoided this if I had maintained a database in the first place, but I didn't. I need your help detecting the largest unique pictures. Answer 1: Well, even though there are quite a few algorithms to do that, I believe it would still be faster to do it manually. Download all the images feed

Python equivalent of daisy() in the cluster package of R

微笑、不失礼 submitted on 2019-12-20 09:38:17
Question: I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

    if(!require("cluster")) { install.packages("cluster"); require("cluster") }
    data(flower)
    as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to deal with the nominal variables
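
The basic Gower computation can be sketched in plain Python: numeric columns contribute a range-normalized absolute difference, categorical columns contribute 0 for a match and 1 for a mismatch, and the per-column values are averaged. This is a simplified illustration under stated assumptions, not a drop-in replacement for daisy(): the name `gower_matrix` is hypothetical, ordinal columns are treated like nominal ones, and missing values and column weights are not handled.

```python
def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarity for rows of mixed-type tuples.

    rows: list of equal-length tuples; numeric_cols: indices of numeric columns.
    """
    n_cols = len(rows[0])
    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = {}
    for c in numeric_cols:
        values = [row[c] for row in rows]
        ranges[c] = max(values) - min(values)

    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for c in range(n_cols):
                if c in ranges:
                    if ranges[c] != 0:
                        total += abs(rows[i][c] - rows[j][c]) / ranges[c]
                else:
                    # Categorical: simple matching.
                    total += 0.0 if rows[i][c] == rows[j][c] else 1.0
            matrix[i][j] = matrix[j][i] = total / n_cols
    return matrix
```

For example, `gower_matrix([(1.0, "red"), (3.0, "red"), (5.0, "blue")], numeric_cols=[0])` compares three observations with one numeric and one nominal attribute.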

How to find a similar word for a misspelled one in PHP?

大城市里の小女人 submitted on 2019-12-20 09:01:47
Question: I'll explain my problem: I have a database table called country. It has two columns: ID and name. When I want to search for 'paris' but misspell the word as 'pares' ('e' instead of 'i'), I won't get any result from the DB. I want the system to suggest similar words that could help in the search. So, I am looking for help writing a script that makes suggestions from the DB that contain similar words like: paris, paredes, ... etc. Answer 1: In PHP you should use metaphone it is more accurate
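
As an illustration of the suggestion step, here is a minimal Python sketch that uses the standard library's `difflib.get_close_matches` in place of the PHP metaphone approach the answer mentions. The `names` list stands in for values read from the name column, and the `suggest` helper and its cutoff of 0.6 are assumptions.

```python
import difflib


def suggest(term, names, limit=3, cutoff=0.6):
    """Return up to `limit` names that closely resemble `term`."""
    # Compare case-insensitively, but return the names as stored.
    by_lower = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(term.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[m] for m in matches]


# Hypothetical contents of the `name` column.
names = ["paris", "paredes", "london", "madrid"]
suggestions = suggest("pares", names)
```

Raising the cutoff makes suggestions stricter; lowering it returns looser matches at the cost of more noise.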

Finding the closest match

微笑、不失礼 submitted on 2019-12-18 12:01:30
Question: I have an object with a set of parameters like:

    var obj = new { Param1 = 100, Param2 = 212, Param3 = 311, param4 = 11, Param5 = 290 };

On the other side I have a list of objects:

    var obj1 = new { Param1 = 1221, Param2 = 212, Param3 = 311, param4 = 11, Param5 = 290 };
    var obj3 = new { Param1 = 35, Param2 = 11, Param3 = 319, param4 = 211, Param5 = 790 };
    var obj4 = new { Param1 = 126, Param2 = 218, Param3 = 2, param4 = 6, Param5 = 190 };
    var obj5 = new { Param1 = 213, Param2 = 121,
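
What counts as "closest" depends on the metric. The Python sketch below uses only the three candidates that appear in full in the question and compares two possible definitions: smallest Euclidean distance and largest number of exactly matching parameters. Both helper names are hypothetical, and neither definition is confirmed by the question as the intended one.

```python
import math

# Values in the order Param1, Param2, Param3, param4, Param5.
target = (100, 212, 311, 11, 290)
candidates = {
    "obj1": (1221, 212, 311, 11, 290),
    "obj3": (35, 11, 319, 211, 790),
    "obj4": (126, 218, 2, 6, 190),
}


def closest_by_distance(target, candidates):
    """Name of the candidate with the smallest Euclidean distance to target."""
    return min(candidates, key=lambda name: math.dist(target, candidates[name]))


def closest_by_exact_matches(target, candidates):
    """Name of the candidate sharing the most identical parameter values."""
    def matches(name):
        return sum(t == c for t, c in zip(target, candidates[name]))
    return max(candidates, key=matches)
```

With this data the two definitions disagree: obj1 matches four of five parameters exactly but is far away on Param1, so Euclidean distance prefers obj4. If the parameters have different scales, normalizing each one before computing distances is usually worth considering.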

Cosine Similarity

試著忘記壹切 submitted on 2019-12-17 21:57:29
Question: I calculated tf/idf values of two documents. The following are the tf/idf values:

    1.txt  0.0  0.5
    2.txt  0.0  0.5

The documents are like:

    1.txt => dog cat
    2.txt => cat elephant

How can I use these values to calculate cosine similarity? I know that I should calculate the dot product, then find the distance and divide the dot product by it. How can I calculate this using my values? One more question: Is it important that both documents have the same number of words?

Answer 1:

                 a * b
    sim(a,b) = --------
                 |a|*
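
Cosine similarity is the dot product of two vectors divided by the product of their magnitudes. The documents do not need the same number of words, but both vectors must be laid out over the same vocabulary. The Python sketch below assumes a vocabulary order of (cat, dog, elephant) and assumes that each document's listed 0.0 belongs to the shared word "cat" and the 0.5 to its unique word; that reading of the question's values is a guess.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Assumed vocabulary order: (cat, dog, elephant).
doc1 = (0.0, 0.5, 0.0)  # "dog cat"
doc2 = (0.0, 0.0, 0.5)  # "cat elephant"
score = cosine_similarity(doc1, doc2)
```

Under these assumptions the score is 0: the only shared word, "cat", has a tf-idf weight of 0.0 because it occurs in every document, so it contributes nothing to the dot product.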

How do I create a similarity matrix in MATLAB?

让人想犯罪 __ submitted on 2019-12-17 20:55:43
Question: I am working towards comparing multiple images. I have the image data as column vectors of a matrix called "images." I want to assess the similarity of the images by first computing their Euclidean distance. I then want to create a matrix over which I can execute multiple random walks. Right now, my code is as follows:

    % clear
    % clc
    % close all
    %
    % load tea.mat;
    images = Input.X;
    M = zeros(size(images, 2), size(images, 2));
    for i = 1:size(images, 2)
        for j = 1:size(images, 2)
            normImageTemp =
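
The pairwise-distance step can be sketched as follows. This is a plain-Python illustration rather than MATLAB, and it covers only the distance matrix, not the random walks; `distance_matrix` is a hypothetical name, and the input is assumed to be a list of rows in which each column holds one flattened image, mirroring the layout described in the question.

```python
import math


def distance_matrix(images):
    """Symmetric matrix of Euclidean distances between the columns of `images`."""
    columns = list(zip(*images))  # one tuple per image
    n = len(columns)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances are symmetric, so only the upper triangle is computed.
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = math.dist(columns[i], columns[j])
    return M
```

Computing only the upper triangle halves the work compared with looping over every (i, j) pair, and the diagonal stays at zero because each image is at distance 0 from itself.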

Similar UTF-8 strings for autocomplete field

醉酒当歌 submitted on 2019-12-17 19:54:31
Question:

Background

Users can type in a name and the system should match the text, even if either the user input or the database field contains accented (UTF-8) characters. This is using the pg_trgm module.

Problem

The code resembles the following:

    SELECT t.label
    FROM the_table t
    WHERE label % 'fil'
    ORDER BY similarity( t.label, 'fil' ) DESC

When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)

Failed Solution #1

I tried to implement an
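
To illustrate the general idea of accent-insensitive trigram matching, the Python sketch below strips combining marks before building trigram sets. It is not pg_trgm's implementation: the word splitting is simplified to whitespace, the padding (two leading spaces, one trailing) only imitates the scheme pg_trgm documents, and all three function names are hypothetical. Inside PostgreSQL, the unaccent extension is one option to look at for the same normalization step.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'filé' becomes 'file'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(text):
    """Set of padded three-character sequences for each word in text."""
    grams = set()
    for word in strip_accents(text).lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

After normalization, "filé powder" and "file powder" produce identical trigram sets, so the accent no longer affects the score.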

Javascript text similarity algorithm

一个人想着一个人 submitted on 2019-12-17 18:38:43
Question: I'm building a website that should collect various news feeds, and I would like the texts to be compared for similarity. What I need is some sort of news text similarity algorithm. I know that PHP has the similar_text function, but I am not sure how good it is, and I need it for JavaScript. So could anyone point me to an example, a plugin, or any instructions on how this is possible, or at least where to start investigating? Answer 1: There's a JavaScript implementation of the Levenshtein
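
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is written in Python for illustration (the same loop structure ports directly to JavaScript); the `similarity` helper that rescales the distance to a 0 to 1 score is an assumption, not something the answer specifies.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Rescale edit distance to a score between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Character-level edit distance suits short strings such as headlines; for full news articles the quadratic cost grows quickly, so word-level measures are often a better fit.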