similarity | 易学教程

Finding similar strings with PostgreSQL quickly

I need to create a ranking of similar strings in a table. I have the following table: create table names ( name character varying(255) ); Currently, I'm using the pg_trgm module, which offers the similarity function, but I have an efficiency problem. I created an index as the Postgres manual suggests: CREATE INDEX trgm_idx ON names USING gist (name gist_trgm_ops); and I'm executing the following query: select (similarity(n1.name, n2.name)) as sim, n1.name, n2.name from names n1, names n2 where n1.name != n2.name and similarity(n1.name, n2.name) > .8 order by sim desc; The query works, but is ...
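
To make the measure in the question concrete, here is a rough Python sketch of trigram similarity in the spirit of pg_trgm: each word is lowercased, padded with two leading spaces and one trailing space, and the score is the number of shared trigrams divided by the number of distinct trigrams. The function names `trigrams`, `trigram_similarity`, and `similar_pairs` are hypothetical, and the word splitting is an ASCII-only assumption, so the results may differ from PostgreSQL for other input.

```python
import re
from itertools import combinations


def trigrams(text):
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_pairs(names, threshold=0.8):
    """Compare every pair of names, like the self-join in the question."""
    pairs = []
    for n1, n2 in combinations(names, 2):
        sim = trigram_similarity(n1, n2)
        if sim > threshold:
            pairs.append((sim, n1, n2))
    return sorted(pairs, reverse=True)
```

Like the self-join in the query, `similar_pairs` compares every pair of names, so its cost grows quadratically with the size of the table.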

Comparing strings with tolerance

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on. Is there some kind of framework which can perform such a search? What I have in mind is a search algorithm that returns a few results ordered by percentage of match or something like that.

You could use the Levenshtein Distance algorithm. "The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with ...
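
As an illustration of the suggestion above, here is a minimal Python sketch of the Levenshtein distance, plus a helper that ranks candidates by a match percentage. The helper `rank_matches` and its scoring (1 minus distance divided by the longer length) are assumptions made for this example, not part of the quoted answer.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_matches(query, candidates):
    """Return (score, candidate) pairs, best match first; score is in [0, 1]."""
    results = []
    for cand in candidates:
        longest = max(len(query), len(cand)) or 1  # avoid dividing by zero
        score = 1 - levenshtein(query.lower(), cand.lower()) / longest
        results.append((score, cand))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

For example, `rank_matches("appel", ["apple", "maple", "banana"])` puts "apple" first.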

Word comparison algorithm

Question: I am building a CSV import tool for the project I'm working on. The client needs to be able to enter the data in Excel, export it as CSV and upload it to the database. For example, I have this CSV record: 1, John Doe, ACME Comapny (the typo is on purpose). Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in ...
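
One possible way to do the lookup described in the question is sketched below in Python, using `difflib.get_close_matches` from the standard library. The function `find_company_id`, the `companies` mapping (company ID to name), and the 0.8 cutoff are all assumptions for illustration; the question does not say which comparison it ends up using.

```python
import difflib


def find_company_id(raw_name, companies, cutoff=0.8):
    """Return the ID of the closest company name, or None if nothing is close enough.

    companies is a hypothetical mapping of company ID -> company name.
    """
    # Normalise case and surrounding whitespace before comparing.
    by_name = {name.strip().lower(): cid for cid, name in companies.items()}
    matches = difflib.get_close_matches(
        raw_name.strip().lower(), list(by_name), n=1, cutoff=cutoff
    )
    return by_name[matches[0]] if matches else None


companies = {1: "ACME Company", 2: "Globex Corporation", 3: "Initech"}
company_id = find_company_id("ACME Comapny", companies)
```

Returning `None` below the cutoff leaves room to ask the user for confirmation instead of silently linking the record to the wrong company.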

Selecting close matches from one array based on another reference array

Question: I have an array A and a reference array B. A is at least as large as B, e.g. A = [2,100,300,793,1300,1500,1810,2400] B = [4,305,789,1234,1890] B holds the positions of peaks in a signal at a specified time, and A contains the positions of peaks at a later time. But some of the elements in A are not actually the peaks I want (possibly due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B, and in the example given ...
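
A minimal Python sketch of one reading of this problem: for each element of B, pick the nearest element of A. Treating "close" as "nearest by absolute difference" is an assumption, and the function name `closest_matches` is hypothetical. The sketch does not stop two elements of B from selecting the same element of A.

```python
from bisect import bisect_left


def closest_matches(a, b):
    """For each value in b, return the nearest value from a (a must be non-empty)."""
    a_sorted = sorted(a)
    picked = []
    for target in b:
        i = bisect_left(a_sorted, target)
        # Only the neighbours on either side of the insertion point can be nearest.
        neighbours = a_sorted[max(i - 1, 0):i + 1]
        picked.append(min(neighbours, key=lambda x: abs(x - target)))
    return picked


A = [2, 100, 300, 793, 1300, 1500, 1810, 2400]
B = [4, 305, 789, 1234, 1890]
real_peaks = closest_matches(A, B)  # [2, 300, 793, 1300, 1810]
```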

how to compute similarity between two strings in MYSQL

If I have two strings in MySQL: @a="Welcome to Stack Overflow" @b=" Hello to stack overflow"; is there a way to get the similarity percentage between those two strings using MySQL? Here, for example, 3 words are similar, so the similarity should be something like: count(similar words between @a and @b) / (count(@a)+count(@b) - count(intersection)) and thus the result is 3/(4 + 4 - 3) = 0.6. Any idea is highly appreciated! Alaa

You can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552 ): CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int ...
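
The word-overlap formula in the question can be checked with a short Python sketch. It treats each string as a set of lowercased, whitespace-separated words, which is an assumption: the question's count(@a)+count(@b) - count(intersection) equals the size of the union only when no word repeats within a string. The function name `word_similarity` is hypothetical.

```python
def word_similarity(a, b):
    """Shared words divided by all distinct words, ignoring case."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


score = word_similarity("Welcome to Stack Overflow", " Hello to stack overflow")
# 3 shared words out of 5 distinct words gives 0.6, as in the question
```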

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question: Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times. Say the input matrix is:
A =
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]
The sparse representation is:
A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3
In Python, it's straightforward to work with the matrix-input format:
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import ...
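
As a rough illustration of working directly from the sparse listing, the pure-Python sketch below computes row-to-row cosine similarity for a binary matrix given as (row, column) pairs. It only visits pairs of rows that share at least one column, so rows with no overlap (similarity 0) are never compared. The function name `row_cosine_similarities` is hypothetical, and this is not the approach from the truncated question or its answers.

```python
from collections import defaultdict
from math import sqrt


def row_cosine_similarities(entries):
    """entries: (row, col) positions of the 1s in a binary matrix.

    Returns {(r1, r2): cosine} for every pair of rows with r1 < r2 that
    share at least one column; all other pairs have similarity 0.
    """
    rows_by_col = defaultdict(set)
    row_sizes = defaultdict(int)
    for r, c in set(entries):
        rows_by_col[c].add(r)
        row_sizes[r] += 1  # for 0/1 rows, the squared norm is the count of 1s

    dots = defaultdict(int)
    for rows in rows_by_col.values():
        ordered = sorted(rows)
        for i, r1 in enumerate(ordered):
            for r2 in ordered[i + 1:]:
                dots[(r1, r2)] += 1  # each shared column adds 1 to the dot product

    return {pair: dot / sqrt(row_sizes[pair[0]] * row_sizes[pair[1]])
            for pair, dot in dots.items()}


entries = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 3)]
sims = row_cosine_similarities(entries)
```

Swapping the two values in each pair gives column-to-column similarity instead.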

Algorithm to find articles with similar text

Question: I have many articles in a database (with title, text), and I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question. I tried googling for this but only found pages about other "similar text" issues, something like comparing every article with all the others and storing a similarity somewhere. SO does this in "real time" on text that I just typed. How? Answer 1: Edit distance isn't a likely candidate, as it ...

Find cosine similarity between two arrays

Question: I'm wondering if there is a built-in function in R that can find the cosine similarity (or cosine distance) between two arrays? Currently, I have implemented my own function, but I can't help but think that R should already come with one. Answer 1: These sorts of questions come up all the time (for me, and, as evidenced by the r-tagged SO question list, for others as well): is there a function, either in R core or in any R package, that does x? And if so, where can I find it among the 2000+ R packages in ...

get cosine similarity between two documents in lucene

Question: I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index. For example, I am getting the documents with ids 2 and 4 from a previously opened IndexReader ir. Document d1 = ir.document(2); Document d2 = ir.document(4); How can I get the cosine similarity between these two documents? Thank you. Answer 1: When indexing, there's an option to store term frequency vectors. During runtime, look up the ...
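
To show what a cosine score over term frequency vectors looks like, here is a small Python sketch. It does not use the Lucene API: the `Counter` objects stand in for term frequency vectors that would be read from the index, and the names `term_frequencies` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text):
    """Stand-in for a stored term vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(tf1, tf2):
    """Cosine of the angle between two term frequency vectors."""
    dot = sum(count * tf2[term] for term, count in tf1.items())
    norm1 = sqrt(sum(c * c for c in tf1.values()))
    norm2 = sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm1 * norm2)


d1 = term_frequencies("the quick fox")
d2 = term_frequencies("the lazy dog")
score = cosine_similarity(d1, d2)  # one shared term out of three each
```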
