similarity | 易学教程

String similarity in PHP: levenshtein like function for long strings

Question: The levenshtein function in PHP works only on strings with a maximum length of 255. What are good alternatives for computing a similarity score between sentences in PHP? Basically, I have a database of sentences, and I want to find approximate duplicates. The similar_text function is not giving me the expected results. What is the easiest way for me to detect similar sentences like the ones below? $ss="Jack is a very nice boy, isn't he?"; $pp="jack is a very nice boy is he"; $ss=strtolower($ss); // convert to lower case as we
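As a rough illustration of one alternative to character-level edit distance, the sketch below scores two sentences by the overlap of their word sets (Jaccard similarity) after lowercasing and dropping punctuation, so string length is not a constraint. It is written in Python rather than PHP, the helper names `tokenize` and `jaccard_similarity` are hypothetical, and it is only one possible approach, not the method the excerpt goes on to describe.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens (apostrophes kept)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard_similarity(a, b):
    """Return the number of shared words divided by the number of distinct words."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a and not words_b:
        return 1.0  # assumption: two empty sentences count as identical
    return len(words_a & words_b) / len(words_a | words_b)


ss = "Jack is a very nice boy, isn't he?"
pp = "jack is a very nice boy is he"
score = jaccard_similarity(ss, pp)
print(score)  # 7 shared words out of 8 distinct words -> 0.875
```

A threshold on this score (for example, treating anything above 0.8 as a likely duplicate) would have to be tuned on real data; word-set overlap ignores word order, which may or may not be acceptable.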

Hamming Distance / Similarity searches in a database

Question: I have a process, similar to TinEye, that generates perceptual hashes; these are 32-bit ints. I intend to store these in a SQL database (maybe a NoSQL db) in the future. However, I'm stumped at how I would be able to retrieve records based on the similarity of hashes. Any ideas?

Answer 1: A common approach (at least common to me) is to divide your hash bit string into several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You can then perform a bitwise hamming
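The chunk pre-filter described in the answer might be sketched as follows. This is a minimal in-memory sketch in Python, assuming 32-bit hashes split into four 8-bit chunks and a final Hamming distance check; the function names and the `max_distance` parameter are hypothetical, and in a real database the chunks would be stored in indexed columns instead of being scanned in a loop.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


def chunks(h, n_chunks=4, bits=32):
    """Split a `bits`-wide hash into equal-width pieces, most significant first."""
    width = bits // n_chunks
    mask = (1 << width) - 1
    return [(h >> (width * i)) & mask for i in reversed(range(n_chunks))]


def find_similar(query, stored_hashes, max_distance=3):
    """Pre-filter on exact chunk matches, then confirm with the full Hamming distance."""
    q_chunks = chunks(query)
    results = []
    for h in stored_hashes:
        # Pre-filter: keep only hashes sharing at least one chunk in the same position.
        if any(qc == hc for qc, hc in zip(q_chunks, chunks(h))):
            # Exact check on the surviving candidates.
            if hamming_distance(query, h) <= max_distance:
                results.append(h)
    return results


stored = [0xDEADBEEF, 0xDEADBEEE, 0x00000000, 0xDEADBE00]
print([hex(h) for h in find_similar(0xDEADBEEF, stored)])
```

With four chunks, two hashes that differ in at most three bits must agree on at least one whole chunk, so the pre-filter cannot miss a match for `max_distance <= 3`; larger thresholds would need more chunks.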

Python: Semantic similarity score for Strings [duplicate]

Question: This question already has an answer here: How to compute the similarity between two text documents? (9 answers) Are there any libraries for computing semantic similarity scores for a pair of sentences? I'm aware of WordNet's semantic database, and how I can generate the score for two words, but I'm looking for libraries that do all the pre-processing tasks, like Porter stemming, stop word removal, etc., on whole sentences and output a score for how related the two sentences are. I found a work in

Algorithm to find articles with similar text

I have many articles in a database (with title and text), and I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question. I tried googling for this but only found pages about other "similar text" issues, something like comparing every article with all the others and storing a similarity somewhere. SO does this in "real time" on text that I just typed. How?

Edit distance isn't a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe,

String similarity -> Levenshtein distance

Question: I'm using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I'm making, so it needs to be effective. The problem is that the algorithm doesn't treat the following examples as similar: CONAIR and AIRCON. The algorithm gives a distance of 6. So for this word of 6 letters (you look at the word with the most letters), the difference is 100%, so the similarity is 0%. I need to find a way to find the similarities between
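The numbers in the question can be reproduced with a small sketch. The Python code below is a textbook two-row dynamic-programming Levenshtein distance, followed by the length-normalized similarity the asker describes (distance divided by the length of the longer word); the function names are hypothetical and this is not the asker's own implementation.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """1 - distance / length of the longer string, as a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("CONAIR", "AIRCON"), similarity("CONAIR", "AIRCON"))  # 6 0.0
```

The result is 0% because every aligned position differs and the shared blocks "CON" and "AIR" sit three positions apart, so Levenshtein distance cannot reward the swap; a measure that compares substrings or character n-grams independently of position would behave differently.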

Check if two NSStrings are similar

Question: I present a tricky question that I am not sure how to approach. I have created a plist containing dictionaries, each of which contains two objects: the country name and the plug size of the country. There are only 210 countries/facts, though. I have also enabled searching through a list of many, many countries, for which there might or might not be a fact. But here is my problem: I am using a web service called GeoNames, and the user can use a search bar display controller to search for countries, and these

Wordnet Similarity in Java: JAWS, JWNL or Java WN::Similarity?

Question: I need to use WordNet in a Java-based app. I want to search synsets and find similarity/relatedness between synsets. My app uses RDF graphs, and I know there are SPARQL endpoints with WordNet, but I guess it's better to have a local copy of the dataset, as it's not too big. I've found the following jars: General library - JAWS http://lyle.smu.edu/~tspell/jaws/index.html; General library - JWNL http://sourceforge.net/projects/jwordnet; Similarity library (Perl) - Wordnet::similarity http://wn

Similarity Score - Levenshtein

Question: I implemented the Levenshtein algorithm in Java and am now getting the number of corrections made by the algorithm, a.k.a. the cost. This helps a little, but not much, since I want the results as a percentage. So I want to know how to calculate those similarity points. I would also like to know how you people do it and why.

Answer 1: The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being
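One common convention for turning the cost into a percentage (an assumption here, since the excerpt is cut off before the answer reaches that point) is to divide the distance d(a, b) by the length of the longer string:

```latex
\mathrm{similarity}(a, b) = \left(1 - \frac{d(a, b)}{\max(|a|, |b|)}\right) \times 100\%
```

For example, "kitten" and "sitting" have a distance of 3 and the longer string has 7 characters, so the similarity is (1 - 3/7) × 100% ≈ 57.1%. Because the distance can never exceed the length of the longer string, the result always falls between 0% and 100%.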

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times. Say the input matrix is:

A =
[0 1 0 0 1
 0 0 1 1 1
 1 1 0 1 0]

The sparse representation is:

A =
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it's straightforward to work with the matrix-input format:

    import numpy as np
    from sklearn.metrics import pairwise_distances
    from scipy.spatial.distance import cosine
    A = np.array(
    [[0, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0]])
    dist_out = 1-pairwise_distances(A,
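Working directly from the sparse (row, column) listing in the question could look like the sketch below. It is a dependency-free Python illustration, assuming a binary matrix with no duplicate entries; the function name `sparse_row_cosine` is hypothetical. It builds an inverted index from columns to rows, so only row pairs that actually share a column are ever visited.

```python
from collections import defaultdict
from math import sqrt


def sparse_row_cosine(entries):
    """Cosine similarity between rows of a binary matrix given as (row, col) pairs.

    Returns a dict mapping (i, j) with i < j to the similarity. Row pairs that
    share no column are omitted, since their similarity is 0.
    """
    rows_by_col = defaultdict(list)  # inverted index: column -> rows with a 1 there
    nnz = defaultdict(int)           # number of non-zeros per row (squared norm)
    for row, col in entries:
        rows_by_col[col].append(row)
        nnz[row] += 1

    dot = defaultdict(int)           # dot products of co-occurring row pairs
    for rows in rows_by_col.values():
        rows = sorted(rows)
        for a in range(len(rows)):
            for b in range(a + 1, len(rows)):
                dot[(rows[a], rows[b])] += 1

    return {pair: d / sqrt(nnz[pair[0]] * nnz[pair[1]]) for pair, d in dot.items()}


entries = [(0, 1), (0, 4), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 3)]
for pair, sim in sorted(sparse_row_cosine(entries).items()):
    print(pair, round(sim, 3))
```

This avoids touching pairs with no overlap, but if a single column is non-zero in most rows the work is still quadratic in the number of rows. To compare columns instead of rows, swap the two values of each entry.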

get cosine similarity between two documents in lucene

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index. For example, from a previously opened IndexReader ir I am getting the documents with ids 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4); How can I get the cosine similarity between these two documents? Thank you.

When indexing, there's an option to store term frequency vectors. At runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency
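Once term frequencies and document frequencies are in hand, the arithmetic might look like the following sketch. It is plain Python, not Lucene code: the dictionaries stand in for the term frequency vectors and document frequencies read from the index, the sample values are invented for illustration, and the weight tf * log(N / df) is one common TF-IDF form assumed here, not Lucene's own scoring formula.

```python
from math import log, sqrt


def tfidf_cosine(tf_a, tf_b, doc_freq, num_docs):
    """Cosine similarity of two documents given their term-frequency maps.

    tf_a, tf_b: term -> frequency in each document.
    doc_freq:   term -> number of documents containing the term.
    num_docs:   total number of documents in the index.
    """
    def weights(tf):
        # Assumed weighting: tf * log(N / df).
        return {t: f * log(num_docs / doc_freq[t]) for t, f in tf.items()}

    wa, wb = weights(tf_a), weights(tf_b)
    dot = sum(w * wb[t] for t, w in wa.items() if t in wb)
    norm_a = sqrt(sum(w * w for w in wa.values()))
    norm_b = sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no weighted terms is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical values standing in for data read from the index.
doc_freq = {"lucene": 2, "index": 5, "query": 4}
tf_d1 = {"lucene": 2, "index": 1}
tf_d2 = {"lucene": 1, "query": 3}
print(tfidf_cosine(tf_d1, tf_d2, doc_freq, num_docs=10))
```

Only terms present in both documents contribute to the dot product, so documents with no shared terms score 0 and a document compared with itself scores 1.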