similarity

WordNet Similarity in Java: JAWS, JWNL or Java WN::Similarity?

Posted 2019-11-27 20:53:38
I need to use WordNet in a Java-based app. I want to:

- search synsets
- find similarity/relatedness between synsets

My app uses RDF graphs, and I know there are SPARQL endpoints with WordNet, but I guess it's better to have a local copy of the dataset, as it's not too big. I've found the following jars:

- General library: JAWS (http://lyle.smu.edu/~tspell/jaws/index.html)
- General library: JWNL (http://sourceforge.net/projects/jwordnet)
- Similarity library (Perl): WordNet::Similarity (http://wn-similarity.sourceforge.net/)
- Java version of WordNet::Similarity (beta): http://www.cogs.susx.ac.uk/users/drh21/
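
The Java libraries above are the question's actual candidates; purely to illustrate the two operations being asked for (synset lookup and synset-to-synset similarity), here is what they look like in Python's NLTK WordNet interface:

    from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

    # 1. Search synsets for a word.
    synsets = wn.synsets('car')
    print(synsets[0].definition())

    # 2. Similarity/relatedness between two synsets (path-based measure).
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print(dog.path_similarity(cat))  # ~0.2 on WordNet 3.x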

Similarity Score - Levenshtein

Posted 2019-11-27 20:34:22
I implemented the Levenshtein algorithm in Java and now get the corrections made by the algorithm, a.k.a. the cost. This helps a little, but not much, since I want the result as a percentage. So I want to know how to calculate those similarity points. I would also like to know how you people do it, and why.

Ralph: The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character (Wikipedia). So a Levenshtein distance of…
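
A common way to turn the raw edit distance into a percentage is to normalize by the length of the longer string. A minimal sketch (the levenshtein function below is a generic dynamic-programming implementation, not the asker's Java code):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        or substitutions needed to turn a into b (two-row DP)."""
        if len(a) < len(b):
            a, b = b, a  # keep a as the longer string
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def similarity_percent(a: str, b: str) -> float:
        """1.0 means identical, 0.0 means nothing in common."""
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    print(similarity_percent("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.571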

Cosine similarity vs Hamming distance [closed]

Posted 2019-11-27 17:45:12
To compute the similarity between two documents, I create a feature vector containing the term frequencies. But then, for the next step, I can't decide between "cosine similarity" and "Hamming distance". My question: do you have experience with these algorithms? Which one gives you better results? In addition to that: could you tell me how to code the cosine similarity in PHP? For Hamming distance, I've already got the code:

    function check ($terms1, $terms2) {
        $counts1 = array_count_values($terms1);
        $totalScore = 0;
        foreach ($terms2 as $term) {
            if (isset($counts1[$term]))
                $totalScore +=…
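
For reference, cosine similarity between two term-frequency vectors is the dot product divided by the product of the vector norms. A sketch in Python rather than the requested PHP; the arithmetic carries over directly:

    import math
    from collections import Counter

    def cosine_similarity(terms1, terms2):
        """Cosine of the angle between the term-frequency vectors
        built from two lists of terms; result is in [0, 1]."""
        v1, v2 = Counter(terms1), Counter(terms2)
        dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return dot / (norm1 * norm2)

    print(cosine_similarity("a b c a".split(), "a b d".split()))  # ≈ 0.707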

Python: Semantic similarity score for Strings [duplicate]

Posted 2019-11-27 17:20:24
This question already has an answer here: How to compute the similarity between two text documents? (8 answers)

Are there any libraries for computing semantic similarity scores for a pair of sentences? I'm aware of WordNet's semantic database, and how I can generate the score for two words, but I'm looking for libraries that do all the pre-processing tasks, like Porter stemming and stop-word removal, on whole sentences, and output a score for how related the two sentences are. I found a work in progress written using the .NET framework that computes the score using an array of pre-processing…
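
One common baseline that bundles tokenization, stop-word removal, and weighting is TF-IDF plus cosine similarity, e.g. with scikit-learn (a sketch; note this measures lexical overlap, not WordNet-style semantic relatedness):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = ["The cat sat on the mat.",
                 "A cat was sitting on a mat."]

    # Build TF-IDF vectors; stop_words='english' drops common function words.
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)

    # Similarity between the two sentence vectors, in [0, 1].
    score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
    print(score)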

Selecting close matches from one array based on another reference array

Posted 2019-11-27 09:36:50
I have an array A and a reference array B. The size of A is at least as big as that of B, e.g.

    A = [2, 100, 300, 793, 1300, 1500, 1810, 2400]
    B = [4, 305, 789, 1234, 1890]

B is in fact the positions of peaks in a signal at a specified time, and A contains the positions of peaks at a later time. But some of the elements in A are actually not the peaks I want (perhaps due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B; in the example given above, the 'real' ones in A should be A' = [2, 300, 793, 1300, 1810]. It should be obvious in this example…
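
One straightforward reading of the problem: for each reference peak in B, keep the element of A nearest to it. A sketch, assuming each peak in B has exactly one true counterpart in A:

    import numpy as np

    A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
    B = np.array([4, 305, 789, 1234, 1890])

    # Pairwise absolute differences, shape (len(A), len(B)); for each
    # column (each b in B), the argmin row is the closest element of A.
    idx = np.abs(A[:, None] - B[None, :]).argmin(axis=0)
    A_real = A[idx]
    print(A_real)  # [   2  300  793 1300 1810]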

Solr Custom Similarity

Posted 2019-11-27 09:18:10
I want to set my own custom similarity in my Solr schema.xml, but I have a few problems understanding this feature. I want to completely deactivate Solr scoring (tf, idf, coord, and fieldNorm), and I don't know where to start. Things I know: I have to write my own DefaultSimilarity implementation, override the tf, idf, coord, and fieldNorm methods, and load the class in schema.xml. Where do I store the class? Are there any working examples on the web? I can't find one! Thanks.

I figured it out on my own. I have stored my own implementation of DefaultSimilarity under the /dist/ folder in Solr. Then I add <lib…
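
For reference, the class is registered globally in schema.xml with a <similarity> element (a sketch; my.package.NoScoreSimilarity is a hypothetical class name, and the jar containing it must be made visible to Solr, e.g. via a <lib> directive as the answer describes):

    <!-- schema.xml: replace the default scoring with a custom Similarity -->
    <similarity class="my.package.NoScoreSimilarity"/>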

Word comparison algorithm

Posted 2019-11-27 07:19:48
I am writing a CSV import tool for the project I'm working on. The client needs to be able to enter the data in Excel, export it as CSV, and upload it to the database. For example, I have this CSV record:

    1, John Doe, ACME Comapny

(the typo is on purpose). Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. The comparison should return 0 if the strings are exactly the same, and return some value that…
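
Python's standard library ships a ready-made fuzzy lookup for exactly this kind of matching (a sketch; the companies list stands in for names loaded from the database table):

    import difflib

    companies = ["ACME Company", "Globex Corporation", "Initech"]

    # Returns the best fuzzy matches above the cutoff ratio (0..1),
    # or an empty list if nothing is close enough.
    match = difflib.get_close_matches("ACME Comapny", companies, n=1, cutoff=0.8)
    print(match)  # ['ACME Company']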

String similarity with Python + Sqlite (Levenshtein distance / edit distance)

Posted 2019-11-27 05:30:36
Is there a string similarity measure available in Python + SQLite, for example with the sqlite3 module? Example use case:

    import sqlite3
    conn = sqlite3.connect(':memory:')
    c = conn.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
    c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

This query should match the row with ID 1, but not the row with ID 2:

    c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

How to do this in SQLite + Python? Notes about what I've…
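
SQLite allows registering a Python callable as a SQL function, which is one way to provide the dist used in the query (a sketch; levenshtein here is a generic pure-Python edit distance, and any faster implementation could be substituted):

    import sqlite3

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    conn = sqlite3.connect(':memory:')
    # Expose the Python function to SQL as dist(a, b).
    conn.create_function('dist', 2, levenshtein)

    c = conn.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
    c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
    rows = c.execute("SELECT * FROM mytable "
                     "WHERE dist(description, 'He lo wrold gyus') < 6").fetchall()
    print(rows)  # row 1 is the intended match; tune the threshold to your data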

Mahalanobis distance in R, error: system is computationally singular

Posted 2019-11-27 03:20:42
Question: I'd like to calculate the multivariate distance from a set of points to the centroid of those points. Mahalanobis distance seems to be suited for this. However, I get an error (see below). Can anyone tell me why I am getting this error, and whether there is a way to work around it? If you download the coordinate data and the associated environmental data, you can run the following code:

    require(maptools)
    occ <- readShapeSpatial('occurrences.shp')
    load('envDat.Rdata')
    # standardize the data to scale the…
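
The "computationally singular" error usually means the covariance matrix cannot be inverted, for example because some variables are nearly collinear or there are more variables than points. A common workaround is a Moore-Penrose pseudo-inverse; a sketch of the same computation in Python/NumPy on synthetic data (in R, MASS::ginv plays the same role):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    X = np.column_stack([X, X[:, 0] * 2])  # collinear column -> singular covariance

    centroid = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)

    # np.linalg.inv(cov) would fail (or be unstable) here; the
    # pseudo-inverse handles the rank-deficient case.
    cov_pinv = np.linalg.pinv(cov)

    diff = X - centroid
    # Squared Mahalanobis distance of each point to the centroid.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_pinv, diff)
    print(np.sqrt(d2)[:5])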