similarity | 易学教程

String similarity with Python + Sqlite (Levenshtein distance / edit distance)

阅读更多关于 String similarity with Python + Sqlite (Levenshtein distance / edit distance)

问题 Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module? Example of use case: import sqlite3 conn = sqlite3.connect(':memory:') c = conn.cursor() c.execute('CREATE TABLE mytable (id integer, description text)') c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")') c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")') This query should match the row with ID 1, but not the row with ID 2: c.execute('SELECT * FROM mytable

Comparing strings with tolerance

阅读更多关于 Comparing strings with tolerance

问题 I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on. Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this. 回答1: You could use the Levenshtein Distance algorithm. "The Levenshtein distance

Java library to compare image similarity [closed]

阅读更多关于 Java library to compare image similarity [closed]

问题 Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow. Closed 4 years ago . I spent quite some time researching for a library that allows me to compare images to one another in Java. I didn't really find anything useful, maybe my GoogleSearch-skill isn't high enough so I thought I'd ask you guys if you could point me into a direction of where I could find something like this. Basically

How to calculate distance similarity measure of given 2 strings?

阅读更多关于 How to calculate distance similarity measure of given 2 strings?

问题 I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example: The real word: hospital Mistaken word: haspita Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%. How can I achieve this with performance being a

Cluster similar curves considering “belongingness”?

阅读更多关于 Cluster similar curves considering “belongingness”?

问题 Currently, I have 6 curves shown in 6 different colors as below. The 6 curves are in fact generated by 6 trials of one same experiment . That means, ideally they should be the same curve, but due to the noise and different trial participants, they just look similar but not exactly the same. Now I wish to create an algorithm that is able to identify that the 6 curves are essentially the same and cluster them together into one cluster. What similarity metrics should I use? Note: The x-axis does

How to calculate similarity with google-diff-match-patch C# library?

阅读更多关于 How to calculate similarity with google-diff-match-patch C# library?

问题 I use google-diff-match-patch C# library. I want to measure the similarity between two texts. To do this I make this C# code : List<DiffMatchPatch.Diff> lDiffs = dmpDiff.diff_main(sTexte1, sTexte2); int iIndex = dmpDiff.diff_levenshtein(lDiffs); double dsimilarity = 100 - ((double)iIndex / Math.Max(sTexte1.Length, sTexte2.Length) * 100); With similarity values between 0 - 100 (0 == perfect match - 100 == totaly different). Do you think this is a good approach, that this calculation is correct

Comparing similarity between multiple strings with a random starting point

阅读更多关于 Comparing similarity between multiple strings with a random starting point

问题 I have a bunch of people names that are tied to their respective Identifying Numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication though, one Identity Number can have upto 100 names which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M@rrrrryy Richard etc etc. Some typos but some totally different names. Initially, I want to display only 3 (or a similar small number) of the names

How to find that one text is similar to the part of another?

阅读更多关于 How to find that one text is similar to the part of another?

问题 We know how to make an assessment of the similarity of two whole texts for example by Word Mover’s Distance. How to find piece inside one text that is similar to another text? 回答1: You could break the text into chunks – ideally by natural groupings, like sentences or paragraphs – then do pairwise comparisons of every chunk against every other, using some text-distance measure. Word Mover's Distance can give impressive results, but it quite slow/expensive to calculate, especially for large

Best way to pull similar items from a MySQL database

阅读更多关于 Best way to pull similar items from a MySQL database

问题 I have a database of items (articles for that matter). What I'd like to do, is I'd like to pull X items that are similar to a specific item, based on two things - title, which is the title of the article, and tags, which are located in another table. The structure is as follows (relevant fields only): Table: article Fields: articleid, title Table: tag Fields: tagid, tagtext Table: articletag Fields: tagid, articleid What would be the best way to do this? 回答1: For myisam tables you can use

Find semantically similar word for natural language processing

阅读更多关于 Find semantically similar word for natural language processing

问题 I am working on a natural language processing project in Java. I have a requirement where I want to identify words that belong to similar semantic groups. e.g. : if the words such as study , university , graduate , attend are found I want them to be categorized as being related to education. If words such as golfer , batsman , athlete are found, it should categorize all under a parent like sportsperson. Is there a way I can achieve this task without using and training approach. Is there some