string-comparison

How do I convert between a measure of similarity and a measure of difference (distance)?

随声附和 submitted on 2019-12-03 03:07:28
Is there a general way to convert between a measure of similarity and a measure of distance? Consider a similarity measure like the number of 2-grams that two strings have in common: 2-grams('beta', 'delta') = 1 and 2-grams('apple', 'dappled') = 4. What if I need to feed this to an optimization algorithm that expects a measure of difference, like Levenshtein distance? This is just an example; I'm looking for a general solution, if one exists. For instance, how do I go from Levenshtein distance to a measure of similarity? I appreciate any guidance you may offer. Let d denote distance and s denote similarity.
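To make the question concrete, here is a minimal Python sketch of the 2-gram count and two common conversions between a similarity s and a distance d. The function names are hypothetical, and the mappings d = 1 - s (for s bounded in [0, 1]) and s = 1 / (1 + d) (for unbounded d) are conventional choices rather than the only valid ones; neither guarantees that the result is a true metric.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of overlapping 2-grams in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def shared_bigrams(a, b):
    """Raw similarity: number of 2-grams the two strings have in common."""
    return sum((bigrams(a) & bigrams(b)).values())


def dice_similarity(a, b):
    """Normalize the raw count into [0, 1] (Dice coefficient on 2-grams)."""
    total = sum(bigrams(a).values()) + sum(bigrams(b).values())
    return 2 * shared_bigrams(a, b) / total if total else 1.0


def similarity_to_distance(s):
    """Assumes s is in [0, 1]; s = 1 (identical) maps to distance 0."""
    return 1.0 - s


def distance_to_similarity(d):
    """Assumes d >= 0 and unbounded (e.g. Levenshtein); d = 0 maps to s = 1."""
    return 1.0 / (1.0 + d)


print(shared_bigrams("beta", "delta"))     # 1
print(shared_bigrams("apple", "dappled"))  # 4
```

The raw count is unbounded, so it is normalized first; without a bound on s there is no natural value to subtract it from.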

Is there any way to sort strings in all languages?

瘦欲@ submitted on 2019-12-03 02:28:32
I have this code. It sorts correctly in French and Russian. I used Locale.US and it seems to be right. Does this solution work correctly for all languages, for example Chinese, Korean, or Japanese? If not, what is a better solution? public class CollationTest { public static void main(final String[] args) { final Collator collator = Collator.getInstance(Locale.US); final SortedSet<String> set = new TreeSet<String>(collator); set.add("abîmer"); set.add("abîmé"); set.add("aberrer"); set.add("abhorrer"); set.add("aberrance"); set.add("abécédaire"); set.add(

Regex to compare a string and see where the difference is

好久不见. submitted on 2019-12-02 22:26:13
Question: I am creating a regex to check whether the copyright info at the top of all documents is formatted correctly. The copyright notice is long, so my regex is long too. Let's say the copyright info looks like: /*///////////////////////////////////////////////////////////////////////// Copyright content which is a lot goes in here. Programmer: Tono Nam /////////////////////////////////////////////////////////////////////////*/ Then I will use the regex: var pattern = @"/\*////////////////////////////

Is it possible to compare rows for similar data in SQL Server

和自甴很熟 submitted on 2019-12-02 21:12:12
Question: Is it possible to compare rows for similar data in SQL Server? I have a company name column in a table where company names could be somewhat similar. Here is an example of 8 different values that represent the same 4 companies: ANDORRA WOODS ANDORRA WOODS HEALTHCARE CENTER ABC HEALTHCARE, JOB #31181 ABC HEALTHCARE, JOB #31251 ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT COMPANY APEX SYSTEMS APEX SYSTEMS, INC The way I clean

What are some good methods to find the “relatedness” of two bodies of text?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-02 19:45:58
Here's the problem -- I have a few thousand small text snippets, anywhere from a few words to a few sentences - the largest snippet is about 2k on disk. I want to be able to compare each to each, and calculate a relatedness factor so that I can show users related information. What are some good ways to do this? Are there known algorithms for doing this that are any good, are there any GPL'd solutions, etc? I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime. I just thought I would ask the Stack Overflow community
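One widely used baseline for this kind of precomputed relatedness is TF-IDF weighting with cosine similarity. The sketch below is a minimal pure-Python illustration under stated assumptions: the tokenizer is a simple lowercase alphanumeric split, the IDF formula is a smoothed variant, and all function names are hypothetical. It is not necessarily what the answers to this question recommended.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many snippets each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(vectors, index, top_n=3):
    """Return (score, other_index) pairs for the snippets closest to `index`."""
    scores = [(cosine(vectors[index], vec), i)
              for i, vec in enumerate(vectors) if i != index]
    return sorted(scores, reverse=True)[:top_n]
```

With a few thousand snippets, comparing each to each is a few million cosine computations, which fits the stated constraint that everything can be precalculated.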

C# loop until Console.ReadLine = 'y' or 'n'

我与影子孤独终老i submitted on 2019-12-02 18:46:19
Question: I'm fairly new to C# and am writing a simple console app as practice. I want the application to ask a question and only progress to the next piece of code when the user input equals 'y' or 'n'. Here's what I have so far. static void Main(string[] args) { string userInput; do { Console.WriteLine("Type something: "); userInput = Console.ReadLine(); } while (string.IsNullOrEmpty(userInput)); Console.WriteLine("You typed " + userInput); Console.ReadLine(); string wantCount; do { Console.WriteLine(

Regex to compare a string and see where the difference is

随声附和 submitted on 2019-12-02 11:44:40
I am creating a regex to check whether the copyright info at the top of all documents is formatted correctly. The copyright notice is long, so my regex is long too. Let's say the copyright info looks like: /*///////////////////////////////////////////////////////////////////////// Copyright content which is a lot goes in here. Programmer: Tono Nam /////////////////////////////////////////////////////////////////////////*/ Then I will use the regex: var pattern = @"/\*///////////////////////////////////////////////////////////////////////// Copyright content which is a lot goes in here. Programmer
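A regex match only reports success or failure, not where a long pattern stopped matching. One possible alternative, sketched here in Python as an assumption rather than as the approach the answers took, is to compare the header against the expected literal text and report the first position where they diverge. The function names and the `context` parameter are hypothetical.

```python
def first_difference(expected, actual):
    """Return the index of the first differing character, or -1 if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return -1


def describe_difference(expected, actual, context=10):
    """Build a short human-readable report of where the texts diverge."""
    i = first_difference(expected, actual)
    if i < 0:
        return "identical"
    line = expected.count("\n", 0, i) + 1  # 1-based line number in expected
    return (f"first difference at index {i} (line {line}): "
            f"expected {expected[i:i + context]!r}, "
            f"got {actual[i:i + context]!r}")
```

This only works when the header is fixed text; parts that legitimately vary, such as the programmer name, would need to be masked or checked separately.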

Is it possible to compare rows for similar data in SQL Server

蓝咒 submitted on 2019-12-02 10:08:42
Is it possible to compare rows for similar data in SQL Server? I have a company name column in a table where company names could be somewhat similar. Here is an example of 8 different values that represent the same 4 companies: ANDORRA WOODS ANDORRA WOODS HEALTHCARE CENTER ABC HEALTHCARE, JOB #31181 ABC HEALTHCARE, JOB #31251 ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT COMPANY APEX SYSTEMS APEX SYSTEMS, INC The way I clean it right now is using Google refine where I can identify clusters of similar data values and make them
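The clustering idea can be sketched outside the database. The Python example below groups names greedily using `difflib.SequenceMatcher` from the standard library; the `cluster_names` function and the 0.55 threshold are assumptions picked to suit the eight sample values, not a recommended setting for real data, and this is not the algorithm Google Refine uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.55):
    """Greedy clustering: add each name to the first cluster whose
    representative (its first member) is similar enough, else start a new one."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


companies = [
    "ANDORRA WOODS",
    "ANDORRA WOODS HEALTHCARE CENTER",
    "ABC HEALTHCARE, JOB #31181",
    "ABC HEALTHCARE, JOB #31251",
    "ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT",
    "ACTION SERVICE SALES, A SUBSIDIARY OF SINGER EQUIPMENT COMPANY",
    "APEX SYSTEMS",
    "APEX SYSTEMS, INC",
]

for group in cluster_names(companies):
    print(group)
```

Greedy clustering depends on input order and on the threshold, so results on a real table should be reviewed by hand before any rows are merged.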

C# loop until Console.ReadLine = 'y' or 'n'

别说谁变了你拦得住时间么 submitted on 2019-12-02 09:41:54
I'm fairly new to C# and am writing a simple console app as practice. I want the application to ask a question and only progress to the next piece of code when the user input equals 'y' or 'n'. Here's what I have so far. static void Main(string[] args) { string userInput; do { Console.WriteLine("Type something: "); userInput = Console.ReadLine(); } while (string.IsNullOrEmpty(userInput)); Console.WriteLine("You typed " + userInput); Console.ReadLine(); string wantCount; do { Console.WriteLine("Do you want me to count the characters present? Yes (y) or No (n): "); wantCount = Console.ReadLine();
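The loop being asked for has the same shape in any language: read, normalize, and repeat until the answer is one of the two accepted values. Here is a minimal sketch in Python rather than C#, with a hypothetical `ask_yes_no` helper; the `read` and `write` parameters exist only so the loop can be exercised without a real console.

```python
def ask_yes_no(prompt, read=input, write=print):
    """Keep asking until the answer is 'y' or 'n'; return True for 'y'."""
    while True:
        write(prompt)
        answer = read().strip().lower()  # accept "Y", " n ", and so on
        if answer in ("y", "n"):
            return answer == "y"
        write("Please type y or n.")


# Example use at a real console (left as a comment so nothing blocks on input):
# if ask_yes_no("Do you want me to count the characters present? (y/n): "):
#     print(len(user_input))
```

Comparing against a normalized string is what lets empty input and any other text fall through to another iteration.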

How can I use Jaro-Winkler to find the closest value in a table?

丶灬走出姿态 submitted on 2019-12-02 06:22:19
Question: I have an implementation of the Jaro-Winkler algorithm in my database. I did not write this function. It compares two values and returns the probability of a match, so jaro(string1, string2, matchnoofchars) will return a result. Instead of comparing two strings, I want to send one string with a matchnoofchars and get back a result set of the values whose match probability is higher than 95%. For example, the current function returns 97.62% for jaro("Philadelphia","Philadelphlaa",9). I wish to tweak
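The general shape of the solution is to score every candidate against the one input string and keep the rows above the threshold. The Python sketch below uses a textbook Jaro-Winkler (prefix scale 0.1, prefix capped at 4 characters), which is an assumption: it has no matchnoofchars parameter and will not necessarily reproduce the 97.62% returned by the database function in the question. The `matches_above` helper is hypothetical.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1.0 - score)


def matches_above(query, candidates, threshold=0.95):
    """Return (score, candidate) pairs at or above threshold, best first."""
    scored = ((jaro_winkler(query, c), c) for c in candidates)
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

In a database the equivalent is to call the existing function once per row against the input string and filter on its result, which scans the whole table for every lookup.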