How to compare almost similar Strings in Java? (String distance measure) [closed]

Submitted by 本秂侑毒 on 2019-11-27 17:31:24
Joey

The Levenshtein distance is a measure of how similar two strings are. More precisely, it is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.

The algorithm is available in pseudocode on Wikipedia. Converting that to Java shouldn't be much of a problem, but it's not built into the base class library.
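For reference, here is a minimal sketch of what such a conversion might look like, using the usual two-row dynamic-programming formulation. The class name LevenshteinSketch is made up for illustration, and the code compares UTF-16 char values, so characters outside the Basic Multilingual Plane count as two units.

```java
import java.util.Objects;

// Hypothetical helper class, not part of any library mentioned in this thread.
final class LevenshteinSketch {

    private LevenshteinSketch() {}

    /**
     * Returns the minimum number of single-character insertions,
     * deletions, and substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        Objects.requireNonNull(a, "a");
        Objects.requireNonNull(b, "b");

        // previous[j]: distance between the first (i - 1) chars of a and the first j chars of b
        // current[j]:  distance between the first i chars of a and the first j chars of b
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // Turning an empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into an empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitutionCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + substitutionCost;
                current[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two arrays instead of allocating a full matrix.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }

        // After the final swap, previous holds the last computed row.
        return previous[b.length()];
    }
}
```

Calling LevenshteinSketch.distance("kitten", "sitting") returns 3 (two substitutions and one insertion). Memory use is proportional to the length of the second string rather than the product of both lengths.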

Wikipedia has some more algorithms that measure similarity of strings.

The following Java libraries offer several comparison algorithms (Levenshtein, Jaro-Winkler, ...):

  1. Apache Commons Lang 3: https://commons.apache.org/proper/commons-lang/
  2. Simmetrics: http://sourceforge.net/projects/simmetrics/

Both libraries provide Javadoc documentation (Apache Commons Lang Javadoc, Simmetrics Javadoc).

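// Usage of Apache Commons Lang 3
// Note: this method is deprecated since Commons Lang 3.6 in favor of Apache Commons Text.
import org.apache.commons.lang3.StringUtils;

public double compareStrings(String stringA, String stringB) {
    // Despite the name, the result is a similarity score: higher means more similar.
    return StringUtils.getJaroWinklerDistance(stringA, stringB);
}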

// Usage of Simmetrics
import uk.ac.shef.wit.simmetrics.similaritymetrics.JaroWinkler;

public double compareStrings(String stringA, String stringB) {
    JaroWinkler algorithm = new JaroWinkler();
    return algorithm.getSimilarity(stringA, stringB);
}

Yeah, that's a good metric. You could use StringUtils.getLevenshteinDistance() from Apache Commons Lang.

You can find implementations of Levenshtein and other string similarity/distance measures on https://github.com/tdebatty/java-string-similarity

If your project uses Maven, installation is as simple as:

<dependency>
  <groupId>info.debatty</groupId>
  <artifactId>java-string-similarity</artifactId>
  <version>RELEASE</version>
</dependency>

Then, to use Levenshtein, for example:

import info.debatty.java.stringsimilarity.*;

public class MyApp {

  public static void main (String[] args) {
    Levenshtein l = new Levenshtein();

    System.out.println(l.distance("My string", "My $tring"));
  }
}

Shameless plug, but I also wrote a library:

https://github.com/vickumar1981/stringdistance

It has all of these functions, plus a few for phonetic similarity (whether one word "sounds like" another). Those return either true or false, unlike the other fuzzy similarity measures, which return numbers between 0 and 1.
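To illustrate where a score between 0 and 1 can come from: one common convention is to divide an edit distance by the length of the longer string and subtract the result from 1. The sketch below assumes that convention; the class and method names are hypothetical, and individual libraries (this one included) may normalize differently.

```java
// Hypothetical helper, shown only to illustrate one normalization convention.
final class NormalizedSimilaritySketch {

    private NormalizedSimilaritySketch() {}

    /**
     * Maps an edit distance to a similarity in [0.0, 1.0], where 1.0 means
     * identical strings and 0.0 means nothing in common.
     *
     * @param distance edit distance between the two strings (for example Levenshtein)
     * @param lengthA  length of the first string
     * @param lengthB  length of the second string
     */
    static double similarity(int distance, int lengthA, int lengthB) {
        int longest = Math.max(lengthA, lengthB);
        if (longest == 0) {
            // Two empty strings are treated as identical.
            return 1.0;
        }
        // A Levenshtein distance can never exceed the length of the longer string.
        if (distance < 0 || distance > longest) {
            throw new IllegalArgumentException("distance out of range: " + distance);
        }
        return 1.0 - (double) distance / longest;
    }
}
```

For "My string" and "My $tring" (distance 1, both of length 9), this gives 1 - 1/9, roughly 0.89.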

It also includes DNA sequence alignment algorithms such as Smith-Waterman and Needleman-Wunsch, which are generalized versions of Levenshtein.
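To make the "generalized Levenshtein" point concrete, here is a rough sketch of Needleman-Wunsch global alignment scoring with a configurable match score, mismatch score, and linear gap penalty. It is not taken from the library above; the class name and parameters are assumptions for illustration. With match = 0, mismatch = -1, and gap = -1, the score is the negated Levenshtein distance.

```java
// Hypothetical sketch of Needleman-Wunsch scoring, not the library's implementation.
final class NeedlemanWunschSketch {

    private NeedlemanWunschSketch() {}

    /**
     * Returns the best global alignment score of a and b.
     *
     * @param match    score added when two aligned characters are equal
     * @param mismatch score added when two aligned characters differ
     * @param gap      score added for each character aligned against a gap
     */
    static int score(String a, String b, int match, int mismatch, int gap) {
        // s[i][j]: best score for aligning the first i chars of a with the first j chars of b
        int[][] s = new int[a.length() + 1][b.length() + 1];

        // Aligning a prefix against the empty string costs one gap per character.
        for (int i = 1; i <= a.length(); i++) {
            s[i][0] = i * gap;
        }
        for (int j = 1; j <= b.length(); j++) {
            s[0][j] = j * gap;
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int pair = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diagonal = s[i - 1][j - 1] + pair; // align a[i-1] with b[j-1]
                int up = s[i - 1][j] + gap;            // a[i-1] against a gap
                int left = s[i][j - 1] + gap;          // b[j-1] against a gap
                s[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        return s[a.length()][b.length()];
    }
}
```

For example, NeedlemanWunschSketch.score("kitten", "sitting", 0, -1, -1) returns -3, matching a Levenshtein distance of 3. Other scoring schemes reward matches and weight gaps differently, which is what makes the alignment variant more general.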

In the near future, I plan to make this work with any array, not just strings (arrays of characters).
