Improving search result using Levenshtein distance in Java

南方客 2021-01-31 03:08

I have the following working Java code that searches for a word in a list of words, and it works as expected:

public class Levenshtein {
    privat         
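
The listing above is cut off in this copy, so the asker's actual code is not shown. Purely as an illustration of the kind of search being described (this is not the original code), a Levenshtein-based lookup over a word list might look roughly like the sketch below. The class name `LevenshteinSearch`, the method names `distance` and `rank`, and the sample word list are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch; NOT the asker's truncated Levenshtein class.
class LevenshteinSearch {

  // Edit distance via dynamic programming, keeping only two rows of the table.
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning "" into b[0..j) takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // turning a[0..i) into "" takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(
            Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
            prev[j - 1] + cost);                    // substitution or match
      }
      int[] tmp = prev; // reuse the two rows instead of allocating new ones
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Returns the words ordered from closest to farthest from the target.
  static List<String> rank(String target, List<String> words) {
    return words.stream()
        .sorted(Comparator.comparingInt((String w) -> distance(target, w)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("cat", "building", "girl", "gear");
    System.out.println(rank("gir", words));
  }
}
```

With this sample list, `girl` ranks first for the query `gir` because it is a single insertion away. A ranking like this only measures spelling closeness, not meaning.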


        
5 answers
  •  野性不改
    2021-01-31 03:52

    Since you asked, I'll show how the UMBC semantic network does at this kind of thing. I'm not sure it's what you really want:

    import static java.lang.String.format;
    import static java.util.function.Function.identity;
    import static java.util.stream.Collectors.toMap;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.regex.Pattern;
    
    public class SemanticSimilarity {
      private static final String GET_URL_FORMAT
          = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
              + "operation=api&phrase1=%s&phrase2=%s";
      // Only plain word characters are accepted, so the phrases need no URL encoding.
      private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\\w+");
      private static final String[] DICT = {
        "cat",
        "building",
        "girl",
        "ranch",
        "drawing",
        "wool",
        "gear",
        "question",
        "information",
        "tank"
      };
    
      // Performs a GET request and returns the first line of the response body.
      public static String httpGetLine(String urlToRead) throws IOException {
        URL url = new URL(urlToRead);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
          return reader.readLine();
        }
      }
    
      // Asks the service for the similarity of two words; returns -1.0 on any failure.
      public static double getSimilarity(String a, String b) {
        if (!VALID_WORD_PATTERN.matcher(a).matches()
            || !VALID_WORD_PATTERN.matcher(b).matches()) {
          throw new RuntimeException("Bad word");
        }
        try {
          String line = httpGetLine(format(GET_URL_FORMAT, a, b));
          // An empty response gives null, which parseDouble rejects with a NullPointerException.
          return line == null ? -1.0 : Double.parseDouble(line);
        } catch (IOException | NumberFormatException ex) {
          return -1.0;
        }
      }
    
      // Scores every dictionary word against the target and prints them, best match first.
      public static void test(String target) {
        System.out.println("Target: " + target);
        Arrays.stream(DICT)
            .collect(toMap(identity(), word -> getSimilarity(target, word)))
            .entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .forEach(System.out::println);
        System.out.println();
      }
    
      public static void main(String[] args) {
        test("sheep");
        test("vehicle");
        test("house");
        test("data");
        test("girlfriend");
      }
    }
    

    The results are kind of fascinating:

    Target: sheep
    ranch=0.38563728
    cat=0.37816614
    wool=0.36558008
    question=0.047607
    girl=0.0388761
    information=0.027191084
    drawing=0.0039623436
    tank=0.0
    building=0.0
    gear=0.0
    
    Target: vehicle
    tank=0.65860236
    gear=0.2673374
    building=0.20197356
    cat=0.06057514
    information=0.041832563
    ranch=0.017701812
    question=0.017145569
    girl=0.010708235
    wool=0.0
    drawing=0.0
    
    Target: house
    building=1.0
    ranch=0.104496084
    tank=0.103863
    wool=0.059761923
    girl=0.056549154
    drawing=0.04310725
    cat=0.0418914
    gear=0.026439993
    information=0.020329408
    question=0.0012588014
    
    Target: data
    information=0.9924584
    question=0.03476312
    gear=0.029112043
    wool=0.019744944
    tank=0.014537057
    drawing=0.013742204
    ranch=0.0
    cat=0.0
    girl=0.0
    building=0.0
    
    Target: girlfriend
    girl=0.70060706
    ranch=0.11062875
    cat=0.09766617
    gear=0.04835723
    information=0.02449007
    wool=0.0
    question=0.0
    drawing=0.0
    tank=0.0
    building=0.0
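
Each list above is the dictionary sorted by score, highest first. That ranking step does not depend on the web service itself, so it can be separated from the HTTP call and tried offline. A minimal sketch, assuming a hypothetical `SimilarityRanker` class and a crude stand-in scorer (`standInScore`, based only on shared prefixes) in place of the UMBC request:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper; the scorer is passed in so the service call can be swapped out.
class SimilarityRanker {

  // Scores each word against the target and returns the entries, highest score first.
  static Map<String, Double> rank(String target, List<String> dict,
      ToDoubleBiFunction<String, String> scorer) {
    Map<String, Double> scores = new LinkedHashMap<>();
    for (String word : dict) {
      scores.put(word, scorer.applyAsDouble(target, word));
    }
    // LinkedHashMap keeps insertion order, so the sorted order survives in the result.
    Map<String, Double> ranked = new LinkedHashMap<>();
    scores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .forEachOrdered(e -> ranked.put(e.getKey(), e.getValue()));
    return ranked;
  }

  // Stand-in for the remote service (an assumption for this sketch only):
  // 1.0 for identical words, 0.5 if one word is a prefix of the other, else 0.0.
  static double standInScore(String a, String b) {
    if (a.equals(b)) {
      return 1.0;
    }
    return (a.startsWith(b) || b.startsWith(a)) ? 0.5 : 0.0;
  }

  public static void main(String[] args) {
    List<String> dict = Arrays.asList("cat", "girl", "girlfriend");
    rank("girlfriend", dict, SimilarityRanker::standInScore)
        .forEach((word, score) -> System.out.println(word + "=" + score));
  }
}
```

Passing `SemanticSimilarity::getSimilarity` as the scorer instead would reproduce the service-backed ranking, while the stand-in keeps the example runnable without network access.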
    
