Matching inexact company names in Java

Submitted by 本秂侑毒 on 2019-12-21 09:24:29

Question


I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.

For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".

Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.

The number of companies is small enough that I can keep a map of their names in memory, i.e., I have the option of using Java rather than SQL to find the right name.

How would you do this in Java?


Answer 1:


You could standardize the formats as much as possible in both your DB/map and the input (e.g. convert everything to upper- or lowercase), then use the Levenshtein (edit) distance, a metric computed with dynamic programming, to score the input against all your known names.

You could then have the user confirm the match and, if they don't like it, give them the option to enter that value into your list of known names (on second thought, that might be too much power to give a user...).
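
The normalize-then-score idea in this answer can be sketched as follows. This is a minimal illustration, not a tuned matcher: the class name `NameDistance`, the choice to strip every non-alphanumeric character, and the `closest` helper are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of any library.
final class NameDistance {

    /** Lower-cases the name and drops everything except ASCII letters and digits. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Levenshtein distance via dynamic programming, keeping only two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the known name closest to the input, or null if the list is empty. */
    static String closest(String input, List<String> knownNames) {
        String key = normalize(input);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownNames) {
            int d = levenshtein(key, normalize(candidate));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

Because `closest` always returns some name, a real application would probably also reject results whose distance exceeds a threshold, since the question stresses avoiding false matches.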




Answer 2:


Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:

https://code.google.com/p/java-similarities/

If you don't want to spend ages implementing string distance algorithms, I recommend trying it as a first step. About 20 algorithms are already implemented (including Levenshtein, Jaro-Winkler, and Monge-Elkan), and the code is structured well enough that you don't have to understand all of the logic in depth; you can start using it in minutes.

(BTW, I'm not the author of the library, so kudos to its creators.)




Answer 3:


You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

  • LCS code
  • Example usage (guessing a category based on what people entered)



Answer 4:


I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
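
A rough sketch of that approach, reading LCS as longest common subsequence. The class name `LcsNameScore`, the noise-word list, and the decision to normalize the score by the longer string's length are assumptions for illustration only; in particular, treating "and" as noise may be too aggressive for some data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class for illustration; not part of any library.
final class LcsNameScore {

    // Assumed list of low-value tokens; extend it to suit your data.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "co", "company", "ltd", "limited", "llc", "inc", "incorporated", "and"));

    /** Lower-cases, splits on non-alphanumerics, drops noise tokens, joins the rest. */
    static String canonical(String name) {
        StringBuilder sb = new StringBuilder();
        for (String token : name.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !NOISE.contains(token)) {
                sb.append(token);
            }
        }
        return sb.toString();
    }

    /** Length of the longest common subsequence, via the standard DP table. */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Similarity in [0, 1]: LCS length divided by the longer canonical length. */
    static double score(String x, String y) {
        String a = canonical(x);
        String b = canonical(y);
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) lcsLength(a, b) / longest;
    }
}
```

With this noise list, all four spellings from the question ("A. B. Widgets & Co Ltd.", "AB Widgets Limited", "A.B. Widgets and Co", "A B Widgets") reduce to the same canonical string, so they score 1.0 against each other.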




Answer 5:


Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.




Answer 6:


Your database may support regular expressions (regex). Here is the MySQL documentation as an example; some Java tutorials are listed further below:

http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp

You would probably want to store in the database a fairly complex regular expression for each company, one that encompasses the variations in spelling you anticipate, or the sub-elements of the company name that you would like to weight as significant.

You can also use the regex library in Java:

JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html

Using Regular Expressions in Java
http://www.regular-expressions.info/java.html

The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
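
As a minimal sketch of the per-company pattern idea using `java.util.regex`: the class `RegexCompanyMatcher`, the constant `WIDGETS_PATTERN`, and the list of tolerated suffix words inside it are all hypothetical, written here for the "A. B. Widgets" example from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
final class RegexCompanyMatcher {

    // Assumed pattern for one company: the significant words "a", "b", "widgets"
    // separated by optional punctuation or spaces, followed only by a
    // whitelisted set of low-value suffix words.
    static final String WIDGETS_PATTERN =
            "a\\W*b\\W*widgets(\\W+(and|&|co|company|ltd|limited|inc))*\\W*";

    // Maps the canonical company name to its compiled pattern.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    void register(String canonicalName, String regex) {
        patterns.put(canonicalName, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    /** Returns the first registered company whose pattern matches the whole input. */
    Optional<String> match(String input) {
        String trimmed = input.trim();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(trimmed).matches()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Using `matches()` rather than `find()` requires the pattern to cover the entire input, which keeps a name such as "AB Widgets Holdings" from matching by accident. The trade-off is that every pattern has to be written and maintained by hand.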

You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex




Answer 7:


You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

* LCS code
* Example usage (guessing a category based on what people entered)

To be more precise: Longest Common Substring should give better results than Longest Common Subsequence, because it requires the matching characters to be contiguous as well as in the same order.
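
For comparison, a minimal sketch of longest common substring length. The class name `LongestCommonSubstring` is hypothetical, and inputs are assumed to be normalized (case, punctuation) beforehand.

```java
// Hypothetical helper class for illustration; not part of any library.
final class LongestCommonSubstring {

    /** Length of the longest run of characters that appears contiguously in both strings. */
    static int length(String a, String b) {
        int best = 0;
        // prev[j] = length of the common suffix of a[0..i-1) and b[0..j)
        int[] prev = new int[b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1; // extend the run ending at i-1, j-1
                    best = Math.max(best, curr[j]);
                }
                // on a mismatch curr[j] stays 0, which resets the run
            }
            prev = curr;
        }
        return best;
    }
}
```

Unlike the subsequence variant, a single differing character in the middle of a name splits the match into two shorter runs, so this measure is stricter but also more sensitive to typos.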




Answer 8:


You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.



Source: https://stackoverflow.com/questions/322701/matching-inexact-company-names-in-java
