Matching inexact company names in Java

Submitted by 六眼飞鱼酱① on 2019-12-04 03:12:53

You could standardize the formats as much as possible in both your DB/map and the input (e.g. convert everything to lowercase), then use the Levenshtein (edit) distance, which is computed with dynamic programming, to score the input against all your known names.
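A minimal sketch of that idea, assuming the known names fit in an in-memory list. The class and method names (LevenshteinMatcher, normalize, distance, bestMatch) are made up for illustration, and the normalization shown is only lowercasing plus whitespace cleanup:

```java
import java.util.List;
import java.util.Locale;

final class LevenshteinMatcher {

    // Normalize: trim, lowercase, and collapse runs of whitespace.
    static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Dynamic-programming edit distance, keeping only two rows of the table.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known name closest to the input, or null if the list is empty.
    static String bestMatch(String input, List<String> knownNames) {
        String normalizedInput = normalize(input);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownNames) {
            int d = distance(normalizedInput, normalize(candidate));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

In practice you would probably also want a cutoff (say, reject the best match if its distance is large relative to the name length), so that unrelated input is not forced onto some known name.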

You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)

Although this thread is a bit old, I recently did an investigation on the efficiency of string distance metrics for name matching and came across this library:

https://code.google.com/p/java-similarities/

If you don't want to spend ages implementing string distance algorithms, I recommend giving it a try as a first step. About 20 different algorithms are already implemented (including Levenshtein, Jaro-Winkler and Monge-Elkan), and the code is structured well enough that you don't have to understand the whole logic in depth; you can start using it in minutes.

(BTW, I'm not the author of the library, so kudos for its creators.)

You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.
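Roughly what I mean, as a sketch rather than the code from my album. The suffix list and the names LcsScorer, normalize, lcsLength and score are just illustrative assumptions, and the normalization only keeps ASCII letters and digits:

```java
import java.util.Locale;

final class LcsScorer {

    // Lowercase, remove dots (so "L.L.C." becomes "llc"), turn other punctuation
    // into spaces, drop a few legal suffixes, then remove all spaces.
    // The suffix list is an illustrative assumption; extend it for your data.
    static String normalize(String name) {
        String s = name.toLowerCase(Locale.ROOT)
                .replace(".", "")
                .replaceAll("[^a-z0-9 ]", " ");
        s = s.replaceAll(
                "\\b(co|corp|inc|llc|ltd|company|corporation|incorporated|limited)\\b", " ");
        return s.replaceAll("\\s+", "");
    }

    // Length of the longest common subsequence, via the standard DP table.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    // Score in [0, 1]: LCS length divided by the longer normalized length.
    static double score(String x, String y) {
        String a = normalize(x);
        String b = normalize(y);
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 0.0 : (double) lcsLength(a, b) / longer;
    }
}
```

With this, "ACME, L.L.C." and "Acme Ltd" both normalize to "acme" and score 1.0, while unrelated names score lower.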

Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.

Your database may support regular expressions (regex). Here's the link to the MySQL documentation as an example (some Java tutorials are listed further down):

http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp

You would probably want to store in the database a fairly complex regular expression for each company, one that covers the spelling variations you anticipate, or the sub-elements of the company name that you would like to weight as significant.

You can also use the regex library in Java

JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html

Using Regular Expressions in Java
http://www.regular-expressions.info/java.html

The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/
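As a rough illustration of the one-pattern-per-company idea using java.util.regex, here is a sketch. The two companies and their patterns are hypothetical examples hard-coded in a map; in a real setup the patterns would presumably be loaded from the database:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class RegexNameMatcher {

    // Canonical name -> pattern covering the spelling variants we anticipate.
    // Both entries are made-up examples, not a vetted rule set.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("Procter & Gamble", Pattern.compile(
                "^\\s*proct[eo]r\\s*(&|and)\\s*gamble\\b.*$", Pattern.CASE_INSENSITIVE));
        PATTERNS.put("AT&T", Pattern.compile(
                "^\\s*a\\s*t\\s*(&|and)\\s*t\\b.*$", Pattern.CASE_INSENSITIVE));
    }

    // Returns the canonical name of the first pattern that matches the whole
    // input, or null if none does.
    static String match(String input) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(input).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

For example, match("Proctor and Gamble Co.") and match("AT & T Inc") resolve to their canonical names, while an unknown name returns null. The obvious downside is that someone has to write and maintain a pattern for every company.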

You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
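If the database has no Soundex function, the same idea can be applied on the Java side. Below is a sketch of the classic four-character American Soundex; the class name SimpleSoundex is made up, non-ASCII letters are simply skipped, and a database's own SOUNDEX() may differ in details such as the length of the code, so don't mix codes from the two sources:

```java
import java.util.Locale;

final class SimpleSoundex {

    // Code for each letter A..Z: digits 1-6 for consonant groups,
    // '0' for vowels and Y (separators), '-' for H and W (ignored).
    private static final String CODES = "0123012-02245501262301-202";

    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char lastCode = 0;
        for (char ch : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (ch < 'A' || ch > 'Z') {
                continue; // skip spaces, digits, punctuation, non-ASCII letters
            }
            char code = CODES.charAt(ch - 'A');
            if (out.length() == 0) {
                out.append(ch);   // the first letter is kept as-is
                lastCode = code;
            } else if (code == '-') {
                continue;         // H/W: does not separate equal codes
            } else if (code == '0') {
                lastCode = '0';   // vowel: separates equal codes
            } else if (code != lastCode) {
                out.append(code); // new consonant group
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0');      // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Two names are then treated as candidates for the same company when their codes are equal, e.g. "Microsoft" and "Mikrosoft" get the same code. Soundex was designed for English surnames, so expect a fair number of false positives on company names.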



To be more precise: rather than Longest Common Subsequence, Longest Common Substring should give better results, since it requires the matching characters to be contiguous as well as in the same order.
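A small sketch of the substring variant, to show the difference. The class name LongestCommonSubstring is made up, and the inputs are assumed to be normalized already:

```java
final class LongestCommonSubstring {

    // Length of the longest run of characters that appears contiguously in
    // both strings. curr[j] holds the length of the common run ending at
    // a[i-1] and b[j-1]; it stays 0 whenever the characters differ.
    static int length(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }
}
```

For "abcde" and "axcxe" the longest common subsequence has length 3 ("ace") but the longest common substring only 1, so scattered coincidental matches count for much less.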

You could use Lucene to index your database, then query the Lucene index. There are a number of search engines built on top of Lucene, including Solr.
