Fuzzy matching of product names

长发绾君心 2020-12-12 16:28

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.

For example \"

11 answers

春和景丽 (original poster)

2020-12-12 17:07

I think edg's answer points in the right direction: you need to distinguish key words from fluff.

Context matters. To take your example, Core 2 Duo is fluff when comparing two instances of a T400, but not when looking at a CPU OEM package.

If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
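As a rough illustration of that idea, here is a minimal sketch in which each canonical entry carries a hand-marked set of required tokens, and a raw name is only a candidate match when all of them appear. The `CanonicalProduct` class, the `candidates` helper, and the two catalog entries are hypothetical, made up for this example rather than taken from any real schema.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalProduct:
    name: str
    # Tokens a human has marked as essential for identifying this product.
    required: frozenset


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidates(raw_name, catalog):
    """Return catalog entries whose required tokens all occur in raw_name."""
    tokens = tokenize(raw_name)
    return [p for p in catalog if p.required <= tokens]


# Hypothetical catalog: "Core 2 Duo" is fluff for the laptop entry,
# but part of the required set for the CPU entry.
LAPTOP = CanonicalProduct("T400 laptop", frozenset({"t400"}))
CPU = CanonicalProduct("Core 2 Duo CPU (OEM)", frozenset({"core", "2", "duo", "oem"}))
CATALOG = [LAPTOP, CPU]

laptop_hits = candidates("T400 notebook Core 2 Duo", CATALOG)
cpu_hits = candidates("Core 2 Duo OEM package", CATALOG)
```

With this setup the first raw name should match only the laptop entry, because the CPU entry also requires "oem". Extra words in the raw name never disqualify a match; only missing required tokens do.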

You can try to define equivalence classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers".
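One possible way to get such equivalence classes is to normalize every spelling to a single key. The sketch below assumes that a one- or two-letter prefix separated from a number by spaces or hyphens belongs to that number; the `normalize` function and that length limit are my own assumptions and would need tuning against real data.

```python
import re

# A short letter prefix (1-2 letters), then spaces/hyphens, then digits.
# The length limit keeps ordinary words like "core" from being glued
# to a following number.
_MODEL_SPLIT = re.compile(r"\b([a-z]{1,2})[\s\-]+(\d+)\b")


def normalize(name):
    """Map spelling variants of a model number onto one canonical key."""
    s = name.lower()
    s = _MODEL_SPLIT.sub(r"\1\2", s)   # "t-400" / "t 400" -> "t400"
    return " ".join(s.split())         # collapse repeated whitespace


variants = ["T-400", "T400", "T 400"]
keys = {normalize(v) for v in variants}
```

All three variants should end up in the same class, while a name such as "Core 2 Duo" is left alone because "core" is longer than the assumed prefix limit.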

Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
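To make the "break it down into cases" idea a bit more concrete, here is a deliberately tiny sketch that pulls a manufacturer and a model-like token out of a raw name. It is a very reduced stand-in for real term-spotting techniques: the `KNOWN_MANUFACTURERS` lexicon and the "a model token contains a digit" heuristic are assumptions chosen for illustration only.

```python
import re

# Hypothetical lexicon; in practice this would come from the database.
KNOWN_MANUFACTURERS = {"canon", "nikon", "sony"}

# Assumed heuristic: a model token is alphanumeric and contains a digit.
MODEL_PATTERN = re.compile(r"^(?=.*\d)[a-z0-9\-]+$")


def spot_terms(name):
    """Pick out a manufacturer and a model-like token, if present."""
    tokens = name.lower().split()
    manufacturer = next((t for t in tokens if t in KNOWN_MANUFACTURERS), None)
    model = next((t for t in tokens if MODEL_PATTERN.match(t)), None)
    return {"manufacturer": manufacturer, "model": model}


camera = spot_terms("Canon EOS 450D digital camera")
laptop = spot_terms("T400 notebook")
```

Once the fields are separated like this, each case (same manufacturer, same model, and so on) can be compared with its own rules instead of one global string distance.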

It would also be a good idea to design everything as a flexible, mostly rule-driven framework, where the rules can be modified based on your needs and on emerging bad patterns (read: things that break your algorithm). This way you'd be able to improve the system's performance based on real-world data.
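A rule-driven design could look something like the sketch below, where the rules are plain data applied in order, so a new rule can be added when a bad pattern shows up without touching the matching code. The `Rule` type, `apply_rules`, and the example rules are all hypothetical names for this illustration.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    description: str
    pattern: str       # regular expression to search for
    replacement: str   # text substituted for each match


DEFAULT_RULES = [
    Rule("join model prefix and number", r"\b([a-z]{1,2})[\s\-]+(\d+)\b", r"\1\2"),
    Rule("collapse whitespace", r"\s+", " "),
]


def apply_rules(name, rules):
    """Lowercase the name, then apply each rule in list order."""
    s = name.lower()
    for rule in rules:
        s = re.sub(rule.pattern, rule.replacement, s)
    return s.strip()


# A rule added later, after noticing that marketing words break matching.
drop_fluff = Rule("drop marketing words", r"\b(new|refurbished)\b", "")

before = apply_rules("NEW T-400  notebook", DEFAULT_RULES)
after = apply_rules("NEW T-400  notebook", [drop_fluff] + DEFAULT_RULES)
```

Because the rules are just a list, they could equally be loaded from a table or a config file and reviewed by whoever maintains the catalog.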
