Fuzzy matching of product names

长发绾君心 2020-12-12 16:28

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.

For example \"

11 answers
  •  春和景丽
    2020-12-12 17:07

    edg's answer is in the right direction, I think - you need to distinguish key words from fluff.

    Context matters. To take your example, Core 2 Duo is fluff when comparing two instances of a T400, but not when looking at a CPU OEM package.

    If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
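    As a rough sketch of that idea, assume each canonical entry carries a hand-marked set of "required" tokens plus a set of optional "fluff" tokens. The `CanonicalProduct` class, the `candidates` function, and the two catalog entries below are hypothetical names invented for illustration, not part of any existing schema:

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CanonicalProduct:
    name: str
    required: FrozenSet[str]               # tokens a human marked as identifying
    optional: FrozenSet[str] = frozenset()  # "fluff": helps ranking, never required


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    CanonicalProduct(
        "Lenovo ThinkPad T400",
        required=frozenset({"lenovo", "t400"}),
        optional=frozenset({"thinkpad", "core", "2", "duo"}),
    ),
    CanonicalProduct(
        "Intel Core 2 Duo (OEM)",
        required=frozenset({"intel", "core", "2", "duo"}),
        optional=frozenset({"oem"}),
    ),
]


def tokens(text: str) -> set:
    """Very naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def candidates(raw_name: str, catalog: List[CanonicalProduct]) -> List[str]:
    """Return canonical names whose required tokens all occur in raw_name,
    ranked by how many optional tokens also occur."""
    raw = tokens(raw_name)
    scored = []
    for product in catalog:
        if product.required <= raw:  # every required token is present
            scored.append((len(product.optional & raw), product.name))
    return [name for _, name in sorted(scored, reverse=True)]
```

    Here "Core 2 Duo" only helps rank the T400 entry, while for the CPU entry the same tokens are mandatory, which is the context dependence described above.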

    You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
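    One way to get such an equivalence class is to normalize every name before comparing. The sketch below is a deliberately crude take on the "numbers bind to adjacent letters" rule: it joins a short letter prefix (one to three letters, an arbitrary cutoff chosen for this example) to the digits that follow it. The function name `normalize_model_tokens` is made up for illustration:

```python
import re


def normalize_model_tokens(name: str) -> str:
    """Map variants such as 'T-400', 'T 400' and 't400' to the same string."""
    text = name.upper()
    # Join a 1-3 letter prefix to the digits after it, dropping a single
    # hyphen or whitespace character in between: "T-400" / "T 400" -> "T400".
    text = re.sub(r"\b([A-Z]{1,3})[\s-]?(\d+)", r"\1\2", text)
    # Collapse any remaining runs of whitespace.
    return " ".join(text.split())


variants = ["Lenovo T-400", "lenovo t 400", "Lenovo T400"]
normalized = {normalize_model_tokens(v) for v in variants}
```

    All three variants collapse to a single normalized string. The rule will also glue any other short word to a following number, so in practice it would need exceptions or a per-manufacturer pattern.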

    Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854

    Designing everything in a flexible, mostly rule-driven framework, where the rules can be modified as your needs change and as bad patterns emerge (read: things that break your algorithm), would also be a good idea. That way you can improve the system's performance based on real-world data.
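    A minimal sketch of what "rule driven" could look like: the rules live in an ordered list of data rather than in the control flow, and each run records which rules fired so that a bad match can be traced back to the rule that caused it. The rule names and the `apply_rules` helper are assumptions made for this example:

```python
import re
from typing import Callable, List, Tuple

Rule = Tuple[str, Callable[[str], str]]

# Ordered list of (label, transformation). Adding, removing or reordering
# entries changes behaviour without touching apply_rules itself.
RULES: List[Rule] = [
    ("lowercase", str.lower),
    ("strip_punctuation", lambda s: re.sub(r"[^\w\s-]", " ", s)),
    ("collapse_whitespace", lambda s: " ".join(s.split())),
]


def apply_rules(name: str, rules: List[Rule] = RULES) -> Tuple[str, List[str]]:
    """Apply each rule in order; return the result and the labels of the
    rules that actually changed the string."""
    fired = []
    for label, rule in rules:
        new_name = rule(name)
        if new_name != name:
            fired.append(label)
        name = new_name
    return name, fired


result, fired = apply_rules("Lenovo  T400 (Black)")
```

    When real-world data exposes a new bad pattern, the fix is another entry in `RULES`, and the list of fired rules gives a starting point for debugging.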
