Best machine learning technique for matching product strings

说谎 2020-12-23 12:50

Here's a puzzle...

I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the prod

3 answers
  •  刺人心 (original poster)
    2020-12-23 13:42

    Use a large set of training examples. For each possible pair in this example set:

    1. Parse each string into its components, viz. company, size_desc, display_type, make and so on.
    2. Compute the distance between each pair of corresponding components in the two strings.
    3. Collect those distances into a tuple of numbers, one entry per component.
    4. Label the tuple as identical or non-identical, according to whether the two strings in the training pair refer to the same product.
    5. Train a binary classifier (SVM) on the labeled tuples.
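A minimal sketch of steps 1 to 4, using only the Python standard library. Everything specific here is an assumption made for illustration: the regular expressions in `PATTERNS`, the brand list, the sample product strings, and the choice of `1 - SequenceMatcher.ratio()` as the per-component distance. A real catalogue would need a much richer parser.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for three of the components named above.
PATTERNS = {
    "company": re.compile(r"\b(sony|samsung|lg|panasonic)\b"),
    "size_desc": re.compile(r'\b(\d{2,3})[\s-]*(?:"|in\b|inch\b)'),
    "display_type": re.compile(r"\b(lcd|led|oled|plasma)\b"),
}


def parse_components(product):
    """Step 1: split a product string into named components ('' if absent)."""
    text = product.lower()
    parsed = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        parsed[name] = match.group(1) if match else ""
    return parsed


def distance_tuple(left, right):
    """Steps 2-3: one distance in [0, 1] per component, in PATTERNS order.

    Note: two missing components compare as distance 0.0, and a component
    missing on only one side compares as distance 1.0.
    """
    a, b = parse_components(left), parse_components(right)
    return tuple(
        1.0 - SequenceMatcher(None, a[name], b[name]).ratio()
        for name in PATTERNS
    )


# Step 4: hand-labeled pairs (+1 = same product, -1 = different product).
examples = [
    ('Sony 46" LCD TV', "SONY Bravia 46 inch LCD", 1),
    ('Sony 46" LCD TV', 'Samsung 40" LED', -1),
]
training_rows = [(distance_tuple(a, b), label) for a, b, label in examples]
```

Each entry of `training_rows` is a `(distances, label)` pair that can be handed to whichever classifier you choose for step 5.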

    Now, when you get a new pair of strings and want to decide whether they describe the same product, extract the features exactly as you did for the training set and build the tuple of distances between the corresponding components. Feed that tuple to the trained SVM, which classifies the pair as same or different.
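To illustrate the train-then-classify flow, here is a rough sketch in which a small hand-rolled linear SVM (subgradient descent on the regularised hinge loss) stands in for a real SVM library. The distance tuples, labels, and hyperparameters are made up for the example, and `train_linear_svm` and `is_same_product` are hypothetical helper names. In practice you would more likely use an existing SVM implementation.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularisation shrinks the weights on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # The sample is inside the margin or misclassified:
                # move the separating plane towards classifying it correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def is_same_product(w, b, distances):
    """Classify one distance tuple: True means 'same product'."""
    return sum(wi * xi for wi, xi in zip(w, distances)) + b > 0


# Made-up distance tuples (company, size_desc, display_type) and labels:
# +1 = same product, -1 = different product.
train_x = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.2, 0.0),
    (0.8, 0.5, 0.3), (1.0, 0.5, 1.0), (0.6, 1.0, 0.4),
]
train_y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)

# Tuples for two unseen pairs, built the same way as the training tuples.
near_match = is_same_product(w, b, (0.05, 0.0, 0.0))
clear_mismatch = is_same_product(w, b, (0.9, 0.5, 0.7))
```

The important point is that the new pair goes through the same feature extraction as the training pairs, so the classifier only ever sees distance tuples, never raw strings.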

    The advantage of a learning approach like this is that you don't have to keep modifying hand-written rules over and over again; the system learns what distinguishes matching from non-matching products from a large number of labeled pairs.

    You could use the LibSVM package in WEKA to do this.
