Data matching algorithm


I am currently working on a project where a data matching algorithm needs to be implemented. An external system passes in all data it knows about a customer, and the syste


4 Answers

佛祖请我去吃肉

2021-01-06 00:57

For inspiration, look at the Levenshtein distance algorithm. This will give you a reasonable mechanism to weight your comparisons.

I would also add that in my experience you can never match two arbitrary pieces of data into the same entity with absolute certainty. You need to present plausible matches to a user, who can then verify for sure that John Smith on 1920 E. Pine is the same person as Jon Smith on 192 East Pine Road or not.
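
As a rough sketch of the Levenshtein idea, the class below computes the edit distance between two strings and turns it into a score between 0 and 1 that could be used to rank plausible matches for a user to review. The class name `LevenshteinSketch` and the `similarity` normalization are assumptions made for illustration; they are not from the original answer.

```java
// Hypothetical helper: classic dynamic-programming Levenshtein distance.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

For the names in the example above, `LevenshteinSketch.distance("John Smith", "Jon Smith")` is 1, so the pair scores high but still short of an exact match, which is the kind of candidate you would hand to a user for confirmation.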

谎友^

2021-01-06 01:07

If you limit yourself to the address and name, you can just use the haversine formula or a spatial index if you have the geolocation. For the name you can use a trie and get only the first results, maybe 10.
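
A minimal sketch of both ideas, assuming coordinates in decimal degrees and a mean Earth radius of 6371 km. The class name `GeoAndPrefixSketch`, the method names and the nested `Trie` class are hypothetical and only illustrate the technique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GeoAndPrefixSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    // Great-circle distance between two points via the haversine formula.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Character trie that returns at most `limit` stored names for a prefix.
    static final class Trie {
        private final Map<Character, Trie> children = new TreeMap<>();
        private boolean terminal; // true if a stored name ends at this node

        void insert(String word) {
            Trie node = this;
            for (char ch : word.toCharArray()) {
                node = node.children.computeIfAbsent(ch, k -> new Trie());
            }
            node.terminal = true;
        }

        List<String> startingWith(String prefix, int limit) {
            List<String> out = new ArrayList<>();
            Trie node = this;
            for (char ch : prefix.toCharArray()) {
                node = node.children.get(ch);
                if (node == null) {
                    return out; // no stored name has this prefix
                }
            }
            node.collect(new StringBuilder(prefix), limit, out);
            return out;
        }

        // Depth-first walk in alphabetical order, stopping once limit is reached.
        private void collect(StringBuilder path, int limit, List<String> out) {
            if (out.size() >= limit) return;
            if (terminal) out.add(path.toString());
            for (Map.Entry<Character, Trie> e : children.entrySet()) {
                if (out.size() >= limit) return;
                path.append(e.getKey());
                e.getValue().collect(path, limit, out);
                path.deleteCharAt(path.length() - 1);
            }
        }
    }
}
```

Calling `startingWith(prefix, 10)` gives the "first results, maybe 10" behaviour; names would normally be lower-cased or otherwise normalized before insertion, which this sketch leaves to the caller.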

旧巷少年郎

2021-01-06 01:15

What about a machine learning approach? Create distances per item.

These become your input space. Build a training set of correctly matched customers based on these distances. Run it through your favourite machine learning algorithm. Get the parameters for a decision function that reflects the strength of a match. Tune. Apply to new cases. Go to the bank.
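
One way this could look, as a sketch only: a tiny logistic regression whose inputs are per-field distances (for example a normalized name distance and address distance) and whose output is a match probability. Logistic regression is just one possible learner, chosen here for brevity; the class name `MatchClassifierSketch` and the toy training numbers are invented for illustration and are not real results.

```java
// Hypothetical sketch: logistic regression over per-field distances.
final class MatchClassifierSketch {
    private final double[] weights;
    private double bias;

    MatchClassifierSketch(int featureCount) {
        this.weights = new double[featureCount];
    }

    // Decision function: estimated probability that the two records match.
    double probability(double[] distances) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * distances[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Plain stochastic gradient descent on the log-loss.
    void train(double[][] samples, int[] labels, int epochs, double learningRate) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int s = 0; s < samples.length; s++) {
                double error = labels[s] - probability(samples[s]);
                for (int i = 0; i < weights.length; i++) {
                    weights[i] += learningRate * error * samples[s][i];
                }
                bias += learningRate * error;
            }
        }
    }

    // Made-up training pairs: {nameDistance, addressDistance}, each in [0, 1].
    static MatchClassifierSketch trainedOnToyData() {
        double[][] samples = {
            {0.0, 0.1}, {0.1, 0.0}, {0.1, 0.2}, {0.2, 0.1}, // labelled as matches
            {0.8, 0.9}, {0.9, 0.7}, {0.7, 0.8}, {0.6, 0.9}  // labelled as non-matches
        };
        int[] labels = {1, 1, 1, 1, 0, 0, 0, 0};
        MatchClassifierSketch model = new MatchClassifierSketch(2);
        model.train(samples, labels, 2000, 0.5);
        return model;
    }
}
```

The "tune" step then amounts to choosing the probability threshold above which a pair is treated as a match, which is a trade-off between false matches and missed matches.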

死守一世寂寞

2021-01-06 01:21

In my experience with this sort of thing, it was actually the business people who defined the rules of what was acceptable as a match, rather than it being a technical decision. This made sense to me, since the business ends up assuming the risk. Also, what constitutes a match can be prone to change, for example if they use the system and find that too many people are being excluded.

I think that your first approach makes more sense, in that if you can match someone by name and bank account number, then you're pretty sure it's them. However, if both the name and bank info don't match, but the address, phone, and all that match (i.e. a spouse), then the scoring system might incorrectly match people. I realize it's a lot of code, but so long as you extract out the actual matching code (a matchPhoneNumber method, etc.), then it's fine design-wise.
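
As an illustration of what one extracted matcher might look like, here is a sketch of `matchPhoneNumber`. The normalization rule (strip everything except digits) and the class name `PhoneMatchSketch` are assumptions; the real rule would come from the business side.

```java
// Hypothetical sketch of one extracted matching method.
final class PhoneMatchSketch {

    // Keep only the digits so that formatting differences are ignored.
    static String digitsOnly(String raw) {
        return raw == null ? "" : raw.replaceAll("\\D", "");
    }

    // Two numbers match if both contain digits and the digits are identical.
    // Missing numbers never count as a match.
    static boolean matchPhoneNumber(String a, String b) {
        String da = digitsOnly(a);
        String db = digitsOnly(b);
        return !da.isEmpty() && da.equals(db);
    }
}
```

Keeping each rule in a small method like this means a change to what counts as "the same phone number" touches one place only.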

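I would probably take it a step further and pull out the matching into its own type, such as an enum or an interface, and then have lists of acceptable matches. Sort of like this:
```
import java.util.Objects;

interface Match
{
    boolean matches(Customer c1, Customer c2);
}

class BankAccountMatch implements Match
{
    public boolean matches(Customer c1, Customer c2)
    {
        // Objects.equals compares by value even if the account number is a
        // String or boxed type; == would compare object references there.
        return Objects.equals(c1.getBankAccountNumber(), c2.getBankAccountNumber());
    }
}

static Match BANK_ACCOUNT_MATCH = new BankAccountMatch();
// NAME_MATCH, ADDRESS_MATCH and FAX_MATCH would be defined the same way.

// Each row is one combination of rules that is accepted as a match.
Match[][] validMatches = new Match[][] {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH, FAX_MATCH}, ...
};
```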
And then the code that does the validation would just iterate over the validMatches array and test them to see if one fits. I might even pull out the lists of valid matches into a config file. That all depends on the level of robustness your system needs though.
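
The validation loop described above could be sketched roughly as follows. The `Customer` fields, the lambda-based rules, the reduced rule table and the `isSameCustomer` name are all assumptions made to keep the example self-contained.

```java
// Hypothetical minimal customer record for the sketch.
final class Customer {
    final String name;
    final String bankAccountNumber;
    final String address;

    Customer(String name, String bankAccountNumber, String address) {
        this.name = name;
        this.bankAccountNumber = bankAccountNumber;
        this.address = address;
    }
}

interface Match {
    boolean matches(Customer c1, Customer c2);
}

final class MatchRules {
    // Missing values never count as equal.
    private static boolean sameNonNull(String a, String b) {
        return a != null && a.equals(b);
    }

    static final Match NAME_MATCH = (a, b) -> sameNonNull(a.name, b.name);
    static final Match BANK_ACCOUNT_MATCH =
            (a, b) -> sameNonNull(a.bankAccountNumber, b.bankAccountNumber);
    static final Match ADDRESS_MATCH = (a, b) -> sameNonNull(a.address, b.address);

    // Each row lists rules that must ALL hold; any one row is enough.
    static final Match[][] VALID_MATCHES = {
        {BANK_ACCOUNT_MATCH, NAME_MATCH},
        {NAME_MATCH, ADDRESS_MATCH},
    };

    static boolean isSameCustomer(Customer c1, Customer c2) {
        for (Match[] ruleSet : VALID_MATCHES) {
            boolean allMatch = true;
            for (Match rule : ruleSet) {
                if (!rule.matches(c1, c2)) {
                    allMatch = false;
                    break; // this combination fails, try the next row
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }
}
```

With this table, two records sharing only an address (the spouse case) are not matched, because no row consists of the address rule alone.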