Removing “almost duplicates” using SAS or Excel

橙三吉。 Submitted on 2020-01-24 05:39:06

Question


I am working in SAS with a dataset that has two columns, and I want to remove not only the exact duplicates but also the "almost" duplicates. The data looks like this:

    **Brand        Product**
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Lgt
    Coca Cola    Cocacolalight
    Coca Cola    Coca Cola Vanila
      Pepsi       Pepsi Zero
      Pepsi       Pepsi Zro

I do not know whether this is actually possible, but after removing the "duplicates" I would like the file to look like this:

    **Brand        Product**
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
      Pepsi       Pepsi Zero

I have no preference whether the final table contains, e.g., "Pepsi Zero" or "Pepsi Zro", as long as there are no "duplicate" values.

I was wondering whether there is a way to compare, e.g., the first 4-5 letters and treat the rows as duplicates if those letters are the same. But of course I am open to suggestions. If there is a way to do this in Excel, I would be interested to hear that too.


Answer 1:


I'm going to start by straight-up quoting Jeff's answer:

SAS has at least a couple functions for calculating edit distance between two strings:

Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.
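As a rough illustration of the three functions named above, the data step below scores one pair of spellings from the question. The variable names (`a`, `b`, `lev`, `ged`, `spd`) are made up for this sketch. It writes the scores to the log so you can eyeball them and pick a cutoff; no expected values are listed here because compged() and spedis() weight each kind of edit differently.

```sas
/* Sketch only: compare one pair of product spellings three ways. */
data _null_;
  a = 'Pepsi Zero';
  b = 'Pepsi Zro';
  lev = complev(a, b);  /* Levenshtein edit distance */
  ged = compged(a, b);  /* generalized edit distance */
  spd = spedis(a, b);   /* spelling distance of query a from keyword b */
  put lev= ged= spd=;   /* print the three scores to the log */
run;
```

For all three functions a lower score means the strings are closer, so a "near duplicate" rule would be a threshold on the score, chosen by looking at your own data.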

Those are all great, but my personal favorite is the soundex() function, which lets you test whether two words 'sound' the same. It's not going to be 100% correct, but in this case the results work all right.

First some data:

    /* Sample data: brand in NAME, product spelling in ALT_NAME */
    data have;
      attrib name length=$20 alt_name length=$20;
      /* comma-delimited input; TRUNCOVER tolerates short lines */
      infile datalines dsd dlm=',' truncover;
      input name $ alt_name $;
      datalines;
    Coca Cola    ,Coca Cola Light
    Coca Cola    ,Coca Cola Lgt
    Coca Cola    ,Cocacolalight
    Coca Cola    ,Coca Cola Vanila
    Pepsi        ,Pepsi Zero
    Pepsi        ,Pepsi Zro
    ;
    run;

Get every combination of words that we want to compare, and calculate the soundex()s for eyeballing:

    proc sql noprint;
      create table cartesian as
      select a.name,
             a.alt_name as alt_name1,
             b.alt_name as alt_name2,
             soundex(a.alt_name) as soundex_a,
             soundex(b.alt_name) as soundex_b
      /* self-join: pair each product with every product of the same brand */
      from have a, have b
      where a.name = b.name
        /* keep only the pairs whose soundex codes agree */
        and soundex(a.alt_name) eq soundex(b.alt_name)
      ;
    quit;

Now I'll leave deduping the resulting list to you as an exercise, but basically this tells you which words match up. If you get false positives among the matches, add them to an exception list and transform those particular values manually.
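One possible way to do that dedupe step, as a minimal sketch rather than the only approach: store the soundex code as a key on the `have` table and let PROC SORT keep one row per brand and code. The names `keyed`, `want` and `sx` are hypothetical, and this assumes that exact equality of soundex codes is the matching rule you want.

```sas
/* Sketch: keep one row per (brand, soundex code). */
data keyed;
  set have;
  length sx $20;           /* hypothetical key; the code is never longer than the input */
  sx = soundex(alt_name);
run;

/* NODUPKEY writes only the first row of each NAME/SX group to WANT. */
proc sort data=keyed out=want nodupkey;
  by name sx;
run;
```

Which spelling survives in each group depends on row order, which fits the stated lack of preference between "Pepsi Zero" and "Pepsi Zro". The `sx` column is left in `want` so that false positives are easy to spot.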



Source: https://stackoverflow.com/questions/29373148/removing-almost-duplicates-using-sas-or-excel
