Using full name and maiden name strings (and birthdays) to match individuals across time

Submitted by 你。 on 2019-12-02 14:01:25

Question


I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.

Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.

Ideally, I would create a time-independent ID for each unique individual.

I can do this pretty easily in R for those whose marital status (maiden name) does not change: stack the data sets to get a long panel, then do something to the effect of:

unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]

(I'm using R's data.table, of course), and then merge the IDs back to the full data.

However, I'm stuck on how to incorporate the maiden name into this procedure. Any suggestions?
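One possible direction, sketched under assumptions rather than tested on the real data: build a "birth surname" column that equals nee when a maiden name is recorded and last_name otherwise, then use it in place of last_name in the key. This assumes nee is empty or NA for anyone who has never changed their name, which the data preview does not confirm. The toy rows and the column name birth_surname are hypothetical.

```r
library(data.table)

# Hypothetical toy panel: "eileen" appears under her birth surname in 2001
# and under a married surname (with nee filled in) from 2002 onward
dt <- data.table(
  first_name = c("eileen", "eileen", "eileen", "sarah"),
  last_name  = c("dxxxx", "aaldxxxx", "aaldxxxx", "aaxxxx"),
  nee        = c(NA, "dxxxx", "dxxxx", "gexxxx"),
  birth_year = c(1977L, 1977L, 1977L, 1974L),
  year       = c(2001L, 2002L, 2003L, 2003L)
)

# Assumed rule: the surname at birth is nee when present, else last_name
dt[, birth_surname := ifelse(is.na(nee) | nee == "", last_name, nee)]

# Same idea as the unique()-based ID, but keyed on the birth surname
key_cols <- c("first_name", "birth_surname", "birth_year")
ids <- unique(dt, by = key_cols)[, id := .I][, c(key_cols, "id"), with = FALSE]

# Merge the IDs back onto every row of the panel
dt <- merge(dt, ids, by = key_cols)
```

With this key, all three "eileen" rows share one id even though last_name changes between 2001 and 2002. Two different people with the same first name, birth surname, and birth year would still be merged, so the key is only as strong as those three fields.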

Here's a preview of the data:

       first_name     last_name       nee birth_year year
    1:     eileen      aaldxxxx     dxxxx       1977 2002
    2:     eileen      aaldxxxx     dxxxx       1977 2003
    3:      sarah        aaxxxx    gexxxx       1974 2003
    4:      kelly        aaxxxx     nxxxx       1951 2008
    5:      linda aarxxxx-gxxxx   aarxxxx       1967 2008
   ---                                                   
72008:     stacey      zwirxxxx   kruxxxx       1982 2010
72009:     stacey      zwirxxxx   kruxxxx       1982 2011
72010:     stacey      zwirxxxx   kruxxxx       1982 2012
72011:     stacey      zwirxxxx   kruxxxx       1982 2013
72012:       jill      zydoxxxx gundexxxx       1978 2002

UPDATE:

I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments for possible improvements to the code so far.

I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
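As a rough illustration of what fuzzy matching on the stragglers could look like, here is a minimal sketch using base R's adist(), which computes the Levenshtein edit distance. The name vectors and the max_dist threshold are hypothetical; in practice one would probably compare only records that already agree on birth year, to limit false positives.

```r
# Hypothetical name lists: names already seen vs. unmatched stragglers
known      <- c("tanya", "jennifer", "stacey")
stragglers <- c("tonya", "jenifer", "linda")

# Edit distance between every straggler (rows) and every known name (columns)
d <- adist(stragglers, known)

# For each straggler, take the closest known name, but accept it only
# if it is within max_dist edits (an arbitrary illustrative threshold)
max_dist <- 1L
best     <- apply(d, 1L, which.min)
best_d   <- d[cbind(seq_along(stragglers), best)]
fuzzy    <- ifelse(best_d <= max_dist, known[best], NA_character_)
```

Here "tonya" and "jenifer" are each one edit away from "tanya" and "jennifer", while "linda" has no candidate within the threshold and stays NA.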

The basic approach is to build cumulatively: assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched. Then, for year 3, look back at the first 2 years, and so on. As for how to match, the idea is to loosen the matching criteria gradually: the more robust the match, the lower the chance of mismatching accidentally (I'm particularly worried about the John Smiths).
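The cumulative scheme can be sketched on toy data with a single exact matching key. This is a simplified illustration, not the actual matching function: the column names person_id and match_key and the toy rows are hypothetical, and the real procedure runs several progressively looser keys per year instead of one.

```r
library(data.table)

# Hypothetical toy panel; person_id starts out unassigned
dt <- data.table(
  first_name = c("eileen", "jill", "eileen", "sarah", "sarah", "kelly"),
  last_name  = c("aald", "zydo", "aald", "aax", "aax", "aax"),
  birth_year = c(1977L, 1978L, 1977L, 1974L, 1974L, 1951L),
  year       = c(2002L, 2002L, 2003L, 2003L, 2004L, 2004L),
  person_id  = NA_integer_
)
# One string per row summarising the (strictest) matching key
dt[, match_key := paste(first_name, last_name, birth_year, sep = "|")]

for (yy in sort(unique(dt$year))) {
  cur   <- dt[, which(year == yy)]
  prior <- dt[, which(year < yy)]
  # Step 1: carry IDs forward from any earlier year with the same key
  hit <- match(dt$match_key[cur], dt$match_key[prior])
  dt[cur, person_id := dt$person_id[prior][hit]]
  # Step 2: give the still-unmatched fresh IDs above every ID used so far
  new     <- cur[is.na(dt$person_id[cur])]
  next_id <- max(c(0L, dt$person_id), na.rm = TRUE)
  if (length(new) > 0L) {
    dt[new, person_id := next_id + match(match_key, unique(match_key))]
  }
}
```

Taking next_id over the whole table, rather than over the current year only, keeps new IDs from colliding with those of people who are absent in the current year.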

Without further ado, here's the main function for matching a pair of data sets:

# Relies on objects defined outside this function (not shown here):
#   full_data (assumed keyed by year, since it is subset with
#   full_data[.(yr)]), flags, and update_cols.
get_id <- function(yr, key_from, key_to = key_from,
                   mdis, msch, mard, init, mexp, step) {
  # Want to exclude anyone who is already matched in year yr
  existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
  # Get the most recent prior observation of all
  #   unmatched teachers, excluding those teachers
  #   who cannot be uniquely identified by the
  #   current key setting
  unmatched <-
    full_data[.(1996:(yr - 1))
              ][!teacher_id %in% existing_ids,
                .SD[.N], by = teacher_id,
                .SDcols = c(key_from, "teacher_id")
                ][, if (.N == 1L) .SD, keyby = key_from
                  ][, (flags) := list(mdis, msch, mard, init, mexp, step)]
  # Merge on the current key, then reset the key to year
  setkey(setkeyv(
    full_data, key_to)[year == yr & is.na(teacher_id),
                       (update_cols) := unmatched[.SD, update_cols, with = FALSE]],
    year)
  # Within year yr, spread the first non-missing value across rows sharing an id
  full_data[.(yr), (update_cols) := lapply(.SD, function(x) na.omit(x)[1]),
            by = id, .SDcols = update_cols]
}

Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:

get_id(yy, c("first_name_clean", "last_name_clean", "birth_year"),
       mdis = TRUE, msch = TRUE, mard = FALSE, init = FALSE, mexp = FALSE, step = 3L)

The final step is to assign new IDs:

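# Take the max over all years, not just yy, so that new IDs cannot
#   collide with IDs of teachers who are absent in year yy
current_max <- full_data[, max(teacher_id, na.rm = TRUE)]
new_ids <-
  setkey(full_data[year == yy & is.na(teacher_id), .(id = unique(id))
                   ][, add_id := .I + current_max], id)
setkey(setkey(full_data, id)[year == yy & is.na(teacher_id),
                             teacher_id := new_ids[.SD, add_id]], year)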

Source: https://stackoverflow.com/questions/29176114/using-name-full-name-and-maiden-name-strings-and-birthdays-to-match-individual
