Anonymizing data / replacing names

Posted by ◇◆丶佛笑我妖孽 on 2019-12-24 14:13:49

Question


Normally I anonymize my data by using hashlib and the .apply(hash) function.
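For reference, a minimal sketch of what such a hashing approach might look like. The choice of SHA-256 and the helper name hash_name are assumptions for illustration; the question does not say which hashlib function was used.

```python
import hashlib

import pandas as pd


def hash_name(name: str) -> str:
    """Hypothetical helper: return the SHA-256 hex digest of a name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# Sample data taken from the question.
data = pd.DataFrame({
    "contributor": ["eric", "frank", "john", "frank", "barbara"],
    "amount payed": [10, 28, 49, 77, 31],
})

# Identical names produce identical digests, so repeated contributors
# stay recognizable as the same (anonymous) person.
hashed = data["contributor"].apply(hash_name)
```

The digests are long and opaque, which is one reason readable labels such as 'person1' can be preferable.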

Now I'm trying a new approach. Imagine I have the following df called 'data':

contributor -- amount payed
eric -- 10
frank -- 28
john -- 49
frank -- 77
barbara -- 31

I want to anonymize it by turning all the names into 'person1', 'person2', etc., like this:

contributor -- amount payed
person1 -- 10
person2 -- 28
person3 -- 49
person2 -- 77
person4 -- 31

So my first thought was to summarize the name column so that each name is attached to a unique index, and I can use that index as the number after 'person'.

Now I'm stuck on how to iterate through my data.contributor column, look up the index in the summarize dataframe, and replace the actual name with, for example, 'person3'.

My code so far:

counter = 0
for names in data.contributor:
    if names == summarize.contributor[counter]:
         print(summarize.contributor[counter])
         data.contributor.replace(summarize.contributor[counter], "Person %d" % counter)
    counter = counter + 1

My thought was to put the names in a list together with an index, but I guess there's a faster way. Searching for 'Anthony' was just a test to see whether my code was working.


Answer 1:


I think a faster solution is to use factorize to number the unique values, add 1, convert the result to a Series of strings, and prepend the string 'Person':

df['contributor'] = 'Person' + pd.Series(pd.factorize(df['contributor'])[0] + 1).astype(str)
print (df)
  contributor  amount payed
0     Person1            10
1     Person2            28
2     Person3            49
3     Person2            77
4     Person4            31
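To make the steps explicit, here is a small sketch of what factorize returns, using the sample data from the question. The non-default index is an assumption added for illustration: a Series built from the codes gets a fresh 0..n-1 index, so on a frame with any other index the assignment would align on labels and produce NaN. Passing index=df.index avoids that.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "contributor": ["eric", "frank", "john", "frank", "barbara"],
        "amount payed": [10, 28, 49, 77, 31],
    },
    index=[10, 11, 12, 13, 14],  # deliberately not 0..4 (illustrative assumption)
)

# codes:   one integer per row, numbered in order of first appearance (from 0)
# uniques: the distinct names, in that same order
codes, uniques = pd.factorize(df["contributor"])

# Reusing df.index keeps the new labels aligned with the original rows.
df["contributor"] = "Person" + pd.Series(codes + 1, index=df.index).astype(str)
```

On a frame with the default RangeIndex, the one-liner above and this version should give the same result.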



Answer 2:


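Maybe try creating a data frame called "index" for this operation and keeping the unique name values in it?

Then produce masks from the unique name indexes and merge the resulting data frame index with data.

# lookup table: one row per unique name
index = pd.DataFrame()
index['contributor'] = data['contributor'].unique()
# the row position (+1) of each name becomes its person number
index['mask'] = index['contributor'].apply(
    lambda x: 'person' + str(index[index.contributor == x].index[0] + 1))

# attach the mask to every row of data via the shared 'contributor' column
data.merge(index, how='left')[['mask', 'amount payed']]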
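The same lookup-table idea can also be sketched with a plain dict and Series.map instead of a merge. This is a variation, not the answer's own code; the name lookup is hypothetical, and the sample data is taken from the question.

```python
import pandas as pd

data = pd.DataFrame({
    "contributor": ["eric", "frank", "john", "frank", "barbara"],
    "amount payed": [10, 28, 49, 77, 31],
})

# Map each distinct name to 'person<N>', numbered in order of first appearance.
lookup = {
    name: f"person{i}"
    for i, name in enumerate(data["contributor"].unique(), start=1)
}

# Replace every name by its label; rows keep their original order and index.
data["contributor"] = data["contributor"].map(lookup)
```

Keeping lookup around (and private) also leaves a way to trace a label back to the original name if that is ever needed.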



Answer 3:


# factorize gives each distinct name an integer code, starting at 0
labels, uniques = pd.factorize(df['contributor'])
labels = ['person_' + str(l) for l in labels]
df['contributor_anonymized'] = labels
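Note that factorize codes start at 0, so this snippet numbers the first contributor 'person_0'. A brief sketch of the difference, assuming the sample data from the question and a column named 'contributor'; the names zero_based and one_based are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "contributor": ["eric", "frank", "john", "frank", "barbara"],
    "amount payed": [10, 28, 49, 77, 31],
})

labels, uniques = pd.factorize(df["contributor"])

# Codes as returned by factorize: numbering starts at 0.
zero_based = ["person_" + str(l) for l in labels]

# Shifted by one to match the 'person1', 'person2', ... numbering in the question.
one_based = ["person_" + str(l + 1) for l in labels]
```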


Source: https://stackoverflow.com/questions/49309060/anonymizing-data-replacing-names
