Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

粉色の甜心 2020-12-02 17:45

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries.

3 Answers
  • 2020-12-02 17:59

    This method allows the 'id' column name to be defined with a variable. I also find it a little easier to read than the assign or groupby methods.

    import pandas as pd

    # Create the example DataFrame
    df = pd.DataFrame({
        'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
        'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

    newIdName = 'id'   # Set the new column name here.

    df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes
    

    Output:

    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
    
  • 2020-12-02 18:04

    This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

    df['id'] = df.groupby(['LastName','FirstName']).ngroup()
    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
    

    I checked timings and, for the small dataset in this example, Alexander's answer is faster:

    %timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
    1000 loops, best of 3: 848 µs per loop
    
    %timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
    1000 loops, best of 3: 1.22 ms per loop
    

    However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

    import faker
    import pandas as pd

    fakenames = faker.Faker()
    first = [fakenames.first_name() for _ in range(5000)]
    last = [fakenames.last_name() for _ in range(5000)]
    df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
    df2 = pd.concat([df2, df2.iloc[:2000]])
    

    Running the timing on this larger data set gives:

    %timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
    100 loops, best of 3: 5.22 ms per loop
    
    %timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
    100 loops, best of 3: 3.1 ms per loop
    

    You may want to test both approaches on your data set to determine which one works best given the size of your data.
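    For readers running plain Python rather than IPython, the comparison above can be reproduced with the standard timeit module. This is a minimal sketch that uses a small synthetic frame of repeated names in place of the faker data; the names and row counts here are purely illustrative:

    ```python
    import timeit

    import pandas as pd

    # Synthetic stand-in for the faker data: 3000 rows, 3 unique names.
    names = ['Tom Jones', 'David Smith', 'Alex Thompson'] * 1000
    first, last = zip(*(n.split() for n in names))
    df2 = pd.DataFrame({'FirstName': first, 'LastName': last})

    # Time each approach over 100 runs.
    cat_time = timeit.timeit(
        lambda: df2.assign(
            id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes),
        number=100)
    grp_time = timeit.timeit(
        lambda: df2.assign(id=df2.groupby(['LastName', 'FirstName']).ngroup()),
        number=100)

    print(f'category: {cat_time:.3f}s  groupby: {grp_time:.3f}s')
    ```

    Absolute timings will vary by machine and pandas version, which is why testing on your own data is the safest guide.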

  • 2020-12-02 18:20

    You could join the last name and first name, convert it to a category, and then get the codes.

    Of course, multiple people with the same name would have the same id.

    df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
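    A closely related sketch (not part of the original answer): pd.factorize builds the same kind of integer ids, but numbers keys in order of first appearance, whereas .cat.codes numbers them in sorted category order. In this example the two orders happen to coincide:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
        'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

    # factorize returns integer codes plus the unique keys, in first-appearance order.
    codes, uniques = pd.factorize(df['LastName'] + '_' + df['FirstName'])
    df['id'] = codes
    print(df)
    ```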
    