Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

Asked by 粉色の甜心 on 2020-12-02 17:45

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example:

3 Answers
  •  一整个雨季
    2020-12-02 18:04

    This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

    df['id'] = df.groupby(['LastName','FirstName']).ngroup()
    >>> df
    
       FirstName  LastName  id
    0        Tom     Jones   0
    1        Tom     Jones   0
    2      David     Smith   1
    3       Alex  Thompson   2
    4       Alex  Thompson   2
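
    A minimal, self-contained version of that example (the FirstName/LastName column names are assumed from the groupby keys) would look something like this:

    import pandas as pd

    # Small example frame with repeated individuals.
    df = pd.DataFrame({
        'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
        'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

    # ngroup() numbers the groups; with the default sort=True the ids follow
    # the sorted (LastName, FirstName) keys: Jones=0, Smith=1, Thompson=2.
    df['id'] = df.groupby(['LastName', 'FirstName']).ngroup()

    # Use df.groupby([...], sort=False).ngroup() instead if you want ids
    # assigned in order of first appearance.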
    

    I checked timings and, for the small dataset in this example, Alexander's answer is faster:

    %timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
    1000 loops, best of 3: 848 µs per loop
    
    %timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
    1000 loops, best of 3: 1.22 ms per loop
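
    For context, the first line timed above is Alexander's approach: it builds one 'LastName_FirstName' key string per row and uses the categorical codes as the id. Roughly, it works like this (same df as above):

    # Each distinct key string becomes its own category, and .cat.codes
    # maps every row to that category's integer code.
    key = df['LastName'] + '_' + df['FirstName']
    df['id'] = key.astype('category').cat.codes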
    

    However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to build a dataframe of 5000 first/last name pairs and then concatenated its first 2000 rows back onto it, giving 7000 rows, 2000 of which are guaranteed duplicates.

    import pandas as pd
    import faker

    # Generate 5000 random first/last name pairs.
    fakenames = faker.Faker()
    first = [fakenames.first_name() for _ in range(5000)]
    last = [fakenames.last_name() for _ in range(5000)]
    df2 = pd.DataFrame({'FirstName': first, 'LastName': last})

    # Append the first 2000 rows again so the frame contains known duplicates.
    df2 = pd.concat([df2, df2.iloc[:2000]])
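
    As a quick sanity check on the synthetic data (a sketch; the exact duplicate count can vary slightly because Faker may also produce coincidental name collisions):

    print(len(df2))                                         # 7000 rows
    print(df2.duplicated(['FirstName', 'LastName']).sum())  # at least 2000 duplicate rows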
    

    Running the timing on this larger data set gives:

    %timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
    100 loops, best of 3: 5.22 ms per loop
    
    %timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
    100 loops, best of 3: 3.1 ms per loop
    

    You may want to time both approaches on your own data to determine which one works best at its size.
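
    If you do compare them, a quick way to confirm that both methods induce the same grouping (the numeric ids themselves may be numbered differently) is a check along these lines; the id_cat and id_grp column names are just illustrative:

    # Hypothetical column names for the two id assignments being compared.
    df2['id_cat'] = (df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes
    df2['id_grp'] = df2.groupby(['LastName', 'FirstName']).ngroup()

    # The labelings are equivalent if each id in one maps to exactly one id in
    # the other, i.e. both define the same partition of the rows.
    same_partition = (
        df2.groupby('id_cat')['id_grp'].nunique().eq(1).all()
        and df2.groupby('id_grp')['id_cat'].nunique().eq(1).all()
    )
    print(same_partition)  # expected: True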
