Q: [Pandas] How to efficiently assign unique ID to individuals with multiple entries based on name in very large df

粉色の甜心 2020-12-02 17:45

I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries.

3 Answers
  • 2020-12-02 17:59

    This method allows the 'id' column name to be defined with a variable. I also find it a little easier to read than the assign or groupby methods.

    import pandas as pd

    # Create the example DataFrame
    df = pd.DataFrame({
        'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
        'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

    newIdName = 'id'   # Set the new column name here.

    df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes
    

    Output:

    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
    
  • 2020-12-02 18:04

    This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column:

    df['id'] = df.groupby(['LastName','FirstName']).ngroup()
    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
    

    I checked timings and, for the small dataset in this example, Alexander's answer is faster:

    %timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
    1000 loops, best of 3: 848 µs per loop
    
    %timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
    1000 loops, best of 3: 1.22 ms per loop
    

    However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates.

    import faker
    import pandas as pd

    fakenames = faker.Faker()
    first = [fakenames.first_name() for _ in range(5000)]
    last = [fakenames.last_name() for _ in range(5000)]
    df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
    df2 = pd.concat([df2, df2.iloc[:2000]])
    

    Running the timing on this larger data set gives:

    %timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
    100 loops, best of 3: 5.22 ms per loop
    
    %timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
    100 loops, best of 3: 3.1 ms per loop
    

    You may want to test both approaches on your data set to determine which one works best given the size of your data.
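    For readers running plain Python rather than IPython, the comparison above can be reproduced with the standard timeit module. This is a minimal sketch that uses a small synthetic frame of repeated names in place of the faker data; the names and row counts here are purely illustrative:

    ```python
    import timeit

    import pandas as pd

    # Synthetic stand-in for the faker data: 3000 rows, 3 unique names.
    names = ['Tom Jones', 'David Smith', 'Alex Thompson'] * 1000
    first, last = zip(*(n.split() for n in names))
    df2 = pd.DataFrame({'FirstName': first, 'LastName': last})

    # Time each approach over 100 runs.
    cat_time = timeit.timeit(
        lambda: df2.assign(
            id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes),
        number=100)
    grp_time = timeit.timeit(
        lambda: df2.assign(id=df2.groupby(['LastName', 'FirstName']).ngroup()),
        number=100)

    print(f'category: {cat_time:.3f}s  groupby: {grp_time:.3f}s')
    ```

    Absolute timings will vary by machine and pandas version, which is why testing on your own data is the safest guide.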

  • 2020-12-02 18:20

    You could join the last name and first name, convert it to a category, and then get the codes.

    Of course, multiple people with the same name would have the same id.

    df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
    >>> df
      FirstName  LastName  id
    0       Tom     Jones   0
    1       Tom     Jones   0
    2     David     Smith   1
    3      Alex  Thompson   2
    4      Alex  Thompson   2
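    A closely related sketch (not part of the original answer): pd.factorize builds the same kind of integer ids, but numbers keys in order of first appearance, whereas .cat.codes numbers them in sorted category order. In this example the two orders happen to coincide:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
        'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })

    # factorize returns integer codes plus the unique keys, in first-appearance order.
    codes, uniques = pd.factorize(df['LastName'] + '_' + df['FirstName'])
    df['id'] = codes
    print(df)
    ```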
    