Create Pandas DataFrames from Unique Values in one Column

难免孤独 2020-12-16 08:18

I have a Pandas dataframe with thousands of rows. Its Names column holds the customer names alongside their records. I want to create individual dataframes, one for each unique customer name.

3 Answers
  • 2020-12-16 08:56

    Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.

    To be able to call each dataframe later by name, try storing them in a dictionary:

    df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}
    
    df_dict['Name3']
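
For a self-contained illustration, here is a minimal sketch of the dictionary approach. The sample rows and the 'amount' column are made up for the example, and customerNames is assumed to be the list of unique names from the question.

```python
import pandas as pd

# Hypothetical sample data; the real dataframe would have many more rows.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name3', 'Name1', 'Name3'],
    'amount': [10, 20, 30, 40, 50],
})

# Assumption: customerNames holds the unique values of the name column.
customerNames = df['customer name'].unique()

# One dataframe per customer, keyed by the customer's name.
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

# Look up a single customer's rows by name.
print(df_dict['Name3'])
```

Because the loop variable is never reassigned, every customer's dataframe stays reachable through its key.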
    
  • 2020-12-16 09:02

    To create a dataframe for all the unique values in a column, create a dict of dataframes, as follows.

    • Creates a dict, where each key is a unique value from the column of choice and the value is a dataframe.
    • Access each dataframe as you would a standard dict (e.g. df_names['Name1'])
    • .groupby() returns a GroupBy object; iterating over it yields (key, group) pairs, which can be unpacked.
      • k is the unique values in the column and v is the data associated with each k.

    With a for-loop and .groupby:

    df_names = dict()
    for k, v in df.groupby('customer name'):
        df_names[k] = v
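
The loop above can be tried on a small frame. This is only a sketch: the sample rows and the 'amount' column are invented for illustration. It also shows that each sub-frame keeps the index labels of the original rows, which can be renumbered with reset_index if that is preferred.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'customer name': ['Name2', 'Name1', 'Name2', 'Name3'],
    'amount': [5, 6, 7, 8],
})

# k is a unique customer name, v is the sub-frame holding that customer's rows.
df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v

print(sorted(df_names))   # the keys are the unique column values
print(df_names['Name2'])  # rows keep their original index labels

# Optional: renumber a sub-frame from 0.
name2_renumbered = df_names['Name2'].reset_index(drop=True)
```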
    

    With a Python Dictionary Comprehension

    • PEP 274 -- Dict Comprehensions

    Using .groupby

    df_names = {k: v for (k, v) in df.groupby('customer name')}
    
    • This comes from a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
      • With 6 unique values in the column, .groupby is faster, at 104 ms compared to 392 ms
      • With 26 unique values in the column, .groupby is faster, at 147 ms compared to 1.53 s.
    • Using a for-loop is slightly faster than a comprehension, particularly with more unique column values or many rows (e.g. 10M).

    Using .unique:

    • Use Boolean indexing to match the unique values in the column of choice.
    df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}
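
One difference between the two approaches shows up when the column has missing values. The sketch below uses invented sample data: .groupby skips missing keys by default, while .unique() reports NaN as a value, and comparing against NaN matches no rows, so that key maps to an empty dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing customer name.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', np.nan, 'Name1'],
    'amount': [1, 2, 3, 4],
})

by_groupby = {k: v for k, v in df.groupby('customer name')}
by_unique = {name: df[df['customer name'] == name]
             for name in df['customer name'].unique()}

# groupby drops the missing key by default, so only the real names appear.
print(list(by_groupby))

# unique() includes NaN, so this dict has one extra key with an empty frame.
print(len(by_unique))

# For the real names, both approaches return the same rows.
print(by_groupby['Name1'].equals(by_unique['Name1']))
```

If the rows with a missing name matter, they can be selected separately with df[df['customer name'].isna()].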
    

    Testing

    • The following data was used for testing
    import pandas as pd
    import string
    import random
    
    random.seed(365)
    
    # Test case 1: 6 unique values
    data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
            'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}
    
    # Test case 2: 26 unique values (this replaces the data above, so run only one of the two definitions)
    data = {'class': [random.choice(list(string.ascii_lowercase)) for _ in range(1000000)],
            'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}
    
    df = pd.DataFrame(data)
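
A rough way to reproduce the comparison is sketched below with the standard timeit module. It is not the original benchmark: the row count is cut to 100,000 (an assumption, to keep the run short), and the helper names with_groupby and with_unique are made up for this example. Absolute timings depend on the machine and the pandas version.

```python
import random
import string
import timeit

import pandas as pd

random.seed(365)

# Assumption: fewer rows than the 1,000,000 used in the test data.
n_rows = 100_000
df = pd.DataFrame({
    'class': [random.choice(string.ascii_lowercase) for _ in range(n_rows)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)],
})


def with_groupby():
    """Build the dict of dataframes from a single groupby pass."""
    return {k: v for k, v in df.groupby('class')}


def with_unique():
    """Build the dict with one Boolean mask per unique value."""
    return {name: df[df['class'] == name] for name in df['class'].unique()}


# Take the best of 3 repeats, each averaging 5 calls.
t_groupby = min(timeit.repeat(with_groupby, number=5, repeat=3)) / 5
t_unique = min(timeit.repeat(with_unique, number=5, repeat=3)) / 5

print(f'groupby: {t_groupby * 1000:.1f} ms per call')
print(f'unique:  {t_unique * 1000:.1f} ms per call')
```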
    
  • 2020-12-16 09:05

    Maybe I have misunderstood you, but if

    for x in customerNames:
        x = DataFrame.loc[DataFrame['customer name'] == x]
    x
    

    gives you the right output only for the last list entry, that is because the final x sits outside the loop body, so it is evaluated once, after the loop has finished.

    import pandas as pd
    
    # DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame.
    customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                            orient='index', columns=['customer', 'country'])
    
    customer_list = ['James', 'Jean']
    
    for x in customer_list:
        x = customer_df.loc[customer_df['customer'] == x]
        print(x)
        print('now I could append the data to something new')
    

    You get the output:

      customer country
    B    James     USA
    now I could append the data to something new
      customer country
    A     Jean  France
    now I could append the data to something new
    

    Or, if you don't like loops, you could use isin:

    import pandas as pd
    
    # DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame.
    customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                            orient='index', columns=['customer', 'country'])
    
    customer_list = ['James', 'Jean']
    
    
    print(customer_df[customer_df['customer'].isin(customer_list)])
    

    Output:

      customer country
    A     Jean  France
    B    James     USA
    

    isin is explained in more detail under: How to implement 'in' and 'not in' for Pandas dataframe
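
As a short sketch of the 'in' and 'not in' idea with the same sample frame: the Boolean mask from isin selects the listed customers, and negating it with ~ selects everyone else. The variable names mask, in_list and not_in_list are chosen for this example only.

```python
import pandas as pd

customer_df = pd.DataFrame.from_dict(
    {'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
    orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']

# True for rows whose customer appears in customer_list.
mask = customer_df['customer'].isin(customer_list)

in_list = customer_df[mask]       # the 'in' case
not_in_list = customer_df[~mask]  # the 'not in' case

print(in_list)
print(not_in_list)
```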
