Create Pandas DataFrame from txt file with specific pattern

北海茫月 2020-11-22 09:04

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Aubu         


        
6 answers
  •  北荒 (original poster)
    2020-11-22 09:16

    Assuming you have the following DF:

    In [73]: df
    Out[73]:
                                                     text
    0                                       Alabama[edit]
    1                       Auburn (Auburn University)[1]
    2              Florence (University of North Alabama)
    3     Jacksonville (Jacksonville State University)[2]
    4          Livingston (University of West Alabama)[2]
    5            Montevallo (University of Montevallo)[2]
    6                           Troy (Troy University)[2]
    7   Tuscaloosa (University of Alabama, Stillman Co...
    8                   Tuskegee (Tuskegee University)[5]
    9                                        Alaska[edit]
    10      Fairbanks (University of Alaska Fairbanks)[2]
    11                                      Arizona[edit]
    12         Flagstaff (Northern Arizona University)[6]
    13                   Tempe (Arizona State University)
    14                     Tucson (University of Arizona)
    15                                     Arkansas[edit]
    
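If the data is still in the text file, one way to get to a single-column DataFrame like the one above is to read the file line by line. This is only a sketch: the helper name `load_lines` is hypothetical, and an in-memory `io.StringIO` stands in for the real file, whose name is not given in the question.

```python
import io

import pandas as pd

# Stand-in for the real text file (assumption: one entry per line).
# With an actual file, pass the handle returned by open(...) instead.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)


def load_lines(handle):
    """Return a DataFrame with one 'text' column, one row per non-empty line."""
    rows = [line.strip() for line in handle if line.strip()]
    return pd.DataFrame({"text": rows})


df = load_lines(sample)
print(df)
```

Reading the lines in plain Python avoids relying on a particular `read_csv` separator, which matters here because the lines themselves contain commas.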

    You can use the Series.str.extract() method:

    In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
    
    In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
    
    In [120]: df.State = df.State.ffill()
    
    In [121]: df
    Out[121]:
                                                     text     State   Region Name
    0                                       Alabama[edit]   Alabama           NaN
    1                       Auburn (Auburn University)[1]   Alabama        Auburn
    2              Florence (University of North Alabama)   Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]   Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]   Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]   Alabama    Montevallo
    6                           Troy (Troy University)[2]   Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...   Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]   Alabama      Tuskegee
    9                                        Alaska[edit]    Alaska           NaN
    10      Fairbanks (University of Alaska Fairbanks)[2]    Alaska     Fairbanks
    11                                      Arizona[edit]   Arizona           NaN
    12         Flagstaff (Northern Arizona University)[6]   Arizona     Flagstaff
    13                   Tempe (Arizona State University)   Arizona         Tempe
    14                     Tucson (University of Arizona)   Arizona        Tucson
    15                                     Arkansas[edit]  Arkansas           NaN
    
    In [122]: df = df.dropna()
    
    In [123]: df
    Out[123]:
                                                     text    State   Region Name
    1                       Auburn (Auburn University)[1]  Alabama        Auburn
    2              Florence (University of North Alabama)  Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]  Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]  Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]  Alabama    Montevallo
    6                           Troy (Troy University)[2]  Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...  Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]  Alabama      Tuskegee
    10      Fairbanks (University of Alaska Fairbanks)[2]   Alaska     Fairbanks
    12         Flagstaff (Northern Arizona University)[6]  Arizona     Flagstaff
    13                   Tempe (Arizona State University)  Arizona         Tempe
    14                     Tucson (University of Arizona)  Arizona        Tucson
    
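The steps in this answer can also be collected into one function. The following is a minimal sketch using the same two regular expressions. The function name `split_states_and_regions` is hypothetical, and the sample data is a shortened copy of the rows shown above.

```python
import pandas as pd


def split_states_and_regions(df):
    """Derive 'State' and 'Region Name' columns from a single 'text' column."""
    out = df.copy()
    # Rows containing the literal marker "[edit]" are state headers.
    is_state = out["text"].str.contains("[edit]", regex=False)
    # State name: everything before "[edit]" (NaN on non-header rows).
    out["State"] = out.loc[is_state, "text"].str.extract(
        r"(.*?)\[edit\]", expand=False
    )
    # Region name: everything before the first "(" or "[" on the other rows.
    out["Region Name"] = out.loc[~is_state, "text"].str.extract(
        r"(.*?)\s*[\(\[]+.*[\n]*", expand=False
    )
    # Carry each state name down to the region rows beneath it.
    out["State"] = out["State"].ffill()
    # Header rows have no region name, so drop them.
    out = out.dropna(subset=["Region Name"])
    return out[["State", "Region Name"]].reset_index(drop=True)


df = pd.DataFrame(
    {
        "text": [
            "Alabama[edit]",
            "Auburn (Auburn University)[1]",
            "Florence (University of North Alabama)",
            "Alaska[edit]",
            "Fairbanks (University of Alaska Fairbanks)[2]",
        ]
    }
)
result = split_states_and_regions(df)
print(result)
```

Unlike the plain `df.dropna()` above, this sketch passes `subset=["Region Name"]`, so only the state header rows are dropped. Note that a region line containing neither "(" nor "[" would not match the second pattern and would be dropped as well.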
