Create Pandas DataFrame from txt file with specific pattern

北海茫月 2020-11-22 09:04

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Aubu         


        
6 answers
  •  故里飘歌
    2020-11-22 09:17

    TL;DR
    s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]


    regex = r'(?P<State>.*?)\[edit\]'  # named group "State" captures the text before "[edit]"
    print(s.groupby(
        # extract yields NaN on lines without "[edit]";
        # forward fill replaces each NaN with the most
        # recent state name seen on an "[edit]" line
        s.str.extract(regex, expand=False).ffill()
    ).apply(
        # each group still starts with its own "State[edit]" header line,
        # so drop the first row of every group:
        # tail(n=-1) returns all rows except the first
        pd.Series.tail, n=-1
    ).reset_index(
        # turn the (State, original position) index into columns
        # and name the values column
        name='Region_Name'
    )[['State', 'Region_Name']])
    
          State                                        Region_Name
    0   Alabama                      Auburn (Auburn University)[1]
    1   Alabama             Florence (University of North Alabama)
    2   Alabama    Jacksonville (Jacksonville State University)[2]
    3   Alabama         Livingston (University of West Alabama)[2]
    4   Alabama           Montevallo (University of Montevallo)[2]
    5   Alabama                          Troy (Troy University)[2]
    6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
    7   Alabama                  Tuskegee (Tuskegee University)[5]
    8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
    9   Arizona         Flagstaff (Northern Arizona University)[6]
    10  Arizona                   Tempe (Arizona State University)
    11  Arizona                     Tucson (University of Arizona)
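
The chained expression above can be easier to follow when split into steps. The following is a minimal sketch, not the answer's exact method: it uses a boolean mask instead of `groupby`/`tail` to drop the header lines, and the shortened sample text and the variable names (`state`, `filled`, `is_region`, `df`) are assumptions made for illustration.

```python
import pandas as pd

# Shortened, illustrative sample in the same "State[edit]" layout
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]"""

s = pd.Series(txt.splitlines())

# Step 1: state name on "[edit]" header lines, NaN on every other line
state = s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False)

# Step 2: carry the most recent state name forward onto the region lines
filled = state.ffill()

# Step 3: keep only the region lines (those that were NaN in step 1)
is_region = state.isna()
df = pd.DataFrame({
    'State': filled[is_region],
    'Region_Name': s[is_region],
}).reset_index(drop=True)

print(df)
```

A state with no region lines under it (here the trailing `Arizona[edit]`) contributes no rows, which matches how `Arkansas` is absent from the output above.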
    

    setup

    from io import StringIO

    import pandas as pd

    txt = """Alabama[edit]
    Auburn (Auburn University)[1]
    Florence (University of North Alabama)
    Jacksonville (Jacksonville State University)[2]
    Livingston (University of West Alabama)[2]
    Montevallo (University of Montevallo)[2]
    Troy (Troy University)[2]
    Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
    Tuskegee (Tuskegee University)[5]
    Alaska[edit]
    Fairbanks (University of Alaska Fairbanks)[2]
    Arizona[edit]
    Flagstaff (Northern Arizona University)[6]
    Tempe (Arizona State University)
    Tucson (University of Arizona)
    Arkansas[edit]"""
    
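    # the squeeze=True argument was removed from read_csv in pandas 2.0;
    # DataFrame.squeeze('columns') turns the single column into a Series
    s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')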
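
When the data lives in an actual text file rather than a string, a plain loop may be simpler than going through `read_csv`. This is only a sketch of an alternative approach, not part of the answer above: `parse_states` is a hypothetical helper name, and it assumes every state header line ends with the literal `[edit]`.

```python
import pandas as pd


def parse_states(lines):
    """Build a State/Region_Name DataFrame from an iterable of text lines.

    Hypothetical helper: assumes each state header ends with "[edit]"
    and every other non-empty line is a region belonging to that state.
    """
    marker = '[edit]'
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(marker):
            state = line[:-len(marker)]  # remember the current state
        else:
            rows.append((state, line))   # region line under that state
    return pd.DataFrame(rows, columns=['State', 'Region_Name'])


sample = [
    'Alabama[edit]\n',
    'Auburn (Auburn University)[1]\n',
    'Alaska[edit]\n',
    'Fairbanks (University of Alaska Fairbanks)[2]\n',
    'Arkansas[edit]\n',
]
result = parse_states(sample)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly in place of `sample`.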
    
