Pandas split on regex

深忆病人 2020-12-19 05:57

I have a pandas DataFrame with a column containing comma-delimited characteristics, like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide         


        
3 answers
  •  清歌不尽
    2020-12-19 06:35

    I would first build the list of values and then feed it into a DataFrame, like so:

    import re
    import pandas as pd
    
    junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""
    
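    # Split on a parenthesized group (which is dropped) or on a comma followed by whitespace.
    rx = re.compile(r'\([^()]+\)|,(\s+)')
    
    # re.split also returns the captured whitespace (or None when the
    # parenthesis branch matched), so filter out None and empty strings.
    data = [x 
            for nugget in rx.split(junk) if nugget
            for x in [nugget.strip()] if x]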
    
    df = pd.DataFrame({'incident_characteristics': data})
    print(df)
    

    This yields

                                incident_characteristics
    0                             Shot - Wounded/Injured
    1                                        Shot - Dead
    2                                  Suicide - Attempt
    3                                     Murder/Suicide
    4                           Attempted Murder/Suicide
    5                         Institution/Group/Business
    6                                        Mass Murder
    7  Mass Shooting (4+ victims injured or killed ex...
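
Since the question starts from an existing DataFrame column rather than a single string, the same pattern can also be applied row by row. The following is only a sketch: the two-row frame and the helper name `split_characteristics` are made up for illustration, and `Series.explode` requires pandas 0.25 or newer.

```python
import re
import pandas as pd

# Same pattern as in the answer: a parenthesized group, or a comma plus whitespace.
rx = re.compile(r'\([^()]+\)|,(\s+)')

def split_characteristics(text):
    """Split one cell into a list of cleaned characteristics (hypothetical helper)."""
    # re.split also returns the captured whitespace (or None), so filter those out.
    return [part.strip() for part in rx.split(text) if part and part.strip()]

# Hypothetical two-row frame; the column name follows the answer.
df = pd.DataFrame({
    'incident_characteristics': [
        'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt',
        'Murder/Suicide, Institution/Group/Business',
    ]
})

# One list per row, then one row per list element; the original row index is kept.
exploded = (
    df['incident_characteristics']
    .apply(split_characteristics)
    .explode()
)
print(exploded)
```

Because `explode` repeats the original index, each characteristic can still be traced back to the row it came from.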
    

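    Note that the pattern removes parenthesized text entirely, so commas inside parentheses never act as separators. An unclosed parenthesis, as in the last (truncated) item, is left untouched.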
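
If the parenthesized details should be kept rather than dropped, one alternative (not the approach used in the answer) is to split only on commas that are not followed by a closing parenthesis. A minimal sketch, assuming parentheses are never nested; the name `outside_commas` is illustrative:

```python
import re

# Assumption: parentheses are never nested. The negative lookahead rejects any
# comma that is followed by ")" with no "(" or ")" in between, i.e. a comma
# sitting inside a parenthesized group.
outside_commas = re.compile(r',\s*(?![^()]*\))')

text = ("Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), "
        "Suicide - Attempt, Murder/Suicide")

# Split on the remaining commas and drop surrounding whitespace and empty pieces.
parts = [p.strip() for p in outside_commas.split(text) if p.strip()]
print(parts)
```

With this pattern an unclosed parenthesis gets no protection: commas after it are treated as ordinary separators.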
