Pandas split on regex

后端 未结 3 382
深忆病人
深忆病人 2020-12-19 05:57

I have pandas df with a column containing comma-delimited characteristics like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide         


        
相关标签:
3条回答
  • 2020-12-19 06:13

    Yes, split supports regex. According to your requirements,

    split based on a regex match of a comma followed by a space and a capital letter

    you may use

    df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)
    

    See the regex demo.

    Details

    • \s*,\s* - a comma enclosed with 0+ whitespaces
    • (?=[A-Z]) - only if followed with an uppercase ASCII letter

    However, it seems you also don't want to match the comma inside parentheses, add (?![^()]*\)) lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) and then a ):

    r'\s*,\s*(?=[A-Z])(?![^()]*\))'
    

    and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside).

    See another regex demo.

    0 讨论(0)
  • 2020-12-19 06:14

    You can try .str.extractall (but I think there are better patterns than mine).

    import pandas as pd
    
    txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
    df = pd.DataFrame({'incident_characteristics': [txt]})
    df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]
    

    Output:

    #    match
    # 0  0                                   Shot - Wounded/Injured
    #    1                Shot - Dead (murder, accidental, suicide)
    #    2                                        Suicide - Attempt
    #    3                                           Murder/Suicide
    #    4         Attempted Murder/Suicide (one variable unsucc...
    #    5                               Institution/Group/Business
    #    6         Mass Murder (4+ deceased victims excluding th...
    #    7         Mass Shooting (4+ victims injured or killed e...
    # Name: 0, dtype: object
    

    If you use .str.split, the first letter will be removed as it is used as a part of delimiter.

    df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)
    

    Output:

    #                         0                                         1  \
    # 0  Shot - Wounded/Injured  hot - Dead (murder, accidental, suicide)
    #                   2              3  \
    # 0  uicide - Attempt  urder/Suicide
    #                                                    4  \
    # 0  ttempted Murder/Suicide (one variable unsucces...
    #                            5  \
    # 0  nstitution/Group/Business
    #                                                    6  \
    # 0  ass Murder (4+ deceased victims excluding the ...
    #                                                    7
    # 0  ass Shooting (4+ victims injured or killed exc...
    
    0 讨论(0)
  • 2020-12-19 06:35

    I would first create the data and then feed it into a dataframe, like so

    import pandas as pd, re
    
    junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""
    
    rx = re.compile(r'\([^()]+\)|,(\s+)')
    
    data = [x 
            for nugget in rx.split(junk) if nugget
            for x in [nugget.strip()] if x]
    
    df = pd.DataFrame({'incident_characteristics': data})
    print(df)
    

    This yields

                                incident_characteristics
    0                             Shot - Wounded/Injured
    1                                        Shot - Dead
    2                                  Suicide - Attempt
    3                                     Murder/Suicide
    4                           Attempted Murder/Suicide
    5                         Institution/Group/Business
    6                                        Mass Murder
    7  Mass Shooting (4+ victims injured or killed ex...
    

    Additionally, this assumes that commas in parentheses should be ignored when splitting.

    0 讨论(0)
提交回复
热议问题