Expected behavior with regular expressions with capturing-groups in pandas' `str.extract()`

离开以前 2020-12-07 04:14

I'm trying to get a grasp on regular expressions, and I came across the one used inside the str.extract method:

movies['year']=movies

2 Answers
  • 2020-12-07 05:03

    Try using this:

    movies['year'] = movies['title'].str.extract(r'.*\((\d{4})\).*', expand=False)

    • Set expand=True if you want a DataFrame back. With more than one capturing group, a DataFrame is returned regardless of expand.
    • A year is always composed of 4 digits, so the regex \((\d{4})\) matches any four-digit year enclosed in parentheses and captures only the digits.
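
The effect of expand can be seen in a minimal sketch. The sample titles below are hypothetical, since the original DataFrame is not shown in the question, and the pattern drops the surrounding .* because str.extract takes the first match found anywhere in the string:

```python
import pandas as pd

# Hypothetical sample data; the question's real DataFrame is not shown.
movies = pd.DataFrame({"title": ["Toy Story (1995)", "Heat (1995)", "Untitled"]})

# One capturing group: four digits enclosed in literal parentheses.
pattern = r"\((\d{4})\)"

# expand=False with a single group returns a Series.
year_series = movies["title"].str.extract(pattern, expand=False)

# expand=True returns a DataFrame with one column per capturing group.
year_frame = movies["title"].str.extract(pattern, expand=True)

print(type(year_series).__name__)  # Series
print(type(year_frame).__name__)   # DataFrame
print(year_series.tolist())        # rows without a match become NaN
```

Note that the extracted years are strings, so a numeric conversion would still be needed before doing arithmetic on them.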
  • 2020-12-07 05:11

    First of all, the behavior of pandas' .str.extract() is as documented: it returns only the contents of the capturing groups. The pattern passed to extract requires at least one capturing group:

    pat : string
    Regular expression pattern with capturing groups

    If you use a named capturing group, the new column will be named after the named group.
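
As a small illustration of the named-group behavior, assuming a hypothetical Series of titles and a group name of `year` chosen for this sketch:

```python
import pandas as pd

# Hypothetical titles, used only for illustration.
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)"])

# (?P<year>...) is a named capturing group; the name becomes the column label.
named = titles.str.extract(r"\((?P<year>\d{4})\)", expand=True)

print(named.columns.tolist())  # ['year']
```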

    The grep command you provided can be reduced to

    grep '\((.*)\)'
    

    as grep can match part of a line (it does not require a full-line match) and works line by line: once a match is found, the whole line is printed. To print only the matched part instead, you may use the -o switch.

    With grep, you cannot return only the capturing group contents. This can be worked around with a PCRE pattern enabled by the -P option, but that option is not available on Mac, for example. sed or awk may help in those situations, too.
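
The three levels discussed here (whole line, matched text, capturing group) can be mimicked with Python's re module. This is only an analogy under stated assumptions, not a grep invocation: the sample line is hypothetical, and in Python's syntax `\(` is a literal parenthesis while `(.*)` is the group.

```python
import re

# Hypothetical input line, used only for illustration.
line = "Toy Story (1995) - animation"

# re.search, like grep, is satisfied by a match anywhere in the line.
m = re.search(r"\((.*)\)", line)

whole_line = line if m else None  # what plain grep would print
matched = m.group(0)              # what grep -o would print
captured = m.group(1)             # what str.extract returns

print(whole_line)  # Toy Story (1995) - animation
print(matched)     # (1995)
print(captured)    # 1995
```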
