How to extract specific content in a pandas dataframe with a regex?

Happy的楠姐 2020-12-07 23:35

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()

Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1         


        
3 Answers
  • 2020-12-07 23:55

    You can try str.extract and strip, but str.split is the better choice, because movie titles can contain numbers too. Another solution is to replace the content of the parentheses with a regex and then strip leading and trailing whitespace:

    # convert the column to string
    df['movie_title'] = df['movie_title'].astype(str)
    
    # note: this also removes numbers from movie titles
    df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
    # split on the first '(' and keep the part before it
    df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
    # remove the parenthesized content (the default changed to regex=False in pandas 2.0)
    df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
    print(df)
              movie_title      titles      titles1      titles2
    0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
    1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
    2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
    3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
    4      Copycat (1995)     Copycat      Copycat      Copycat
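
One caveat with the last pattern: it removes every parenthesized group, not just the year. The following is a minimal sketch, assuming each title ends with a four-digit year in parentheses; the sample rows and the `title_only` column name are invented for illustration. It anchors the pattern to the end of the string so other parentheses in the title survive:

```python
import pandas as pd

# Hypothetical sample data; the second row is made up to show a title
# that contains an extra pair of parentheses.
df = pd.DataFrame({"movie_title": ["Toy Story 2 (1995)",
                                   "Heat (Director's Cut) (1995)",
                                   "GoldenEye (1995)"]})

# Remove only a trailing "(YYYY)" together with the whitespace around it.
df["title_only"] = df["movie_title"].str.replace(
    r"\s*\(\d{4}\)\s*$", "", regex=True
)
print(df)
```

Because the whitespace is consumed by the pattern itself, no separate strip step is needed here.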
    
  • 2020-12-08 00:15

    Use a regular expression to find a year stored between parentheses. We include the parentheses in the pattern so that we don't match movies that have years in their titles:

    movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
    

    Removing the parentheses:

    movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
    

    Removing the years from the 'title' column:

    movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
    

    Applying the strip function to get rid of any leading or trailing whitespace left behind:

    movies_df['title'] = movies_df['title'].str.strip()
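
The steps above can be shortened slightly. This is a sketch rather than the answer's own code: it keeps the parentheses in the pattern but places the capture group around the digits only, so the second extract becomes unnecessary. The sample rows are invented for illustration, and the last one deliberately has no year.

```python
import pandas as pd

# Hypothetical sample data; "Untitled" has no year in parentheses.
movies_df = pd.DataFrame({"title": ["Toy Story (1995)",
                                    "2001: A Space Odyssey (1968)",
                                    "Untitled"]})

# Match "(YYYY)" but capture only the digits; rows without a match become NaN.
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

# Drop the parenthesized year, then trim leftover whitespace.
movies_df["title"] = (
    movies_df["title"]
    .str.replace(r"\(\d{4}\)", "", regex=True)
    .str.strip()
)
print(movies_df)
```

Note that the leading "2001" in the second title is left alone, because the pattern only matches four digits wrapped in parentheses.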
    
  • 2020-12-08 00:16

    Wrap the part of the text you want to capture in a group with (), as below:

    new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
    new_df['just_movie_titles']
    

    pandas.core.strings.StringMethods.extract

    StringMethods.extract(pat, flags=0, **kwargs)

    Find groups in each string using passed regular expression
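
To illustrate how groups map to the result, here is a small sketch on made-up rows. It assumes a reasonably recent pandas, where `str.extract` accepts an `expand` argument; the group names `title` and `year` are chosen for this example only.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"movie_title": ["Toy Story (1995)", "GoldenEye (1995)"]})

# A single capture group with expand=False returns a Series.
titles = df["movie_title"].str.extract(r"(.+?) \(", expand=False)

# Named groups return a DataFrame whose columns take the group names.
parts = df["movie_title"].str.extract(r"(?P<title>.+?) \((?P<year>\d{4})\)")

print(titles)
print(parts)
```

Only the text inside the parentheses of a group is returned; the literal ` \(` after the first group is matched but discarded.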
