How to extract specific content in a pandas dataframe with a regex?

Happy的楠姐 2020-12-07 23:35

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()

Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1         


        
3 answers
  •  北荒 (original poster)
    2020-12-07 23:55

    You can try str.extract followed by str.strip, but str.split is the better choice because movie titles can contain numbers too. A third option is to remove the content of the parentheses with a regex via str.replace, then strip leading and trailing whitespace:

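    # convert the column to strings
    df['movie_title'] = df['movie_title'].astype(str)

    # extract keeps only letters and spaces, so it also drops numbers in titles
    df['titles'] = df['movie_title'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
    # split on the first '(' and keep the part before it
    df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
    # remove anything in parentheses; regex=True must be passed explicitly in pandas 2.0+
    df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
    print(df)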
              movie_title      titles      titles1      titles2
    0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
    1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
    2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
    3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
    4      Copycat (1995)     Copycat      Copycat      Copycat
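
For anyone who wants to try the three approaches without the original data, here is a minimal self-contained sketch. The sample DataFrame is an assumption built from the titles shown in the output above, and the `year` column is a hypothetical extra that is not part of the answer; it only illustrates that str.extract can also capture the parenthesised year instead of discarding it.

```python
import pandas as pd

# Sample data assumed for illustration; it mirrors titles from the output above.
df = pd.DataFrame({
    "movie_title": [
        "Toy Story 2 (1995)",
        "GoldenEye (1995)",
        "Four Rooms (1995)",
    ]
})

titles = df["movie_title"].astype(str)

# Letters and spaces only: matching stops at the first digit, so the "2" is lost.
df["titles"] = titles.str.extract(r"([a-zA-Z ]+)", expand=False).str.strip()

# Keep everything before the first "(".
df["titles1"] = titles.str.split("(", n=1).str[0].str.strip()

# Delete every parenthesised group, then trim the leftover whitespace.
df["titles2"] = titles.str.replace(r"\([^)]*\)", "", regex=True).str.strip()

# Hypothetical extra: capture a trailing four-digit year as a string column.
df["year"] = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

print(df)
```

Note that the str.replace pattern removes every parenthesised group, so a title that legitimately contains parentheses would lose that part as well; anchoring the pattern to the end of the string, as the `year` pattern does, is one way to limit it to the trailing year.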
    
