Consider the following pandas dataframe:
In [114]:
df['movie_title'].head()
Out[114]:
0 Toy Story (1995)
1 GoldenEye (1995)
2 Four Rooms (1
You can try str.extract and strip, but str.split is better, because movie titles can contain numbers too. Another solution is to replace the content of the parentheses using a regex
and strip leading and trailing whitespace:
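# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)
# caution: this pattern also removes numbers from the movie names
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split once on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)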
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
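The approaches above start to differ once a title contains parentheses of its own. The following is a minimal sketch, assuming a few hypothetical sample titles (not taken from the original data), that compares splitting on the first "(" with a pattern anchored to a trailing four-digit year. The anchored pattern is an alternative suggestion, not one of the original answers.

```python
import pandas as pd

# Hypothetical sample titles, chosen to include a digit and a leading "(".
sample = pd.Series([
    "Toy Story 2 (1995)",
    "(500) Days of Summer (2009)",
    "Four Rooms (1995)",
])

# Split once on the first "(" and keep the left part.
# This returns an empty string when the title itself starts with "(".
by_split = sample.str.split("(", n=1).str[0].str.strip()

# Remove only a parenthesized four-digit year at the very end of the string,
# together with any whitespace around it.
by_anchor = sample.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(pd.DataFrame({"by_split": by_split, "by_anchor": by_anchor}))
```

Anchoring with `$` leaves parentheses elsewhere in the title untouched, at the cost of assuming the year always comes last.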
Use a regular expression to find a year stored between parentheses. We include the parentheses in the pattern so that we don't match years that appear in the movie titles themselves:
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
Removing the parentheses:
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
Removing the years from the 'title' column (regex=True is needed in pandas 2.0+, where str.replace matches literally by default):
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
Applying the strip function to get rid of any leading or trailing whitespace left behind:
movies_df['title'] = movies_df['title'].str.strip()
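The two extraction steps can also be collapsed into one by placing the capture group inside the escaped parentheses. A small sketch, assuming a hypothetical two-row stand-in for movies_df with a 'title' column:

```python
import pandas as pd

# Hypothetical stand-in for movies_df; the first title has a year in its name.
movies_df = pd.DataFrame(
    {"title": ["2001: A Space Odyssey (1968)", "Toy Story (1995)"]}
)

# The escaped parentheses must be present in the text but are not captured;
# only the four digits inside them end up in the 'year' column.
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)", expand=False)

# Drop the parenthesized year from the title and trim leftover whitespace.
movies_df["title"] = (
    movies_df["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
)

print(movies_df)
```

Because the pattern requires the surrounding parentheses, the "2001" at the start of the first title is left alone.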
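Wrap the part of the pattern you want to keep in parentheses (a capture group),
as below, so that only that part is returned. Passing expand=False returns a Series rather than a one-column DataFrame.
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']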
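To see what the capture group returns, here is a minimal sketch on hypothetical sample data, including one title with no parentheses at all. It also shows the effect of the expand argument.

```python
import pandas as pd

# Hypothetical sample data; the last row has no "(year)" suffix.
df = pd.DataFrame(
    {"movie_title": ["Toy Story (1995)", "GoldenEye (1995)", "No Year Here"]}
)

# Non-greedy group: capture as little as possible up to the first " (".
as_series = df["movie_title"].str.extract(r"(.+?) \(", expand=False)

# With the default expand=True, the result is a one-column DataFrame.
as_frame = df["movie_title"].str.extract(r"(.+?) \(")

print(as_series)
print(type(as_frame).__name__, as_frame.shape)
```

Rows where the pattern does not match come back as missing values, so titles without a year would need separate handling (for example with fillna).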
pandas.core.strings.StringMethods.extract
StringMethods.extract(pat, flags=0, **kwargs)
Find groups in each string using passed regular expression
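As an illustration of "groups" here: each capture group in the pattern becomes one column of the result, and named groups give the columns their names. A short sketch, with hypothetical sample titles and group names chosen for this example:

```python
import pandas as pd

# Hypothetical sample titles.
titles = pd.Series(["Toy Story (1995)", "GoldenEye (1995)"])

# Two named groups produce a DataFrame with columns 'title' and 'year'.
parts = titles.str.extract(r"(?P<title>.+?)\s*\((?P<year>\d{4})\)")

print(parts)
```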