Question
I have a single-column list of company names. Some of the names contain a country name (e.g., "China" in "China A1", "Finland" in "C1 in Finland"). I want to extract the country each company belongs to, based on the company name and a predefined list of country names.
The original DataFrame df looks like this:
Company name Country
0 China A1
1 Australia-A2
2 Belgium_C1
3 C1 in Finland
4 D1 of Greece
5 E2 for Pakistan
For now, I can only come up with an inefficient method. Here is my code:
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
for t in country_list:
    # Rows whose company name contains the country get that country assigned
    df.loc[df['Company name'].str.contains(t), 'Country'] = t
The result looks like this:
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
I suspect that when country_list contains a large number of elements (i.e., countries), the loop will be time-consuming. Is there a simpler way to tackle this problem?
Answer 1:
Here's one way using str.extract:
df['Country'] = df['Company name'].str.extract('('+'|'.join(country_list)+')')
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
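The joined pattern treats every name as a regular expression, so a name containing a metacharacter (a dot, parentheses) or a name that is a prefix of another could match unexpectedly. The following is a minimal sketch of one way to guard against that, assuming the sample data from the question; escaping with re.escape and sorting longest-first are additions that are not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"Company name": [
    "China A1", "Australia-A2", "Belgium_C1",
    "C1 in Finland", "D1 of Greece", "E2 for Pakistan",
]})
country_list = ['China', 'America', 'Greece', 'Pakistan', 'Finland',
                'Belgium', 'Japan', 'British', 'Australia']

# Try longer names first so a longer name wins over a shorter name
# that is its prefix, and escape each name so it is matched literally.
names = sorted(country_list, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(n) for n in names) + ")"

# expand=False returns a Series, which is assigned as a single column.
df["Country"] = df["Company name"].str.extract(pattern, expand=False)
print(df)
```

Rows that match none of the names get NaN in the Country column.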
Answer 2:
You need Series.str.extract() here:
pat = r'({})'.format('|'.join(country_list))
# pat-->'(China|America|Greece|Pakistan|Finland|Belgium|Japan|British|Australia)'
df['Country']=df['Company name'].str.extract(pat, expand=False)
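To illustrate what expand=False changes, here is a small sketch on a made-up Series (the sample strings and the shortened pattern are assumptions for illustration only). With a single capture group, the default expand=True gives a one-column DataFrame, while expand=False gives a Series.

```python
import pandas as pd

s = pd.Series(["China A1", "C1 in Finland", "No match here"])
pat = r"(China|Finland)"

as_frame = s.str.extract(pat)                 # expand=True is the default
as_series = s.str.extract(pat, expand=False)  # one group -> Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(as_series.isna().tolist())  # True marks rows with no match
```

A Series is the more natural thing to assign to a single new column, which is presumably why this answer passes expand=False.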
Answer 3:
You could also use findall, in case a cell contains more than one country name:
df["Company name"].str.findall('|'.join(country_list)).str[0]
Out[758]:
0 China
1 Australia
2 Belgium
3 Finland
4 Greece
5 Pakistan
Name: Company name, dtype: object
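Note that .str[0] keeps only the first match in each cell. The sketch below, on made-up company names (an assumption, not data from the question), shows what findall returns before that step and one possible way to keep every match.

```python
import pandas as pd

country_list = ['China', 'America', 'Greece', 'Pakistan', 'Finland',
                'Belgium', 'Japan', 'British', 'Australia']
s = pd.Series(["China A1", "Japan branch of Greece D1", "Unknown Co"],
              name="Company name")

matches = s.str.findall("|".join(country_list))  # list of all matches per row
first = matches.str[0]           # first match; NaN when the list is empty
joined = matches.str.join(", ")  # all matches combined into one string

print(matches.tolist())
print(first.tolist())
print(joined.tolist())
```

Whether to keep the first match, all matches, or a joined string depends on how multi-country names should be treated downstream.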
Answer 4:
Use str.extract with a regex.
Example:
import pandas as pd
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
df = pd.read_csv(filename)  # filename: path to your CSV file
df["Country"] = df["Company_name"].str.extract("("+"|".join(country_list)+ ")")
print(df)
Output:
Company_name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
Source: https://stackoverflow.com/questions/56278553/how-to-test-string-contains-elements-in-list-and-assign-the-target-element-to-an