Question
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
This is my text. I need to create a DataFrame with one column for the state name and another column for the town name. I know how to remove the university names, but how do I tell pandas that every [edit] line starts a new state?
Expected output DataFrame:
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alaska Fairbanks
Arizona Flagstaff
Arizona Tempe
Arizona Tucson
I am not sure whether I can use read_table; if I can, how? I did import everything into a DataFrame, but the state and the city end up in the same column. I also tried with a list, but the problem is still the same.
I need something that works like this: if a line contains [edit], that line names the state for all the lines after it, up to the next [edit] line.
Answer 1:
Maybe pandas can do it, but you can do it easily in plain Python.
data = '''Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)'''
# ---
result = []
state = None
for line in data.split('\n'):
    if line.endswith('[edit]'):
        # remember new state
        state = line[:-6]  # without `[edit]`
    else:
        # add state, city to result; the town is everything before ' ('
        city = line.split(' (', 1)[0]
        result.append([state, city])
# --- display ---
for state, city in result:
    print(state, city)
If you read from a file:
result = []
state = None
with open('your_file') as f:
    for line in f:
        line = line.strip()  # remove '\n'
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # remember new state
            state = line[:-6]  # without `[edit]`
        else:
            # add state, city to result; the town is everything before ' ('
            city = line.split(' (', 1)[0]
            result.append([state, city])
# --- display ---
for state, city in result:
    print(state, city)
Now you can use result to create a DataFrame.
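As a minimal sketch of that last step, assuming result holds [state, city] pairs in the shape the loop above builds. The sample values and the column names state and town are chosen here for illustration only:

```python
import pandas as pd

# Sample pairs in the same shape the loop above produces (illustrative values).
result = [
    ['Alabama', 'Auburn'],
    ['Alabama', 'Florence'],
    ['Alaska', 'Fairbanks'],
]

# Each inner list becomes one row; the column names are an assumption.
df = pd.DataFrame(result, columns=['state', 'town'])
print(df)
```

Passing columns= names the two positions of each pair, so no further reshaping should be needed.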
Answer 2:
Using Pandas, you could do the following:
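import pandas as pd
# Read one line per row. Note: newer pandas releases may reject sep='\n'; in that case
# build the frame from the lines instead, e.g. pd.DataFrame({'town': open('data').read().splitlines()})
df = pd.read_table('data', sep='\n', header=None, names=['town'])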
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()
df['index'] = df.groupby('groupno').cumcount()
df['state'] = df.groupby('groupno')['town'].transform('first')
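# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)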
df = df.loc[~df['is_state']]
df = df[['state','town']]
which yields
     state          town
1  Alabama        Auburn
2  Alabama      Florence
3  Alabama  Jacksonville
5   Alaska     Fairbanks
7  Arizona     Flagstaff
8  Arizona         Tempe
9  Arizona        Tucson
Here is a breakdown of what the code is doing. After loading the text file into a DataFrame, use str.contains to identify the rows which are states. Use cumsum to take a cumulative sum of the True/False values, where True is treated as 1 and False as 0, so that every row gets the number of the state it belongs to.
df = pd.read_table('data', sep='\n', header=None, names=['town'])
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()
#                                               town  is_state  groupno
# 0                                    Alabama[edit]      True        1
# 1                    Auburn (Auburn University)[1]     False        1
# 2           Florence (University of North Alabama)     False        1
# 3  Jacksonville (Jacksonville State University)[2]     False        1
# 4                                     Alaska[edit]      True        2
# 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2
# 6                                    Arizona[edit]      True        3
# 7       Flagstaff (Northern Arizona University)[6]     False        3
# 8                 Tempe (Arizona State University)     False        3
# 9                   Tucson (University of Arizona)     False        3
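To see why the cumulative sum works as a group label, here is a small self-contained sketch. The boolean values are made up for illustration and stand in for the is_state column:

```python
import pandas as pd

# True marks a header line (a state); False marks a town line.
is_state = pd.Series([True, False, False, True, False])

# Each True bumps the running total, so every row inherits the number
# of the most recent header above it.
groupno = is_state.cumsum()
print(groupno.tolist())  # [1, 1, 1, 2, 2]
```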
Now for each groupno number, we can assign a unique integer to each row in the group:
df['index'] = df.groupby('groupno').cumcount()
#                                               town  is_state  groupno  index
# 0                                    Alabama[edit]      True        1      0
# 1                    Auburn (Auburn University)[1]     False        1      1
# 2           Florence (University of North Alabama)     False        1      2
# 3  Jacksonville (Jacksonville State University)[2]     False        1      3
# 4                                     Alaska[edit]      True        2      0
# 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2      1
# 6                                    Arizona[edit]      True        3      0
# 7       Flagstaff (Northern Arizona University)[6]     False        3      1
# 8                 Tempe (Arizona State University)     False        3      2
# 9                   Tucson (University of Arizona)     False        3      3
Again for each groupno number, we can find the state by selecting the first town in each group:
df['state'] = df.groupby('groupno')['town'].transform('first')
#                                               town  is_state  groupno  index          state
# 0                                    Alabama[edit]      True        1      0  Alabama[edit]
# 1                    Auburn (Auburn University)[1]     False        1      1  Alabama[edit]
# 2           Florence (University of North Alabama)     False        1      2  Alabama[edit]
# 3  Jacksonville (Jacksonville State University)[2]     False        1      3  Alabama[edit]
# 4                                     Alaska[edit]      True        2      0   Alaska[edit]
# 5    Fairbanks (University of Alaska Fairbanks)[2]     False        2      1   Alaska[edit]
# 6                                    Arizona[edit]      True        3      0  Arizona[edit]
# 7       Flagstaff (Northern Arizona University)[6]     False        3      1  Arizona[edit]
# 8                 Tempe (Arizona State University)     False        3      2  Arizona[edit]
# 9                   Tucson (University of Arizona)     False        3      3  Arizona[edit]
We basically have the desired DataFrame; all that's left is to prettify the result.
We can remove the [edit] from the states and everything after the first parenthesis from the towns using str.replace:
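# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)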
Remove the rows where the town is actually a state:
df = df.loc[~df['is_state']]
And finally, keep only the desired columns:
df = df[['state','town']]
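As a hedged alternative sketch (not the method used above), the same result can be reached without the groupby by keeping the header text only on the [edit] rows and forward-filling it with ffill. The sample text is inlined and split with str.splitlines, an assumption made to keep the example self-contained:

```python
import pandas as pd

text = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

# One line per row; plain Python does the splitting, so no separator is needed.
df = pd.DataFrame({'town': text.splitlines()})

is_state = df['town'].str.endswith('[edit]')

# Keep the text only on header rows, then carry it down to the rows below.
df['state'] = df['town'].where(is_state).ffill()
df['state'] = df['state'].str.replace('[edit]', '', regex=False)

# Drop everything from the first " (" onward to leave just the town name.
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)

# Discard the header rows and keep the two columns of interest.
df = df.loc[~is_state, ['state', 'town']].reset_index(drop=True)
print(df)
```

Here where blanks out the non-header rows and ffill propagates the last seen state downward, which mirrors the "everything until the next [edit]" rule from the question.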
Source: https://stackoverflow.com/questions/40413380/read-table-in-pandas-how-to-get-input-from-text-to-a-dataframe