Question
I need to parse the text of a text file into two categories:
- University
- Location (for example: Lahore, Peshawar, Jamshoro, Faisalabad)
but the text file contains the following text:
"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
"London School of Economics"
"Lahore School of Economics, Lahore"
I have written code that separates locations on the basis of a comma. The code below only works for the first line of the file: it prints 'Lahore', and after that it gives the error 'list index out of range'.
file = open(path,'r')
content = file.read().split('\n')
for line in content:
    rep = line.replace('"','')
    loc = rep.split(',')[1]
    print "uni: "+rep
    print "Loc: "+str(loc)
Please help, I'm stuck on this. Thanks.
Answer 1:
It would appear that you can only be certain that a line has a location if there is a comma. So it would make sense to parse the file in two passes. The first pass can build a set holding all known locations. You can start this off with some known examples or problem cases.
Pass two could then also use the comma to match known locations, but if there is no comma, the line is split into a set of words. The intersection of these words with the location set should give you the location. If there is no intersection, the location is flagged as "unknown".
import re

locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0

    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit("," ,1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit("," ,1)
            loc = loc.strip()
        else:
            uni = line
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)
            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print "uni:", uni
        print "Loc:", loc

    print "Unknown locations:", unknown
Output would be:
uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
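One caveat with the approach above: `list(loc_matches)[0]` picks an arbitrary element when a line without a comma contains more than one known location. The following is a minimal Python 3 sketch of the same two-pass idea that instead takes the last matching word in the line. The function name `split_uni_loc`, the `seed_locations` parameter, and the choice to prefer the last match are assumptions made for illustration, not part of the original answer.

```python
import re


def split_uni_loc(lines, seed_locations=("London", "Faisalabad")):
    """Return (university, location) pairs using two passes over the lines."""
    # Drop surrounding quotes, spaces, stray commas and blank lines.
    cleaned = [line.strip(' ,"\n') for line in lines]
    cleaned = [line for line in cleaned if line]

    # Pass 1: text after the last comma is treated as a known location.
    locations = set(seed_locations)
    for line in cleaned:
        if ',' in line:
            loc = line.rsplit(',', 1)[1].strip()
            if loc:
                locations.add(loc)

    # Pass 2: split on the comma if present, otherwise search for a known location.
    results = []
    for line in cleaned:
        if ',' in line:
            uni, loc = line.rsplit(',', 1)
            results.append((uni.strip(), loc.strip()))
            continue
        # Keep the words in line order so the choice is deterministic.
        matches = [w for w in re.findall(r"\w+", line) if w in locations]
        loc = matches[-1] if matches else "<unknown location>"
        results.append((line, loc))
    return results


sample = [
    '"Imperial College of Business Studies, Lahore"',
    '"Government College University Faisalabad"',
    '"London School of Economics"',
]
for uni, loc in split_uni_loc(sample):
    print("uni:", uni)
    print("Loc:", loc)
```

As in the original answer, a multi-word location is only recognized when a comma precedes it, because the fallback compares single words against the set.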
Answer 2:
Your input file does not have commas on every line, causing the code to fail. You could do something like:
if ',' in line:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()
to handle the lines without comma differently, or simply reformat the input.
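To see how that branch behaves on the sample data, here is a small Python 3 sketch. The helper name `guess_location` is hypothetical, and it takes the text after the last comma rather than after the first. The last-word fallback is only a guess: for "London School of Economics" it yields "Economics", which is one reason reformatting the input may be the safer option.

```python
def guess_location(line):
    """Return the text after the last comma, or the last word if there is no comma."""
    rep = line.replace('"', '').strip()
    if ',' in rep:
        return rep.split(',')[-1].strip()
    # No comma: fall back to the last whitespace-separated word (only a guess).
    words = rep.split()
    return words[-1] if words else ''


print(guess_location('"University of Sindh, Jamshoro"'))
print(guess_location('"Government College University Faisalabad"'))
print(guess_location('"London School of Economics"'))
```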
Answer 3:
You can split on a comma; the result is always a list, so you can check its length. If the length is more than one, the line had at least one comma; otherwise (if the length is one) it had no comma.
>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
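Applied to the question's loop, the length check might look like the following Python 3 sketch. The function name `parse_line` and the use of `None` for a missing location are assumptions made for illustration.

```python
def parse_line(line):
    """Split a line into (university, location); location is None without a comma."""
    rep = line.replace('"', '').strip()
    parts = rep.split(',')
    if len(parts) > 1:
        # At least one comma: the last piece is the location.
        uni = ','.join(parts[:-1]).strip()
        loc = parts[-1].strip()
    else:
        # No comma: indexing parts[1] here would raise IndexError.
        uni, loc = rep, None
    return uni, loc


print(parse_line('"University of Peshawar, Peshawar"'))
print(parse_line('"London School of Economics"'))
```

Checking the length before indexing avoids the 'list index out of range' error from the question, at the cost of leaving the location undetermined for lines without a comma.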
Answer 4:
I hope this will work, although I couldn't get 'London'. Perhaps the data needs to be in a consistent format.
f_data = open('places.txt').readlines()
stop_words = ['school', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        city = p.split(',')[-1].strip()
    else:
        city = p.split(' ')[-1].strip()
    if city not in places and city not in stop_words:
        places.append(city)
print places
Output: [' Lahore', ' Faisalabad', 'Lahore', 'Peshawar', ' Jamshoro']
Source: https://stackoverflow.com/questions/32095824/split-text-in-text-file-on-the-basis-of-comma-and-space-python