Split text in text file on the basis of comma and space (python)

时光怂恿深爱的人放手 submitted on 2019-12-13 05:59:44

Question


I need to parse the text of a text file into two categories:

  1. University
  2. Location (example: Lahore, Peshawar, Jamshoro, Faisalabad)

but the text file contains the following text:

"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
"London School of Economics"
"Lahore School of Economics, Lahore"

I have written code that separates the location on the basis of the comma. The code below only works for the first line of the file and prints 'Lahore'; after that it gives the error 'list index out of range'.

file = open(path,'r')
content = file.read().split('\n')

for line in content:
    rep = line.replace('"','')
    loc = rep.split(',')[1]
    print "uni: "+rep
    print "Loc: "+str(loc)

Please help, I'm stuck on this. Thanks.


Answer 1:


It would appear that you can only be certain that a line has a location if there is a comma. So it would make sense to parse the file in two passes. The first pass can build a set holding all known locations. You can start this off with some known examples or problem cases.

The second pass still splits on the comma when there is one. If there is no comma, the line is split into a set of words, and the intersection of that set with the location set should give you the location. If there is no intersection, the location is flagged as "unknown".

import re

locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0
    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit("," ,1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit("," ,1)
            loc = loc.strip()
        else:
            uni = line
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print "uni:", uni
        print "Loc:", loc

    print "Unknown locations:", unknown

Output would be:

uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
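
As a rough sketch, the same two-pass idea can be written in Python 3 against an in-memory list of lines instead of a file. The function name `split_universities` and the `seed_locations` parameter are hypothetical, introduced only for illustration. One deliberate difference from the code above: when several known locations appear in a line without a comma, this sketch takes the last matching word rather than an arbitrary element of the set intersection, on the assumption that the city tends to come last.

```python
import re


def split_universities(lines, seed_locations=()):
    """Return (university, location) pairs using two passes over the lines."""
    # Drop surrounding quotes/whitespace and skip empty lines.
    cleaned = [ln.strip(' "\n') for ln in lines if ln.strip(' "\n')]

    # Pass 1: the text after the last comma is treated as a known location.
    locations = set(seed_locations)
    for line in cleaned:
        if ',' in line:
            locations.add(line.rsplit(',', 1)[1].strip())

    # Pass 2: split on the comma if present, otherwise search for a known location.
    results = []
    for line in cleaned:
        if ',' in line:
            uni, loc = (part.strip() for part in line.rsplit(',', 1))
        else:
            uni = line
            words = re.findall(r"\w+", line)
            matches = [w for w in words if w in locations]
            # Assumption: prefer the last matching word; None means "unknown".
            loc = matches[-1] if matches else None
        results.append((uni, loc))
    return results


sample = [
    '"Imperial College of Business Studies, Lahore"',
    '"Government College University Faisalabad"',
    '"London School of Economics"',
]
pairs = split_universities(sample, seed_locations=["London", "Faisalabad"])
for uni, loc in pairs:
    print("uni:", uni)
    print("Loc:", loc)
```

Returning `None` for an unknown location, rather than a placeholder string, leaves it to the caller to decide how to report or count such lines.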



Answer 2:


Your input file does not have commas on every line, causing the code to fail. You could do something like

if ',' in rep:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()

to handle the lines without a comma differently, or simply reformat the input.
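
A minimal Python 3 sketch of this branch applied to a whole line might look like the following. The helper name `split_line` is hypothetical, and the rule "the last word is the location" is an assumption that does not hold for every line in the sample data.

```python
def split_line(line):
    """Split one line into a (university, location) pair."""
    rep = line.replace('"', '').strip()
    if ',' in rep:
        # Text after the last comma is the location.
        uni, loc = rep.rsplit(',', 1)
    else:
        # Assumption: without a comma, the last word is the location.
        uni, _, loc = rep.rpartition(' ')
    return uni.strip(), loc.strip()


for text in ['"University of Sindh, Jamshoro"',
             '"Government College University Faisalabad"',
             '"London School of Economics"']:
    print(split_line(text))
```

Note that the last-word fallback returns 'Economics' as the location of "London School of Economics", so lines like that one still need a list of known locations or a cleaned-up input file.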




Answer 3:


You can split on the comma. The result is always a list, so you can check its length: if it is greater than one, the line contained at least one comma; if it is exactly one, there was no comma.

>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
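
Applied to the question, the length check could be sketched in Python 3 as below. The function name `location_or_none` is hypothetical, and returning `None` when there is no comma is just one possible choice.

```python
def location_or_none(line):
    """Return the text after the last comma, or None if the line has no comma."""
    parts = line.replace('"', '').split(',')
    if len(parts) > 1:
        return parts[-1].strip()
    # Only one element means no comma was found, so indexing [1] would fail.
    return None


for text in ['"University of Peshawar, Peshawar"', '"London School of Economics"']:
    print(location_or_none(text))
```

Checking the length before indexing is what avoids the 'list index out of range' error from the question.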



Answer 4:


I hope this works, although I couldn't get 'London' with it. Maybe the data needs to be more consistent.

f_data = open('places.txt').readlines()
stop_words = ['school', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        city = p.split(',')[-1].strip()
    else:
        city = p.split(' ')[-1].strip()
    if city not in places and city not in stop_words:
        places.append(city)
print places

Output: ['Lahore', 'Faisalabad', 'Peshawar', 'Jamshoro']



Source: https://stackoverflow.com/questions/32095824/split-text-in-text-file-on-the-basis-of-comma-and-space-python
