Question
I have a program that is supposed to take an external file, open it in Python, and separate every word and every punctuation mark, including commas, apostrophes and full stops. It should then save the file as the integer positions at which each word and punctuation mark occurs in the text.
For example: I like to code, because to code is fun. A computer's skeleton.
My program has to save this as:
1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14
(For those who do not follow: 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton)
This shows the position of every word; when a word repeats, the position of its first occurrence is used.
Sorry for the long explanation; here is my actual question. This is what I have so far:
file = open('newfiles.txt', 'r')
with open('newfiles.txt','r') as file:
    for line in file:
        for word in line.split():
            print(word)
And here is the result:
They
say
it's
a
dog's
life,.....
Unfortunately, splitting the file this way does not separate words from punctuation, and the output is not printed horizontally. It seems .split does not work for this. Does anyone know a more effective way to split the file so that words are separated from punctuation, and then store the separated words and punctuation together in a list?
Answer 1:
The built-in string method .split only handles simple delimiters. Without an argument, it splits on whitespace. For more complex splitting behavior, the easiest option is a regex:
>>> s = "I like to code, because to code is fun. A computer's skeleton."
>>> import re
>>> delim = re.compile(r"""\s|([,.;':"])""")
>>> tokens = filter(None, delim.split(s))
>>> idx = {}
>>> result = []
>>> i = 1
>>> for token in tokens:
...     if token in idx:
...         result.append(idx[token])
...     else:
...         result.append(i)
...         idx[token] = i
...         i += 1
...
>>> result
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
Also, going by your specification, I don't think you need to iterate over the file line by line. You can just do something like:
with open('my file.txt') as f:
    s = f.read()
This puts the entire file into s as a single string. Note that I did not call open before the with statement; that extra call is redundant, because the with statement opens the file itself.
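Putting the pieces together, a minimal sketch of the whole task might look like the following. It assumes the delimiter pattern from this answer; the function name encode_positions and the output file name positions.txt are made up for illustration, and a temporary directory stands in for the real files so the example runs on its own.

```python
import os
import re
import tempfile

# Same delimiter pattern as in the answer: split on whitespace, keep punctuation.
DELIM = re.compile(r"""\s|([,.;':"])""")

def encode_positions(text):
    """Return the first-occurrence position of every token in text."""
    first_seen = {}
    positions = []
    # filter(None, ...) drops the None and empty-string pieces re.split produces.
    for token in filter(None, DELIM.split(text)):
        if token not in first_seen:
            first_seen[token] = len(first_seen) + 1
        positions.append(first_seen[token])
    return positions

# Hypothetical file names inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "newfiles.txt")
    out_path = os.path.join(tmp, "positions.txt")

    with open(in_path, "w") as f:
        f.write("I like to code, because to code is fun.\nA computer's skeleton.\n")

    # Read the whole file at once, as suggested above.
    with open(in_path) as f:
        positions = encode_positions(f.read())

    # Store the positions as one comma-separated line.
    with open(out_path, "w") as f:
        f.write(",".join(map(str, positions)))

    with open(out_path) as f:
        saved = f.read()

print(saved)  # 1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14,9
```

Because \s also matches newlines, reading the file in one go gives the same tokens as the single-line example string.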
Answer 2:
Use regex to capture the relevant substrings:
import re
my_string = "I like to code, because to code is fun. A computer's skeleton."
matched = re.findall(r"(\w+)([',.]?)", my_string) # (word, trailing punctuation) pairs; raw string keeps \w intact
Filter out the empty matches and add to the result:
result = []
for word, punc in matched:
    result.append(word)
    if punc: # Check if punctuation follows the word
        result.append(punc)
Then write the result to your file:
with open("file.txt", "w") as f:
    f.writelines(piece + "\n" for piece in result) # writelines adds no newlines itself, so append one per piece
The regex works by finding a run of word characters (\w+ matches letters, digits and underscores), then optionally capturing a single apostrophe, comma or full stop that directly follows it.
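To see what this pattern returns, and where it falls short, here is a small sketch. The second pattern, \w+|[^\w\s], is not from the answer; it is one possible variant, offered as an assumption, that treats any single character that is neither a word character nor whitespace as its own token, so punctuation that does not directly follow a word is kept.

```python
import re

text = 'He said: "I like to code."'

# Pattern from the answer: a word plus at most one of ' , . right after it.
pairs = re.findall(r"(\w+)([',.]?)", text)
print(pairs)
# [('He', ''), ('said', ''), ('I', ''), ('like', ''), ('to', ''), ('code', '.')]

# Variant (an assumption, not from the answer): a run of word characters,
# or any single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['He', 'said', ':', '"', 'I', 'like', 'to', 'code', '.', '"']
```

With the answer's pattern the colon and both quotation marks are silently dropped, which is fine for the example sentence in the question but may matter for other input files.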
Answer 3:
You can solve this using a regex and split. Hope this points you in the right direction. Good luck!
import re
str1 = '''I like to code, because to code is fun. A computer's skeleton.'''
# Split your string into a list using a regex with a capturing group:
matches = [x.strip() for x in re.split(r"([a-zA-Z]+)", str1) if x not in ['', ' ']]
print(matches)
d = {}
i = 1
list_with_positions = []
# Now build the dictionary entries (first occurrence of each token gets the next number):
for match in matches:
    if match not in d:
        d[match] = i
        i += 1
    list_with_positions.append(d[match])
print(list_with_positions)
Here is the output. Notice that the final period also appears, with position 9:
['I', 'like', 'to', 'code', ',', 'because', 'to', 'code', 'is', 'fun', '.', 'A', 'computer', "'", 's', 'skeleton', '.']
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
Source: https://stackoverflow.com/questions/41726645/split-function-when-writing-an-opened-file-in-python