Split function when writing an opened file in Python [duplicate]

时间秒杀一切 submitted on 2019-12-01 13:34:41

Question


I have a program that is supposed to take an external file, open it in Python, and then separate each word and each punctuation mark, including commas, apostrophes and full stops. It should then save the file as the integer positions at which each word and punctuation mark occurs in the text.

For example: I like to code, because to code is fun. A computer's skeleton.

In my program, I have to save this as:-

1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14

(For those who do not follow: 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton)

So this displays the position of each word; if a word repeats, it shows the position of its first occurrence.

Sorry for the long explanation but here is my actual question. I have done this so far:-

    file = open('newfiles.txt', 'r')
    with open('newfiles.txt','r') as file:
        for line in file:
            for word in line.split():
                print(word)

And here is the result:-

  They
  say
  it's
  a
  dog's
  life,.....

Unfortunately, splitting the file this way does not separate words from punctuation, and it does not print the words horizontally. Plain .split does not seem to do what I need on a file. Does anyone know a more effective way to split the file so that words are separated from punctuation, and then store the separated words and punctuation together in a list?


Answer 1:


The built-in string method .split only works with simple delimiters. Without an argument, it splits on runs of whitespace. For more complex splitting behavior, the easiest approach is a regular expression:

>>> s = "I like to code, because to code is fun. A computer's skeleton."
>>> import re
>>> delim = re.compile(r"""\s|([,.;':"])""")
>>> tokens = filter(None, delim.split(s))
>>> idx = {}
>>> result = []
>>> i = 1
>>> for token in tokens:
...     if token in idx:
...         result.append(idx[token])
...     else:
...         result.append(i)
...         idx[token] = i
...         i += 1
...
>>> result
[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]

Also, I don't think you need to iterate over the file line by line, as per your specifications. You should just do something like:

with open('my file.txt') as f:
    s = f.read()

This reads the entire file into s as a single string. Note that I did not call open before the with statement; that extra call in your code is redundant and leaves a file handle that is never explicitly closed.
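Putting the two pieces of this answer together, a minimal sketch might look like the following. The function names `tokenize`, `first_positions` and `positions_from_file` are hypothetical, and the delimiter set is just the one used above, so any punctuation outside it would stay attached to its word.

```python
import re

# Same delimiter pattern as above: split on whitespace, keep the listed punctuation.
DELIM = re.compile(r"""\s|([,.;':"])""")

def tokenize(text):
    """Return words and punctuation marks in order of appearance."""
    # re.split yields None for whitespace matches and '' between
    # adjacent delimiters; both are falsy, so drop them.
    return [tok for tok in DELIM.split(text) if tok]

def first_positions(tokens):
    """Map every token to the number assigned at its first occurrence."""
    seen = {}
    result = []
    for tok in tokens:
        # setdefault only stores len(seen) + 1 when tok is new;
        # otherwise it returns the number stored earlier.
        result.append(seen.setdefault(tok, len(seen) + 1))
    return result

def positions_from_file(path):
    """Read the whole file at once and return its position list."""
    with open(path) as f:
        return first_positions(tokenize(f.read()))
```

Calling `positions_from_file('newfiles.txt')` would then return the list of integers for that file, assuming the file exists and is readable.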




Answer 2:


Use regex to capture the relevant substrings:

import re

my_string = "I like to code, because to code is fun. A computer's skeleton."
matched = re.findall(r"(\w+)([',.]?)", my_string) # Split up relevant pieces of text

Filter out the empty matches and add to the result:

result = []
for word, punc in matched:
    result.append(word)
    if punc: # Check if punctuation follows the word
        result.append(punc)

Then write the result to your file:

with open("file.txt", "w") as f:
    f.write("\n".join(result)) # Write pieces on separate lines

The regex works by finding a run of word characters (\w matches letters, digits and the underscore), then optionally capturing a single apostrophe, comma or full stop that immediately follows it.
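One limitation of the pattern above is that it only picks up a punctuation mark that directly follows a word, and only one of them. As an alternative sketch (not the pattern from this answer), an alternation can match either a run of word characters or a single character that is neither a word character nor whitespace, which gives a flat token list directly:

```python
import re

text = "I like to code, because to code is fun. A computer's skeleton."

# \w+      -> a run of letters, digits or underscores
# [^\w\s]  -> any single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Because the pattern has no capturing groups, `re.findall` returns plain strings rather than tuples, so no extra filtering loop is needed.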




Answer 3:


You can solve this using regex and split. Hope this points you in the right direction. Good luck!

import re
str1 = '''I like to code, because to code is fun. A computer's skeleton.'''

#Split your string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", str1) if x not in ['',' ']]
print(matches)
d = {}
i = 1
list_with_positions = []

#now build the dictionary entries:
for match in matches:
    if match not in d:
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print(list_with_positions)

Here is the output. Notice that there is a final period with a position of #9:

['I', 'like', 'to', 'code', ',', 'because', 'to', 'code', 'is', 'fun', '.', 'A', 'computer', "'", 's', 'skeleton', '.']

[1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
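The question also asks for the positions to be saved in the form 1,2,3,... A minimal sketch of that last step, assuming the list built above; the helper name `save_positions` and the output file name are hypothetical:

```python
list_with_positions = [1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]

def save_positions(positions, path):
    """Write the positions to path as one comma-separated line."""
    # str.join only accepts strings, so convert each integer first.
    line = ",".join(str(p) for p in positions)
    with open(path, "w") as f:
        f.write(line + "\n")
    return line
```

For example, `save_positions(list_with_positions, "positions.txt")` would write a single line starting with 1,2,3,4 to positions.txt.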



Source: https://stackoverflow.com/questions/41726645/split-function-when-writing-an-opened-file-in-python
