I have a Python script that uses the nltk library to parse, tokenize, tag, and chunk some, let's say, random text from the web.
I need to format the result and write it to a file.
Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.
I will address all the issues in your code here:
You cannot write paths like this with a single \, because \t will be interpreted as a tab character and \f as a form feed. You must double the backslashes. I know it was only an example here, but such confusion often arises:
    with open('path\\to\\file.txt', 'r') as infile:
        xstring = infile.readlines()
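As a side note on the path issue, here is a small sketch of a few ways to spell a Windows-style path without accidental escapes. The path is the same placeholder as above, and the variable names are only illustrative.

```python
from pathlib import PureWindowsPath

# A single backslash before t or f is read as an escape sequence.
broken = 'path\to\file.txt'      # contains a tab and a form feed
escaped = 'path\\to\\file.txt'   # doubled backslashes: literal backslashes
raw = r'path\to\file.txt'        # raw string: backslashes kept as written

# pathlib builds the path from its parts, so no escaping is needed.
joined = str(PureWindowsPath('path', 'to', 'file.txt'))

print(repr(broken))
print(escaped == raw == joined)
```

A raw string is usually the least noisy option for a literal path; building the path from parts is handy when the pieces come from variables.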
The infile.close line that follows is wrong: without parentheses it does not call the close method, so it does not actually do anything. Furthermore, your file was already closed by the with clause. If you see this line in any answer anywhere, please just downvote the answer outright with a comment saying that file.close is wrong and should be file.close().
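To make the point about close concrete, here is a minimal sketch. It uses io.StringIO as a stand-in for a real file (an assumption made only so the example is self-contained); the helper variable names are hypothetical.

```python
import io

# io.StringIO stands in for a real file so the sketch is self-contained.
with io.StringIO('some text\n') as infile:
    lines = infile.readlines()
    infile.close          # only looks up the method; nothing is called
    still_open = not infile.closed

# Leaving the with block closed the object for us.
closed_after_with = infile.closed
print(still_open, closed_after_with)
```

Inside the block the object is still open even after the bare infile.close, and it is closed once the with block ends, so no explicit close() call is needed at all.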
The following should work, but be aware that it replaces every non-ASCII character with ' ', so it will break words such as naïve and café:
    def remove_non_ascii(line):
        return ''.join([i if ord(i) < 128 else ' ' for i in line])
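To illustrate the caveat about accented words, the sketch below runs the same replacement on a sample string and compares it with one possible alternative. The strip_accents helper is hypothetical and not part of the original code: it decomposes characters with unicodedata.normalize and then drops what is still non-ASCII, which keeps the base letters in many (not all) cases.

```python
import unicodedata

def remove_non_ascii(line):
    # Same approach as above: swap each non-ASCII character for a space.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in line)

def strip_accents(line):
    # Hypothetical alternative: decompose accented letters (NFKD), then
    # drop whatever still is not ASCII, such as the combining marks.
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'na\u00efve caf\u00e9'      # "naïve café"
print(repr(remove_non_ascii(sample)))   # 'na ve caf '
print(repr(strip_accents(sample)))      # 'naive cafe'
```

Which behaviour is preferable depends on what the downstream tokenizer expects; characters with no ASCII decomposition are simply dropped by the second helper.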
But here is the reason why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You do calculate the line with the non-ASCII characters removed, but that is a new value that is never stored back into the list:
    for i, line in enumerate(xstring):
        line = remove_non_ascii(line)
Instead it should be:
    for i, line in enumerate(xstring):
        xstring[i] = remove_non_ascii(line)
or, my preferred and very Pythonic version:
    xstring = [remove_non_ascii(line) for line in xstring]
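The difference between the three loops can be checked directly. This is a small sketch with made-up sample data; the names original, rebound, assigned, and rebuilt are only illustrative.

```python
def remove_non_ascii(line):
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in line)

original = ['caf\u00e9\n', 'plain\n']

# Rebinding the loop variable leaves the list untouched.
rebound = list(original)
for i, line in enumerate(rebound):
    line = remove_non_ascii(line)

# Assigning through the index replaces the element in place.
assigned = list(original)
for i, line in enumerate(assigned):
    assigned[i] = remove_non_ascii(line)

# The list comprehension builds a new, cleaned list.
rebuilt = [remove_non_ascii(line) for line in original]

print(rebound == original)
print(assigned == rebuilt)
```

The first loop only points the name line at a new string on each pass, so the list still holds the old strings; the other two variants end up with the cleaned lines.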
That said, these Unicode errors occur mainly because you are using Python 2.7 to handle Unicode text, something that recent Python 3 versions do far better. If you are at the very beginning of this task, I'd recommend upgrading to Python 3.4+ soon.
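For comparison, here is a rough sketch of how the reading step might look on Python 3. It assumes the input is UTF-8, which the question does not state, and it writes its own temporary file so that it can run on its own.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location for the demonstration.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('na\u00efve caf\u00e9\n')
    path = tmp.name

try:
    # Naming the encoding makes open() decode bytes to str while reading.
    with open(path, 'r', encoding='utf-8') as infile:
        xstring = infile.readlines()
finally:
    os.remove(path)

print(xstring)
```

With the text decoded to str up front, the accented characters survive intact and there is no need to strip them just to avoid an exception.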