Best way to remove '\xad' in Python?
Question

I'm trying to build a corpus from the .txt file found at this link. I believe the instances of \xad are 'soft hyphens', but they do not appear to be read correctly under UTF-8 encoding. I've tried reading the .txt file as iso8859-15, using the code:

    with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r', encoding='iso8859-15') as myfile:
        data = myfile.read().replace('\n', '')
    data2 = data.split(' ')

This returns an array of 'words', but '\xad' remains attached to many entries in
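
A minimal sketch of one possible approach, assuming the decoded text really contains the soft hyphen character U+00AD (which is what `'\xad'` denotes in a Python 3 string). The helper name `strip_soft_hyphens` and the sample string are hypothetical and not from the original post; the idea is simply to delete the character before splitting into words.

```python
SOFT_HYPHEN = '\xad'  # U+00AD SOFT HYPHEN, an invisible hyphenation hint


def strip_soft_hyphens(text):
    """Return text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, '')


# Hypothetical sample standing in for the contents of the .txt file.
sample = 'Pro\xadfes\xadsor Lupin\nsmiled'

# split() with no argument splits on any run of whitespace, including newlines.
words = strip_soft_hyphens(sample).split()
print(words)  # ['Professor', 'Lupin', 'smiled']
```

Note that splitting on whitespace with `split()` avoids fusing the last word of one line with the first word of the next, which can happen when `'\n'` is replaced by an empty string before `split(' ')`.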