encoding

Best way to remove '\xad' in Python?

风格不统一 submitted on 2020-07-20 09:15:10
Question: I'm trying to build a corpus from the .txt file found at this link. I believe the instances of \xad are supposed to be soft hyphens, but they do not appear to be read correctly under UTF-8 encoding. I've tried reading the .txt file as iso8859-15, using the code: with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r', encoding='iso8859-15') as myfile: data = myfile.read().replace('\n', '') data2 = data.split(' ') This returns a list of 'words', but '\xad' remains attached to many entries in
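
A minimal sketch of one possible approach, not necessarily the answer given on the original thread: once the file has been decoded, a soft hyphen is the single character U+00AD, so it can be dropped with str.replace before tokenizing. The helper names strip_soft_hyphens and tokenize and the sample string are hypothetical, introduced only for illustration.

```python
SOFT_HYPHEN = "\xad"  # U+00AD SOFT HYPHEN, an invisible hyphenation hint


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


def tokenize(text: str) -> list:
    """Hypothetical helper: drop soft hyphens, then split on whitespace runs."""
    # str.split() with no argument splits on spaces, tabs and newlines alike,
    # so words on either side of a line break are not glued together.
    return strip_soft_hyphens(text).split()


# Made-up sample standing in for the decoded file contents.
sample = "Pro\xadfes\xadsor Lupin\nsmiled"
words = tokenize(sample)
print(words)
```

Splitting with no argument is a design choice here: replacing '\n' with an empty string, as in the question, joins the last word of one line to the first word of the next.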

UTF-8 - contradictory definitions

自作多情 submitted on 2020-07-10 10:25:26
Question: My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either: data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code points; data in the lower 5 bits (0-4), with high bits 7-5 = 110, to indicate a 2-byte char; data in the lower 4 bits (0-3), with high bits 7-4 = 1110, to indicate a 3-byte char; or data in the lower 3 bits (0-2), with high bits 7-3 = 11110, to indicate a 4-byte char; noting that bit 7 is always set and this tells UTF-8 parsers
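
The lead-byte layout described in the question can be sketched as a small classifier. This is an illustrative sketch only; the function name utf8_sequence_length is hypothetical, and it checks only the lead byte, not the continuation bytes that follow.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: 7 data bits
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 5 data bits
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 4 data bits
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 3 data bits
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never starts a sequence.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


# Compare the prediction with what Python's own encoder produces.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), utf8_sequence_length(encoded[0]))
```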

Gibberish text output because of encoding in web scraping

人盡茶涼 submitted on 2020-07-09 14:20:37
Question: I'm trying to get text in Persian from Google Translate, and the best encoding for Persian is UTF-8. Google Translate uses JavaScript to render its HTML, so I'm using the requests-html module for this. The problem is the output I get each time, both when I use print() and when I try to write it to a file. Either way I get gibberish, non-Persian text, and I know it's because of the encoding or something like it. So I was trying to change
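
The excerpt is cut off before the cause is identified, so the following is only a sketch of one common source of such gibberish: UTF-8 bytes decoded with the wrong codec. It assumes Latin-1 as the wrong codec and uses a made-up Persian sample; the actual cause in the question may differ.

```python
text = "سلام"  # made-up Persian sample ("hello")

raw = text.encode("utf-8")  # the bytes a server would typically send

# Decoding UTF-8 bytes as Latin-1 never fails, but yields mojibake.
garbled = raw.decode("latin-1")

# If nothing was lost along the way, the damage is reversible:
# re-encode with the wrong codec, then decode with the right one.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

When writing to a file, passing encoding='utf-8' to open() explicitly avoids depending on the platform's default encoding.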

Django - pdf response has wrong encoding - xhtml2pdf

为君一笑 submitted on 2020-07-08 11:54:12
Question: I'm working on an invoice PDF generator for my Django website. I use xhtml2pdf. It seems to work, but the encoding is not correct: characters with diacritics come out wrong. This is the view: def render_to_pdf(template_src, context_dict): template = get_template("pdf/pdf.html") context = context_dict html = template.render(context) result = StringIO.StringIO() pdf = pisa.pisaDocument(StringIO.StringIO(html.encode('utf-8')), result) if not pdf.err: return HttpResponse(result

How to setup Visual Studio Code detect and set correct encoding on file open

巧了我就是萌 submitted on 2020-07-04 08:28:39
Question: I recently started using Visual Studio Code on server systems where I do not have the Visual Studio IDE installed. I like it very much, but I have run into a problem. When I opened a file in Notepad++, which I used before, the editor detected the encoding and set it for me. I still have many files on Windows servers in windows-1252, but VS Code just uses UTF-8 by default. I know I can reopen a file with the encoding Western (Windows 1252), but I often forget, and I have sometimes destroyed content by saving it. So I did not find any parameter

Weird leading characters utf-8/utf-16 encoding in Python

柔情痞子 submitted on 2020-07-03 11:54:29
Question: I have written a simplified version to demonstrate the problem. I am encoding special characters in UTF-8 and UTF-16. With UTF-8 there is no problem, but when I encode with UTF-16 I get some weird leading characters. I tried to remove all trailing and leading characters, but the error persists. Sample code: #!/usr/bin/env python2 # -*- coding: utf-8 -*- import chardet def myEncode(s, pattern): try: s.strip() u = unicode(s, pattern) print chardet.detect(u.encode
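
A likely explanation for the leading characters, offered as a sketch and not as the thread's accepted answer, is the byte order mark (BOM) that Python's generic 'utf-16' codec prepends. The example below uses Python 3 (the question's code is Python 2) and a made-up sample string.

```python
import codecs

text = "é"

# The generic codec writes a 2-byte BOM first, in the machine's byte order.
with_bom = text.encode("utf-16")

# Naming the byte order explicitly produces the payload without a BOM.
little_endian = text.encode("utf-16-le")

print(with_bom)
print(little_endian)

# Decoding with the generic codec consumes the BOM again.
round_trip = with_bom.decode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Whether to keep the BOM depends on the consumer: some tools need it to detect the byte order, while others display it as stray leading characters.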

Encoding issue : decode Quoted-Printable string in Python

别说谁变了你拦得住时间么 submitted on 2020-07-03 03:24:08
Question: In Python, I have a string in Quoted-Printable encoding: mystring="=AC=E9" This string should be printed as é. So I want to decode it and encode it in UTF-8, I guess. I understand that something is possible through import quopri quopri.decodestring('=A3=E9') But then I'm completely lost. How would you decode/encode this string so that it prints properly? Answer 1: import quopri Encoding: You can encode the character 'é' to Quoted-Printable using quopri.encodestring(). It takes a bytes object
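
The answer is cut off above, so here is a minimal sketch of the round trip it appears to describe. It assumes the payload is UTF-8 and uses the illustrative input =C3=A9 (the UTF-8 bytes of 'é'), which is not the string from the question; a payload in another charset would need that charset's name in the final decode step.

```python
import quopri

# Quoted-Printable text is ASCII, so it is passed to quopri as bytes.
qp_bytes = b"=C3=A9"  # assumed sample: the UTF-8 bytes of 'é', QP-encoded

# Step 1: undo the Quoted-Printable layer, which yields raw bytes.
raw = quopri.decodestring(qp_bytes)

# Step 2: interpret those bytes with the charset they were written in.
decoded = raw.decode("utf-8")
print(decoded)

# The reverse direction: text -> UTF-8 bytes -> Quoted-Printable bytes.
encoded = quopri.encodestring("é".encode("utf-8"))
print(encoded)
```

The two layers are independent: quopri only maps between bytes and their =XX escapes, and the character set is applied separately.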