encoding | 易学教程

Best way to remove '\xad' in Python?

Question: I'm trying to build a corpus from the .txt file found at this link. I believe the instances of \xad are 'soft hyphens', but they do not appear to be read correctly under UTF-8 encoding. I've tried reading the .txt file as iso8859-15, using the code:

    with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r', encoding='iso8859-15') as myfile:
        data = myfile.read().replace('\n', '')
    data2 = data.split(' ')

This returns an array of 'words', but '\xad' remains attached to many entries in
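A minimal sketch of one way to drop soft hyphens before tokenizing, assuming the text has already been decoded to a Python 3 `str`. The helper names `strip_soft_hyphens` and `tokenize` and the sample string are hypothetical and not taken from the question.

```python
# U+00AD SOFT HYPHEN: an invisible hint for where a word may be broken.
SOFT_HYPHEN = "\xad"


def strip_soft_hyphens(text):
    """Return text with every soft hyphen removed (hypothetical helper)."""
    return text.replace(SOFT_HYPHEN, "")


def tokenize(text):
    """Strip soft hyphens, then split on any run of whitespace."""
    # str.split() with no argument also handles newlines and repeated
    # spaces, so a separate replace('\n', ...) step is not needed.
    return strip_soft_hyphens(text).split()


# Illustrative sample, not the actual file contents.
sample = "Harry\xad Potter and the Pris\xadoner\nof Azkaban"
words = tokenize(sample)
print(words)
```

Removing the character after decoding keeps the fix independent of which codec was used to read the file.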

UTF-8 - contradictory definitions

Question: My understanding of UTF-8 encoding is that the first byte of a UTF-8 char carries either

- data in the lower 7 bits (0-6), with the high bit (7) clear, for single-byte ASCII-range code points
- data in the lower 5 bits (0-4), with high bits 7-5 = 110, to indicate a 2-byte char
- data in the lower 4 bits (0-3), with high bits 7-4 = 1110, to indicate a 3-byte char
- data in the lower 3 bits (0-2), with high bits 7-3 = 11110, to indicate a 4-byte char

noting that bit 7 is always set and this tells utf-8 parsers
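The lead-byte patterns described in the question can be sketched as a small classifier. This is an illustration only: the function name `utf8_sequence_length` is hypothetical, and the check looks only at the prefix bits, so it does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.

```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a UTF-8 lead byte (0-255)."""
    if lead & 0b10000000 == 0b00000000:  # 0xxxxxxx: 7 data bits
        return 1
    if lead & 0b11100000 == 0b11000000:  # 110xxxxx: 5 data bits
        return 2
    if lead & 0b11110000 == 0b11100000:  # 1110xxxx: 4 data bits
        return 3
    if lead & 0b11111000 == 0b11110000:  # 11110xxx: 3 data bits
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never valid in UTF-8.
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


# Compare against what Python's own encoder produces.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), utf8_sequence_length(encoded[0]))
```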

Gibberish text output because of encoding in web scraping

Question: I'm trying to get text in Persian from Google Translate, and the best encoding for Persian is UTF-8. Google Translate uses JavaScript to render its HTML, so I'm using the html-requests module for this. My problem is the output I get each time, whether I use print() or write it to a file. Both ways give me gibberish, non-Persian text, and I know it's because of the encoding or something like this. So I was trying to change
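The excerpt is cut off, so the actual cause in the asker's code is unknown. As a general illustration, gibberish of this kind often appears when UTF-8 bytes are decoded with a different codec. The sketch below assumes Latin-1 as the wrong codec and uses a sample word that is not from the question.

```python
persian = "سلام"  # sample Persian word ("hello"), not from the question

raw = persian.encode("utf-8")      # bytes as they would travel over HTTP
garbled = raw.decode("latin-1")    # wrong codec: every byte becomes one character
restored = raw.decode("utf-8")     # right codec: original text comes back

# Mojibake made this way can be undone by reversing the wrong step,
# because Latin-1 maps every byte value to exactly one character.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(persian), len(raw), len(garbled))
print(restored == persian, repaired == persian)
```

When writing such text to a file, passing `encoding="utf-8"` to `open()` avoids depending on the platform's default encoding.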

Django - pdf response has wrong encoding - xhtml2pdf

Question: I'm working on an invoice PDF generator for my Django website. I use xhtml2pdf. It seems to be working, but the encoding is not correct: wrong characters appear when I use diacritics. This is the view:

    def render_to_pdf(template_src, context_dict):
        template = get_template("pdf/pdf.html")
        context = context_dict
        html = template.render(context)
        result = StringIO.StringIO()
        pdf = pisa.pisaDocument(StringIO.StringIO(html.encode('utf-8')), result)
        if not pdf.err:
            return HttpResponse(result

How to setup Visual Studio Code detect and set correct encoding on file open

Question: I recently started using Visual Studio Code on server systems where I do not have the Studio IDE installed. I like it very much, but I have run into a problem. When I opened a file in Notepad++ (which I used before), the editor detected the encoding and set it for me. I still have many files on Windows servers in windows-1252, but VS Code just uses UTF-8 by default. I know I can reopen a file with the encoding Western (Windows 1252), but I often forget, and I have sometimes destroyed content by saving it. So I did not find any parameter

Weird leading characters utf-8/utf-16 encoding in Python

Question: I have written a simplified version to demonstrate the problem. I am encoding special characters in UTF-8 and UTF-16. With UTF-8 there is no problem, but when I encode with UTF-16 I get some weird leading characters. I tried to remove all trailing and leading characters, but the error persists. Sample code:

    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    import chardet

    def myEncode(s, pattern):
        try:
            s.strip()
            u = unicode(s, pattern)
            print chardet.detect(u.encode
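The excerpt is truncated, so the following is an assumption rather than a diagnosis: leading characters of this kind are often the byte order mark (BOM) that the generic `utf-16` codec prepends. A short Python 3 sketch (the question itself uses Python 2) showing where the BOM comes from and how the endianness-specific codecs avoid it:

```python
import codecs

text = "é"

# The generic codec writes a BOM first, in the machine's native byte order.
with_bom = text.encode("utf-16")

# The explicit little-endian and big-endian codecs write no BOM.
le_no_bom = text.encode("utf-16-le")
be_no_bom = text.encode("utf-16-be")

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, with_bom, le_no_bom, be_no_bom)

# Decoding with the generic codec consumes the BOM again.
print(with_bom.decode("utf-16") == text)
```

Note that `strip()` on the text cannot remove these bytes, because they are added during encoding and are not part of the string.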

Encoding issue : decode Quoted-Printable string in Python

Question: In Python, I have a string in Quoted-Printable encoding:

    mystring = "=C3=A9"

This string should be printed as é, so I want to decode it and encode it in UTF-8, I guess. I understand that something is possible through

    import quopri
    quopri.decodestring('=C3=A9')

But then I'm completely lost. How would you decode/encode this string so that it prints properly?

Answer 1:

    import quopri

Encoding: You can encode the character 'é' to Quoted-Printable using quopri.encodestring(). It takes a bytes object
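A minimal Python 3 sketch of the round trip, assuming the payload is the UTF-8 byte sequence for é (0xC3 0xA9, written `=C3=A9` in Quoted-Printable). The variable names are illustrative.

```python
import quopri

encoded = b"=C3=A9"  # assumed payload: UTF-8 bytes of 'é' in Quoted-Printable

# Step 1: Quoted-Printable -> raw bytes.
raw = quopri.decodestring(encoded)

# Step 2: raw bytes -> text, using the charset the bytes were written in.
text = raw.decode("utf-8")
print(text)

# Going the other way: text -> bytes -> Quoted-Printable.
re_encoded = quopri.encodestring(text.encode("utf-8"))
print(re_encoded)
```

Quoted-Printable only describes how bytes are escaped; the character set has to be known separately, which is why the second decoding step is needed.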