encoding

Detect encoding in wrongly encoded UTF-8 text file

冷眼眸甩不掉的悲伤 posted on 2021-01-29 09:04:21
Question: I have an encoding issue. I have millions of text files that I need to parse for a language data science project. Each text file is encoded as UTF-8, but I just found that some of these source files are not encoded properly. For example, I have a Chinese text file that is encoded as UTF-8, but the text in the file looks like this: Subject: »Ø¸´: ÎÒÉý¼¶µ½ When I use Python to detect the encoding of this Chinese text file, Chardet tells me the file is encoded as UTF-8: with open(path,'rb') as f:
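One way to test whether such a file holds text that was decoded with the wrong codec before being saved as UTF-8 is to reverse the suspected mistake. The sketch below assumes the bytes were originally GB2312 and were misread as Latin-1; both codec names and the helper `repair_mojibake` are illustrative guesses, not something the question confirms.

```python
def repair_mojibake(text, wrong="latin-1", intended="gb2312"):
    """Undo a wrong decode: re-encode with the codec that was mistakenly
    used, then decode with the codec that was intended.
    Returns None when the text does not fit this hypothesis."""
    try:
        return text.encode(wrong).decode(intended)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# The garbled subject line quoted in the question.
garbled = "Subject: »Ø¸´: ÎÒÉý¼¶µ½"
print(repair_mojibake(garbled))

# Text that is already proper Chinese cannot be encoded as Latin-1,
# so the helper signals that the hypothesis does not apply.
print(repair_mojibake("回复"))
```

A detector such as Chardet still reports UTF-8 for these files because the bytes on disk are valid UTF-8; only the characters they spell are wrong, which is why the check has to work on the decoded text.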

Thai script seems to lose UTF-8 encoding in java for-each loop

被刻印的时光 ゝ posted on 2021-01-29 08:49:25
Question: I'm trying to develop an application within Android Studio on Windows 10. PROBLEM: The following string array of Thai words: String[] myTHarr = {"มาก","เชี่ยว","แน่","ม่อน","บ้าน","พูด","เลื่อย","เมื่อ","ช่ำ","แร่"}; ...when processed by the following for-each loop: for (String s:myTHarr){ //s = มาà¸� before executing any of the below code: byte[] utf8EncodedThaiArr = s.getBytes("UTF-8"); String utf8EncodedThai = new String(utf8EncodedThaiArr); //setting breakpoint here // s is still มà

Questions about Base64 encoding

僤鯓⒐⒋嵵緔 posted on 2021-01-29 08:47:19
Question: I have 3 questions about Base64: 1) The purpose of Base64 encoding is binary-to-text. Isn't text going to be sent over the web as binary anyway? Then what is it good for? 2) In the past, 7-bit communication systems were used; now they are 8-bit. Then why are we still using it? 3) How does it increase the size? I just take 3 bytes (24 bits) and rearrange them into 4 groups of 6 bits, but in total they are still 24 bits? Answer 1: 1) The purpose is not only binary-to-text encoding, but also to encode text which uses specific
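The size question can be checked directly. The following sketch (the sample bytes are an arbitrary choice) shows that the 24 input bits do become four 6-bit groups, but each group is then written out as one full 8-bit ASCII character, so 3 bytes become 4.

```python
import base64

raw = b"Man"                      # 3 bytes = 24 bits
encoded = base64.b64encode(raw)   # 4 ASCII characters = 32 bits on the wire

# Split the 24 bits into four 6-bit values (each in the range 0..63).
bits = int.from_bytes(raw, "big")
groups = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]

print(groups)                     # [19, 22, 5, 46]
print(encoded)                    # b'TWFu'
print(len(raw), len(encoded))     # 3 4
```

The information content stays at 24 bits; the growth comes from spending a whole byte to carry each 6-bit value.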

How to solve “a bytes-like object is required, not 'str'” in create_message() function?

怎甘沉沦 posted on 2021-01-29 08:34:18
Question: I'm getting an error when creating a new message using the create_message() function listed at https://developers.google.com/gmail/api/guides/drafts. def create_message(sender, to, subject, message_text): message = MIMEText(message_text) message['to'] = to message['from'] = sender message['subject'] = subject return {'raw': base64.urlsafe_b64encode(message.as_string())} Error: TypeError: a bytes-like object is required, not 'str' Answer 1: base64.urlsafe_b64encode expects bytes, but the type of
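A minimal sketch of one possible fix, assuming Python 3: pass bytes to the encoder via `as_bytes()` and decode the result back to `str` so the returned dict can be serialized as JSON. The addresses in the usage line are placeholders.

```python
import base64
from email.mime.text import MIMEText


def create_message(sender, to, subject, message_text):
    message = MIMEText(message_text)
    message['to'] = to
    message['from'] = sender
    message['subject'] = subject
    # as_bytes() gives the encoder the bytes it requires; decode() turns
    # the base64 output back into str for use in a JSON request body.
    raw = base64.urlsafe_b64encode(message.as_bytes()).decode('ascii')
    return {'raw': raw}


# Placeholder addresses, for illustration only.
draft = create_message("me@example.com", "you@example.com", "Hi", "Hello")
print(type(draft['raw']))
```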

Decoding “=C3=A4” in a string

不想你离开。 posted on 2021-01-29 04:44:11
Question: I tried a lot of different things to get my string displayed correctly, but I can't make it work. This is the string: f=C3=A4hrt (German word: fährt). My file is encoded in UTF-8, and the file is loaded within Joomla. I tried both $geschichte->inhalt = utf8_encode($geschichte->inhalt); and $geschichte->inhalt = mb_convert_encoding($geschichte->inhalt, "UTF-8"); but nothing works. I hope someone can help me... Answer 1: This encoding has nothing to do with UTF-8 as such; it looks like quoted-printable
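To illustrate what quoted-printable decoding does to this string, here is a short Python sketch using the standard `quopri` module (the question itself is about PHP, which has its own `quoted_printable_decode` function). Each `=XX` sequence stands for one raw byte, and the resulting bytes are then read as UTF-8.

```python
import quopri

encoded = b"f=C3=A4hrt"

# Step 1: "=C3=A4" becomes the two raw bytes 0xC3 0xA4.
raw = quopri.decodestring(encoded)

# Step 2: those two bytes are the UTF-8 encoding of "ä".
decoded = raw.decode("utf-8")

print(raw)      # b'f\xc3\xa4hrt'
print(decoded)  # fährt
```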

Scrapy exporting weird symbols into csv file

你说的曾经没有我的故事 posted on 2021-01-29 04:32:14
Question: OK, so here's the issue. I'm a beginner who has just started to delve into Scrapy/Python. I use the code below to scrape a website and save the results into a CSV. When I look in the command prompt, it turns words like Officiële into Offici\xeble. In the CSV file, it changes it to officiële. I think this is because it's saving in Unicode instead of UTF-8? However, I have no clue how to change my code, and I've been trying all morning so far. Could anyone help me out here? I'm specifically
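Symptoms like these are often what UTF-8 output looks like when a viewer reads it as Windows-1252. The sketch below reproduces that effect and shows one common workaround, writing the CSV with a byte order mark via the `utf-8-sig` codec; that the viewer here assumes Windows-1252 is an assumption, not something the question states.

```python
import csv
import io

word = "Officiële"

# UTF-8 bytes misread as Windows-1252: "ë" (0xC3 0xAB) shows up as two characters.
misread = word.encode("utf-8").decode("cp1252")
print(misread)

# "utf-8-sig" prepends the bytes EF BB BF, which many spreadsheet
# programs use as a hint that the file is UTF-8.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="utf-8-sig", newline="")
csv.writer(wrapper).writerow([word])
wrapper.flush()
data = buffer.getvalue()
print(data[:3] == b"\xef\xbb\xbf")
```

In Scrapy the export encoding is controlled by the `FEED_EXPORT_ENCODING` setting; whether `utf-8-sig` is the right value for this particular project is a guess.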

Unicode String in urllib.request [duplicate]

佐手、 posted on 2021-01-29 03:52:09
Question: This question already has answers here: UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' - -when using urlib.request python3 (2 answers). Closed 1 year ago. The short version: I have a variable s = 'bär'. I need to convert s to ASCII so that s = 'b%C3%A4r'. Long version: I'm using urllib.request.urlopen() to read an mp3 pronunciation file from a URL. This has worked very well, except I ran into a problem because the URLs often contain Unicode characters. For example, the
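The conversion asked for in the short version is percent-encoding, which the standard library provides as `urllib.parse.quote`. A minimal sketch follows; the URL is a made-up placeholder, since the real one is cut off in the excerpt.

```python
from urllib.parse import quote

s = "bär"

# quote() encodes the string as UTF-8 by default and escapes every
# byte that is not safe to appear in a URL path.
escaped = quote(s)
print(escaped)  # b%C3%A4r

# Hypothetical URL: only the non-ASCII path component is quoted.
url = "https://example.com/audio/" + escaped + ".mp3"
print(url)
```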

Open a word document and specify encoding with PowerShell

微笑、不失礼 posted on 2021-01-29 02:52:48
Question: I'm trying to tell PowerShell to open a text file and choose a certain encoding option. By default, when opening this text file in Word manually, it tries to open it with Japanese encoding and so doesn't show certain characters correctly. I've tried lots of different things, but nothing works, so I'm totally stuck. This text file, amongst others, needs to be converted to PDF on a daily basis. My current script is as follows: $wdFormatPDF = 17 $word = New-Object -ComObject Word.Application $word

Read file from multiple encoding in c# [duplicate]

余生颓废 posted on 2021-01-28 21:31:45
Question: This question already has answers here: How can I detect the encoding/codepage of a text file (20 answers); How to use ReadAllText when file encoding unknown (2 answers). Closed 5 years ago. ENV: C#, VStudio 2013, 4.5 Framework, WinForms, nHapi 2.3 dll. I really need help on this. I have tried so many things and did a lot of research with my best friend Google ;-). But no luck. I'm building an HL7 sender tool and I'm reading files from a folder. My files come from multiple sources and I found
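A common approach to reading files from mixed sources is to check for a byte order mark and then try a short list of candidate encodings in order. The sketch below shows that idea in Python rather than C#; the helper name `decode_with_fallback` and the candidate list are assumptions chosen for illustration, and the right candidates depend on where the files actually come from.

```python
import codecs


def decode_with_fallback(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_name) for the first decoding that succeeds."""
    # A byte order mark, when present, identifies the encoding directly.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    # Strict UTF-8 goes first because it rejects most non-UTF-8 input.
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this never fails.
    return data.decode("latin-1"), "latin-1"


print(decode_with_fallback("é".encode("utf-8")))
print(decode_with_fallback("é".encode("cp1252")))
```

The order matters: single-byte codecs accept almost any input, so they only make sense as later fallbacks.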