unicode-string

Unicode File Writing and Reading in C++?

落花浮王杯 posted on 2019-12-22 22:21:55
Question: Can anyone provide a simple example of reading and writing Unicode characters in a Unicode file?
Answer 1: On Linux I use the iconv library, which is very standard. An overly simple program is:
    #include <stdio.h>
    #include <stdlib.h>
    #include <iconv.h>
    #define BUF_SZ 1024
    int main( int argc, char* argv[] )
    {
        char bin[BUF_SZ];
        char bout[BUF_SZ];
        char* inp;
        char* outp;
        ssize_t bytes_in;
        size_t bytes_out;
        size_t conv_res;
        if( argc != 3 )
        {
            fprintf( stderr, "usage: convert from to\n" );
            return 1;

Importing foreign languages from csv file to Stata

生来就可爱ヽ(ⅴ<●) posted on 2019-12-22 10:15:51
Question: I am using Stata 12 and have run into the following problem. I am importing a batch of .csv files into Stata with the insheet command. The datasets may include Russian, Croatian, Turkish, etc. I think they are encoded in "UTF-8". In the .csv files the text is correct, but after I import the files into Stata the original strings turn into strange characters. Could you please help me with that? Can Stat-Transfer solve the problem? Does it support the .csv format? For example, the

Python urllib.request and utf8 decoding question

╄→гoц情女王★ posted on 2019-12-21 06:18:05
Question: I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:
    #!/usr/bin/env python3.0
    import urllib.request
    site = "http://reddit.com/"
    site = urllib.request.urlopen(site)
    site = site.read()
    site = site.decode('utf8')
    print("Content-type: text/html\n\n")
    print(site)
This script works fine when run from the command line, but when I view it in a web browser it shows a blank page. Here is the

Conversion of UTF-8 char * to CString

孤街醉人 posted on 2019-12-21 05:28:06
Question: How do I convert a UTF-8 string held in a char* to a CString?
Answer 1:
    bool Utf8ToCString( CString& cstr, const char* utf8Str )
    {
        size_t utf8StrLen = strlen(utf8Str);
        if( utf8StrLen == 0 )
        {
            cstr.Empty();
            return true;
        }
        // GetBuffer returns LPTSTR, not LPTSTR*
        LPTSTR ptr = cstr.GetBuffer(utf8StrLen+1);
    #ifdef UNICODE
        // CString is UNICODE string so we decode
        int newLen = MultiByteToWideChar( CP_UTF8, 0, utf8Str, utf8StrLen, ptr, utf8StrLen+1 );
        if( !newLen )
        {
            cstr.ReleaseBuffer(0);
            return false;
        }
    #else
        // size the buffer in WCHARs, not bytes
        WCHAR* buf = (WCHAR*)malloc((utf8StrLen+1) * sizeof(WCHAR));

Python - BeautifulSoup html parsing handle gbk encoding poorly - Chinese webscraping problem

|▌冷眼眸甩不掉的悲伤 posted on 2019-12-19 10:23:42
Question: I have been tinkering with the following script:
    # -*- coding: utf8 -*-
    import codecs
    from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
    import urllib2,sys
    import time
    try:
        import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
        timeoutsocket.setDefaultSocketTimeout(10)
    except ImportError:
        pass
    h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'
    address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
    soup

Python .split() without 'u

大兔子大兔子 posted on 2019-12-19 06:06:00
Question: In Python, if I have a string like:
    a =" Hello - to - everybody"
and I do a.split('-'), then I get
    [u'Hello', u'to', u'everybody']
This is just an example. How can I get a simple list without that annoying u'?
Answer 1: The u means that it's a unicode string; your original string must also have been a unicode string. Generally it's a good idea to keep strings as Unicode, because converting them to normal strings can fail when a character has no equivalent. The u is purely used to let you
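The point made in the answer above can be sketched as follows. This is only an illustration and assumes Python 3, where every str is already Unicode and the list display carries no u prefix; under Python 2 the first print would still show u'...' because a list prints the repr of each element. The whitespace stripping is an addition for readability and is not part of the original question.

```python
a = " Hello - to - everybody"

# Split on the hyphen and strip the surrounding spaces from each piece.
parts = [piece.strip() for piece in a.split("-")]

# Printing the list shows the repr of each element (quotes included).
print(parts)

# Printing the joined elements shows only the text itself.
print(", ".join(parts))
```

Either way the strings in the list are unchanged; only the way they are displayed differs.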

Python .split() without 'u

痴心易碎 posted on 2019-12-19 06:05:14
Question: In Python, if I have a string like:
    a =" Hello - to - everybody"
and I do a.split('-'), then I get
    [u'Hello', u'to', u'everybody']
This is just an example. How can I get a simple list without that annoying u'?
Answer 1: The u means that it's a unicode string; your original string must also have been a unicode string. Generally it's a good idea to keep strings as Unicode, because converting them to normal strings can fail when a character has no equivalent. The u is purely used to let you

Python 3 - TypeError: a bytes-like object is required, not 'str'

走远了吗. posted on 2019-12-18 16:53:42
Question: I'm working on a lesson from Udacity and am having trouble finding out whether the result from this site is true or false. I get the TypeError with the code below.
    from urllib.request import urlopen
    #check text for curse words
    def check_profanity():
        f = urlopen("http://www.wdylike.appspot.com/?q=shit")
        output = f.read()
        f.close()
        print(output)
        if "b'true'" in output:
            print("There is a profane word in the document")
    check_profanity()
The output prints b'true' and I'm not really sure
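One possible way around the TypeError in the question above, as a minimal sketch. It assumes Python 3 and assumes the service replies with the plain text true or false, as the printed b'true' suggests. read() returns bytes, so the check either decodes the body first or compares it against a bytes literal. The helper names contains_profanity and contains_profanity_bytes are hypothetical, and the network call is left out so that the snippet runs offline.

```python
def contains_profanity(raw_response: bytes) -> bool:
    """Return True if the raw HTTP body reports a profane word."""
    # urlopen(...).read() returns bytes; decode to str before comparing with a str.
    text = raw_response.decode("utf-8")
    return "true" in text


def contains_profanity_bytes(raw_response: bytes) -> bool:
    """Same check without decoding: compare bytes against a bytes literal."""
    return b"true" in raw_response


if __name__ == "__main__":
    sample = b"true"  # stand-in for the value of f.read() in the question
    if contains_profanity(sample):
        print("There is a profane word in the document")
```

The original test `"b'true'" in output` fails because the left operand is a str while output is bytes; the b'...' seen on screen is only how print displays a bytes object.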

Convert hash.digest() to unicode

只愿长相守 posted on 2019-12-18 14:56:30
Question:
    import hashlib
    string1 = u'test'
    hashstring = hashlib.md5()
    hashstring.update(string1)
    string2 = hashstring.digest()
    unicode(string2)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 1: ordinal not in range(128)
The string HAS to be unicode for it to be of any use to me. Can this be done? I am using Python 2.7, if that helps...
Answer 1: The result of .digest() is a bytestring¹, so converting it to Unicode is pointless. Use .hexdigest() if you want a readable representation. ¹ Some
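The .hexdigest() suggestion in the answer above might look like this. It is a sketch that assumes Python 3 (the question uses Python 2.7, where the result would be wrapped in unicode() rather than being a str already); the helper name md5_hex and the choice of UTF-8 for encoding the input are assumptions made for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as a hexadecimal text string."""
    # Hash functions operate on bytes, so encode the text first.
    digest = hashlib.md5(text.encode("utf-8"))
    # hexdigest() returns ASCII hex characters, which are safe to treat as text.
    return digest.hexdigest()


if __name__ == "__main__":
    print(md5_hex("test"))  # prints 098f6bcd4621d373cade4e832627b4f6
```

The second byte of this digest is 0x8f, which is the byte the ascii codec rejects in the error message quoted in the question.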

How to work with unicode in Python

断了今生、忘了曾经 posted on 2019-12-18 13:04:10
Question: I am trying to clean all of the HTML out of a string so that the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution and one of the alternatives, for example pyparsing, I