multibyte

PHP mb_split(), capturing delimiters

Submitted by 不打扰是莪最后的温柔 on 2021-02-09 11:55:41
Question: preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns all delimiters in the returned array; mb_split does not. Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters? I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but would prefer a more generically usable solution.

Solution: Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP
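The delimiter-capturing behaviour the question asks for can be sketched in a different language. In Python (not PHP -- shown only to illustrate the idea), re.split() interleaves captured delimiters into the result exactly like PREG_SPLIT_DELIM_CAPTURE, and Python 3 strings are code points, so the split is inherently multibyte-safe:

```python
import re

def split_keep_linebreaks(text):
    """Split on any common linebreak sequence, keeping the delimiters.

    A capturing group in the pattern makes re.split() return the
    delimiters interleaved with the pieces between them.
    """
    parts = re.split(r'(\r\n|\r|\n)', text)
    return [p for p in parts if p != '']

# Works on non-ASCII text because the split happens on code points:
print(split_keep_linebreaks('один\r\nдва\nтри'))
# → ['один', '\r\n', 'два', '\n', 'три']
```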

How does gcc decide the wide character set when calling `mbtowc()`?

Submitted by 倖福魔咒の on 2021-02-07 19:33:30
Question: According to the GCC manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time. But what is the wide character set when converting a multibyte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multibyte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so

PHP - replace all non-alphanumeric chars for all languages supported

Submitted by 二次信任 on 2021-02-07 17:54:29
Question: Hi, I'm actually trying to replace all the non-alphanumeric chars in a string like this: mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string); The first problem is that it doesn't replace chars like "." in the string. Second, I would like to add multibyte support for all users' languages to this method. How can I do that? Any help appreciated, thanks a lot.

Answer 1: Try the following: preg_replace('/[^\p{L}0-9\s]+/u', '-', $string); When the u flag is used on a regular expression, \p{L} (and \p{Letter} )
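The same "any letter in any script" idea can be sketched outside PHP. In Python 3's re module (shown for illustration, not as the PHP answer), \w already matches letters from every script plus digits and underscore, so a language-agnostic punctuation replacement needs no \p{L} syntax -- with the caveat that \w also keeps underscores:

```python
import re

def dashify(s):
    # [^\w\s] = anything that is neither a Unicode word character
    # nor whitespace, i.e. punctuation and symbols in any language.
    return re.sub(r'[^\w\s]+', '-', s)

print(dashify('héllo. wörld! 文件名?'))
# → 'héllo- wörld- 文件名-'
```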

removing multibyte characters from a file using sed

Submitted by 帅比萌擦擦* on 2020-12-05 07:46:11
Question: I need to remove all multibyte characters from a file. I don't know what they are, so I need to cover the whole range. I can find them using grep like so: grep -P "[\x80-\xFF]" 'myfile'. I'm trying to do a similar thing with sed, but deleting them instead. Cheers.

Answer 1: Give this a try: LANG=C sed 's/[\x80-\xFF]//g' filename

Answer 2: You can use iconv to convert from one encoding to another.

Source: https://stackoverflow.com/questions/3521106/removing-multibyte-characters-from-a-file-using-sed
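The LANG=C sed answer works because it makes sed operate byte-wise. The same "delete every byte in 0x80-0xFF" filter can be expressed directly in Python (shown only as a sketch of what the sed command does to the file's bytes):

```python
def strip_high_bytes(raw: bytes) -> bytes:
    """Drop every byte >= 0x80, i.e. every byte of any multibyte
    UTF-8 sequence -- the byte-wise equivalent of the sed answer."""
    return bytes(b for b in raw if b < 0x80)

# 'é' is two high bytes in UTF-8 and '文' is three, so both vanish:
print(strip_high_bytes('abc héllo 文'.encode('utf-8')))
# → b'abc hllo '
```

Note that, like the sed command, this mangles multibyte characters rather than transliterating them -- which is exactly what the question asks for.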

glob() can't find file names with multibyte characters on Windows?

Submitted by 前提是你 on 2020-01-09 13:01:41
Question: I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache with PHP 5.3.8, with the following file names in a directory: filename.jpg имяфайла.jpg file件name.jpg פילענאַמע.jpg 文件名.jpg Testing on a live UNIX server worked fine. Testing locally on Windows, glob('./path/*') returns only the first one, filename.jpg. Using scandir(), the correct number of files is returned at least, but I get
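For contrast, here is the same directory-scanning scenario sketched in Python 3 (a different language from the question's PHP, shown only to illustrate the expected behaviour): file names are handled as Unicode text end to end, so a glob over multibyte names returns all of them. This assumes a filesystem that accepts Unicode file names:

```python
import glob
import os
import tempfile

# Create a scratch directory containing multibyte file names.
tmp = tempfile.mkdtemp()
names = ['filename.jpg', 'имяфайла.jpg', '文件名.jpg']
for name in names:
    open(os.path.join(tmp, name), 'w').close()

# glob matches all of them, not just the ASCII one.
found = sorted(os.path.basename(p)
               for p in glob.glob(os.path.join(tmp, '*.jpg')))
print(found)
```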

Has anyone been able to write out UTF-8 characters using python's xlwt?

Submitted by 北战南征 on 2020-01-09 11:12:06
Question: I'm trying to write data to an Excel file that includes Japanese characters. I'm using codecs.open() to get the data, and that seems to work fine, but I run into this error when I try to write the data: UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-17: ordinal not in range(128). I don't understand why the program would be insisting on using ASCII here. When I created a new workbook object, I did so using wb = xlwt.Workbook(encoding='utf-8'), and both the program file
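The error in the question happens whenever byte strings reach code that expects text, forcing an implicit ASCII codec. A minimal stdlib-only sketch (no xlwt required) reproduces the failure and shows why keeping the data as decoded text avoids it:

```python
text = '日本語'

# Reproduce the question's error: Japanese text cannot be
# represented in ASCII, so an implicit ascii encode blows up.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # → True

# The fix is to keep the data as text and only encode explicitly
# in a capable encoding (what Workbook(encoding='utf-8') is for):
encoded = text.encode('utf-8')
print(encoded.decode('utf-8') == text)  # → True
```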

Chinese character in source code when UTF-8 settings can't be used [duplicate]

Submitted by 我与影子孤独终老i on 2020-01-05 07:45:13
Question: This question already has an answer here: PHP and C++ for UTF-8 code unit in reverse order in Chinese character (1 answer). Closed 6 years ago.

This is the scenario: I can only use the char* data type for the string, not wchar_t*. My MS Visual C++ compiler has to be set to MBCS, not UNICODE, because the third-party source code that I have uses MBCS; setting it to UNICODE will cause data type issues. I am trying to print Chinese characters on a printer which needs to get a character string
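An MBCS build effectively passes char* buffers around in a legacy code page; for Simplified Chinese on Windows that is typically CP936/GBK (an assumption here -- the actual code page depends on the system locale and the printer). A Python sketch of what those bytes look like:

```python
text = '文件名'

# What an MBCS char* buffer for this string would contain under
# code page 936: each CJK character becomes two bytes.
mbcs_bytes = text.encode('gbk')
print(len(mbcs_bytes))                    # → 6 (three 2-byte characters)
print(mbcs_bytes.decode('gbk') == text)   # round-trips cleanly
```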

How to get correct list position in multi-byte string using preg_match

Submitted by 你说的曾经没有我的故事 on 2020-01-04 05:27:31
Question: I am currently matching HTML using this code: preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position) It matches everything perfectly; however, if I have a multibyte character, it counts it as 2 characters when giving back the position. For example, the returned $match array would give something like:

array
  0 => array
    0 => string '<br />' (length=6)
    1 => int 132
  1 => array
    0 => string 'br' (length=2)
    1 => int 133

The real number for the <br /> match is
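The mismatch arises because PREG_OFFSET_CAPTURE reports byte offsets while the rest of the code counts characters. The usual fix is to decode the byte prefix up to the match and count its characters -- in PHP terms, mb_strlen(substr($html, 0, $byte_off), 'UTF-8'). A Python sketch of the same conversion:

```python
def byte_to_char_offset(text: str, byte_off: int, encoding='utf-8') -> int:
    """Convert a byte offset into `text` (as encoded) to a character
    offset, by decoding the byte prefix and counting its characters."""
    return len(text.encode(encoding)[:byte_off].decode(encoding))

html = 'é<br />'
byte_off = html.encode('utf-8').find(b'<br')  # 2: 'é' is 2 bytes in UTF-8
print(byte_to_char_offset(html, byte_off))    # → 1: but only 1 character
```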

PHP mb_substr() not working correctly?

Submitted by 允我心安 on 2020-01-01 07:33:07
Question: This code: print mb_substr('éxxx', 0, 1); prints an empty space :( It is supposed to print the first character, é. This, however, seems to work: print mb_substr('éxxx', 0, 2); But it's not right, because (0, 2) means 2 characters...

Answer 1: Try passing the encoding parameter to mb_substr, as such: print mb_substr('éxxx', 0, 1, 'utf-8'); The encoding is never detected automatically.

Answer 2: In practice I've found that, in some systems, multi-byte functions default to ISO-8859-1 for internal encoding.
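The symptom boils down to slicing by bytes versus slicing by characters: without the right encoding, the first "unit" of 'éxxx' is only half of the two-byte UTF-8 sequence for é. Python (shown for illustration) makes the two views explicit:

```python
s = 'éxxx'

# Character slice: str indexing operates on code points.
print(s[:1])                # → 'é'

# Byte slice: one byte is only half of the 2-byte UTF-8 'é',
# which is the broken result the question describes.
raw = s.encode('utf-8')     # b'\xc3\xa9xxx'
print(raw[:1].decode('utf-8', errors='replace'))  # → '\ufffd'
print(raw[:2].decode('utf-8'))                    # → 'é'
```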