multibyte

PHP mb_split(), capturing delimiters

Submitted by 不打扰是莪最后的温柔 on 2021-02-09 11:55:41
Question: preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns all delimiters in the returned array; mb_split does not. Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters? I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but would prefer a more generically usable solution.

Solution: Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP
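The delimiter-capturing behaviour the question asks for can be sketched in a different language. In Python (not PHP -- shown only to illustrate the idea), re.split() interleaves captured delimiters into the result exactly like PREG_SPLIT_DELIM_CAPTURE, and Python 3 strings are code points, so the split is inherently multibyte-safe:

```python
import re

def split_keep_linebreaks(text):
    """Split on any common linebreak sequence, keeping the delimiters.

    A capturing group in the pattern makes re.split() return the
    delimiters interleaved with the pieces between them.
    """
    parts = re.split(r'(\r\n|\r|\n)', text)
    return [p for p in parts if p != '']

# Works on non-ASCII text because the split happens on code points:
print(split_keep_linebreaks('один\r\nдва\nтри'))
# → ['один', '\r\n', 'два', '\n', 'три']
```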

How does gcc decide the wide character set when calling `mbtowc()`?

Submitted by 倖福魔咒の on 2021-02-07 19:33:30
Question: According to the GCC manual, the option -fwide-exec-charset specifies the wide character set of wide string and character constants at compile time. But what is the wide character set when converting a multibyte character to a wide character by calling mbtowc() at run time? The POSIX standard says that the character set of multibyte characters is determined by the LC_CTYPE category of the current locale, but says nothing about the wide character set. I don't have a C standard at hand now so

PHP - replace all non-alphanumeric chars for all languages supported

Submitted by 二次信任 on 2021-02-07 17:54:29
Question: Hi, I'm actually trying to replace all the non-alphanumeric chars in a string like this: mb_ereg_replace('/[^a-z0-9\s]+/i','-',$string); The first problem is that it doesn't replace chars like "." in the string. Second, I would like to add multibyte support for all users' languages to this method. How can I do that? Any help appreciated, thanks a lot.

Answer 1: Try the following: preg_replace('/[^\p{L}0-9\s]+/u', '-', $string); When the u flag is used on a regular expression, \p{L} (and \p{Letter} )
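The same "any letter in any script" idea can be sketched outside PHP. In Python 3's re module (shown for illustration, not as the PHP answer), \w already matches letters from every script plus digits and underscore, so a language-agnostic punctuation replacement needs no \p{L} syntax -- with the caveat that \w also keeps underscores:

```python
import re

def dashify(s):
    # [^\w\s] = anything that is neither a Unicode word character
    # nor whitespace, i.e. punctuation and symbols in any language.
    return re.sub(r'[^\w\s]+', '-', s)

print(dashify('héllo. wörld! 文件名?'))
# → 'héllo- wörld- 文件名-'
```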

removing multibyte characters from a file using sed

Submitted by 帅比萌擦擦* on 2020-12-05 07:46:11
Question: I need to remove all multibyte characters from a file. I don't know what they are, so I need to cover the whole range. I can find them using grep like so: grep -P "[\x80-\xFF]" 'myfile'. I'm trying to do a similar thing with sed, but deleting them instead. Cheers.

Answer 1: Give this a try: LANG=C sed 's/[\x80-\xFF]//g' filename

Answer 2: You can use iconv to convert from one encoding to another.

Source: https://stackoverflow.com/questions/3521106/removing-multibyte-characters-from-a-file-using-sed
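The LANG=C sed answer works because it makes sed operate byte-wise. The same "delete every byte in 0x80-0xFF" filter can be expressed directly in Python (shown only as a sketch of what the sed command does to the file's bytes):

```python
def strip_high_bytes(raw: bytes) -> bytes:
    """Drop every byte >= 0x80, i.e. every byte of any multibyte
    UTF-8 sequence -- the byte-wise equivalent of the sed answer."""
    return bytes(b for b in raw if b < 0x80)

# 'é' is two high bytes in UTF-8 and '文' is three, so both vanish:
print(strip_high_bytes('abc héllo 文'.encode('utf-8')))
# → b'abc hllo '
```

Note that, like the sed command, this mangles multibyte characters rather than transliterating them -- which is exactly what the question asks for.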

glob() can't find file names with multibyte characters on Windows?

Submitted by 前提是你 on 2020-01-09 13:01:41
Question: I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache with PHP 5.3.8, with the following file names in a directory: filename.jpg имяфайла.jpg file件name.jpg פילענאַמע.jpg 文件名.jpg Testing on a live UNIX server worked fine. Testing locally on Windows, glob('./path/*') returns only the first one, filename.jpg. Using scandir(), the correct number of files is returned at least, but I get
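For contrast, here is the same directory-scanning scenario sketched in Python 3 (a different language from the question's PHP, shown only to illustrate the expected behaviour): file names are handled as Unicode text end to end, so a glob over multibyte names returns all of them. This assumes a filesystem that accepts Unicode file names:

```python
import glob
import os
import tempfile

# Create a scratch directory containing multibyte file names.
tmp = tempfile.mkdtemp()
names = ['filename.jpg', 'имяфайла.jpg', '文件名.jpg']
for name in names:
    open(os.path.join(tmp, name), 'w').close()

# glob matches all of them, not just the ASCII one.
found = sorted(os.path.basename(p)
               for p in glob.glob(os.path.join(tmp, '*.jpg')))
print(found)
```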

Has anyone been able to write out UTF-8 characters using python's xlwt?

Submitted by 北战南征 on 2020-01-09 11:12:06
Question: I'm trying to write data to an Excel file that includes Japanese characters. I'm using codecs.open() to get the data, and that seems to work fine, but I run into this error when I try to write the data: UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-17: ordinal not in range(128). I don't understand why the program would be insisting on using ASCII here. When I created a new workbook object, I did so using wb = xlwt.Workbook(encoding='utf-8'), and both the program file
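The error in the question happens whenever byte strings reach code that expects text, forcing an implicit ASCII codec. A minimal stdlib-only sketch (no xlwt required) reproduces the failure and shows why keeping the data as decoded text avoids it:

```python
text = '日本語'

# Reproduce the question's error: Japanese text cannot be
# represented in ASCII, so an implicit ascii encode blows up.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # → True

# The fix is to keep the data as text and only encode explicitly
# in a capable encoding (what Workbook(encoding='utf-8') is for):
encoded = text.encode('utf-8')
print(encoded.decode('utf-8') == text)  # → True
```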

Chinese character in source code when UTF-8 settings can't be used [duplicate]

Submitted by 我与影子孤独终老i on 2020-01-05 07:45:13
Question: This question already has an answer here: PHP and C++ for UTF-8 code unit in reverse order in Chinese character (1 answer). Closed 6 years ago.

This is the scenario: I can only use the char* data type for the string, not wchar_t*. My MS Visual C++ compiler has to be set to MBCS, not UNICODE, because the third-party source code that I have uses MBCS; setting it to UNICODE will cause data type issues. I am trying to print Chinese characters on a printer which needs to get a character string
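An MBCS build effectively passes char* buffers around in a legacy code page; for Simplified Chinese on Windows that is typically CP936/GBK (an assumption here -- the actual code page depends on the system locale and the printer). A Python sketch of what those bytes look like:

```python
text = '文件名'

# What an MBCS char* buffer for this string would contain under
# code page 936: each CJK character becomes two bytes.
mbcs_bytes = text.encode('gbk')
print(len(mbcs_bytes))                    # → 6 (three 2-byte characters)
print(mbcs_bytes.decode('gbk') == text)   # round-trips cleanly
```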

How to get correct list position in multi-byte string using preg_match

Submitted by 你说的曾经没有我的故事 on 2020-01-04 05:27:31
Question: I am currently matching HTML using this code: preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position) It matches everything perfectly; however, if I have a multibyte character, it counts it as 2 characters when giving back the position. For example, the returned $match array would give something like:

array
  0 => array
    0 => string '<br />' (length=6)
    1 => int 132
  1 => array
    0 => string 'br' (length=2)
    1 => int 133

The real number for the <br /> match is
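The mismatch arises because PREG_OFFSET_CAPTURE reports byte offsets while the rest of the code counts characters. The usual fix is to decode the byte prefix up to the match and count its characters -- in PHP terms, mb_strlen(substr($html, 0, $byte_off), 'UTF-8'). A Python sketch of the same conversion:

```python
def byte_to_char_offset(text: str, byte_off: int, encoding='utf-8') -> int:
    """Convert a byte offset into `text` (as encoded) to a character
    offset, by decoding the byte prefix and counting its characters."""
    return len(text.encode(encoding)[:byte_off].decode(encoding))

html = 'é<br />'
byte_off = html.encode('utf-8').find(b'<br')  # 2: 'é' is 2 bytes in UTF-8
print(byte_to_char_offset(html, byte_off))    # → 1: but only 1 character
```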

PHP mb_substr() not working correctly?

Submitted by 允我心安 on 2020-01-01 07:33:07
Question: This code: print mb_substr('éxxx', 0, 1); prints an empty space :( It is supposed to print the first character, é. This, however, seems to work: print mb_substr('éxxx', 0, 2); But it's not right, because (0, 2) means 2 characters...

Answer 1: Try passing the encoding parameter to mb_substr, as such: print mb_substr('éxxx', 0, 1, 'utf-8'); The encoding is never detected automatically.

Answer 2: In practice I've found that, in some systems, multi-byte functions default to ISO-8859-1 for internal encoding.
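The symptom boils down to slicing by bytes versus slicing by characters: without the right encoding, the first "unit" of 'éxxx' is only half of the two-byte UTF-8 sequence for é. Python (shown for illustration) makes the two views explicit:

```python
s = 'éxxx'

# Character slice: str indexing operates on code points.
print(s[:1])                # → 'é'

# Byte slice: one byte is only half of the 2-byte UTF-8 'é',
# which is the broken result the question describes.
raw = s.encode('utf-8')     # b'\xc3\xa9xxx'
print(raw[:1].decode('utf-8', errors='replace'))  # → '\ufffd'
print(raw[:2].decode('utf-8'))                    # → 'é'
```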