multibyte-functions

How to get correct list position in multi-byte string using preg_match

你说的曾经没有我的故事 submitted on 2020-01-04 05:27:31
Question: I am currently matching HTML using this code: preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position) It matches everything perfectly; however, if I have a multibyte character, it counts it as 2 characters when giving back the position. For example, the returned $match array would give something like: array 0 => array 0 => string '<br />' (length=6) 1 => int 132 1 => array 0 => string 'br' (length=2) 1 => int 133 The real number for the <br /> match is
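The offset mismatch described in this question comes from positions being reported in bytes rather than characters. A minimal sketch of converting a byte offset into a character offset, written in Python 3 for illustration (the helper name `byte_offset_to_char_offset` and the sample string are assumptions, not part of the original question):

```python
import re


def byte_offset_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Convert a byte offset into UTF-8 data to a code point offset.

    Hypothetical helper: decodes only the prefix before the offset and
    counts its code points.
    """
    return len(data[:byte_offset].decode("utf-8"))


# Sample input (an assumption): U+00E9 takes 2 bytes in UTF-8.
html = "\u00e9<br />"
data = html.encode("utf-8")

# Searching the encoded bytes yields byte offsets.
match = re.search(rb"<br />", data)
byte_pos = match.start()
char_pos = byte_offset_to_char_offset(data, byte_pos)

print(byte_pos, char_pos)
```

The same idea should carry over to PHP by counting the characters in the prefix, for example with mb_strlen(substr($html, 0, $offset), 'UTF-8'), though that line is a sketch rather than the accepted answer to this question.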

Is it safe to use `strstr` to search for multibyte UTF-8 characters in a string?

不问归期 submitted on 2019-12-22 04:55:19
Question: Following my previous question, Why `strchr` seems to work with multibyte characters, despite man page disclaimer?, I figured out that strchr was a bad choice. Instead, I am thinking about using strstr to look for a single character (multi-byte, not char): const char str[] = "This string contains é which is a multi-byte character"; char * pos = strstr(str, "é"); // 'é' = 0xC3A9: 2 bytes printf("%s\n", pos); Output: é which is a multi-byte character Which is what I expect: the position of the

php sprintf() with foreign characters?

◇◆丶佛笑我妖孽 submitted on 2019-12-20 17:34:12
Question: It seems like sprintf has a problem with foreign characters? Or am I doing something wrong? It looks like it works when removing chars like åäö from the string, though. Should that be necessary? I want the following lines to be aligned correctly for a report: 2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98 2011-11-30 A1856 -Rättat xx - 6 594,00 19.18 I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f Using: php-5.3.23-nts-Win32-VC9-x86 Answer 1: Strings in PHP are basically arrays of

How to handle multibyte string in Python

和自甴很熟 submitted on 2019-12-19 08:07:25
Question: There are multibyte string functions in PHP to handle multibyte strings (e.g. CJK scripts). For example, I want to count how many letters are in a multibyte string by using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in this string): japanese = "桜の花びらたち" print japanese print len(japanese) # returns 21 instead of 7 Is there any package or function like mb_strlen in PHP? Answer 1: Use Unicode strings: # Encoding: UTF-8 japanese = u"桜の花びらたち" print japanese print len

PHP Multi Byte str_replace?

旧巷老猫 submitted on 2019-12-17 12:44:47
Question: I'm trying to do accented character replacement in PHP but get funky results; my guess is that this is because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings. $accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è', 'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø', 'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'); $accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e', 'e

Character Encoding UTF8 Issue when using mb_detect_encoding() with PHP

梦想的初衷 submitted on 2019-12-10 10:33:45
Question: I am reading the RSS feed at http://beersandbeans.com/feed/ (the feed says it is in UTF-8 format), and I am using SimplePie RSS to import the content. When I grab the content and store it in $content, I perform the following: <?php header ('Content-type: text/html; charset=utf-8'); ?> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> </head><body> <?php echo $content; echo $enc = mb_detect_encoding($content, "UTF-8,ISO

multi-byte function to replace preg_match_all?

淺唱寂寞╮ submitted on 2019-12-10 02:58:08
Question: I'm looking for a multi-byte function to replace preg_match_all(). I need one that will give me an array of matched strings, like the $matches argument from preg_match(). The function mb_ereg_match() doesn't seem to do it; it only gives me a boolean indicating whether there were any matches. Looking at the mb_* functions page, I don't offhand see anything that replaces the functionality of preg_match(). What do I use? Edit: I'm an idiot. I originally posted this question asking for a

Character Encoding UTF8 Issue when using mb_detect_encoding() with PHP

南笙酒味 submitted on 2019-12-06 08:13:56
I am reading the RSS feed at http://beersandbeans.com/feed/ (the feed says it is in UTF-8 format), and I am using SimplePie RSS to import the content. When I grab the content and store it in $content, I perform the following: <?php header ('Content-type: text/html; charset=utf-8'); ?> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> </head><body> <?php echo $content; echo $enc = mb_detect_encoding($content, "UTF-8,ISO-8859-1", true); echo $content = mb_convert_encoding($content, "UTF-8", $enc); echo $enc = mb_detect

php sprintf() with foreign characters?

情到浓时终转凉″ submitted on 2019-12-03 04:48:33
It seems like sprintf has a problem with foreign characters? Or am I doing something wrong? It looks like it works when removing chars like åäö from the string, though. Should that be necessary? I want the following lines to be aligned correctly for a report: 2011-11-27 A1823 -Ref. Leif - 12 873,00 18.98 2011-11-30 A1856 -Rättat xx - 6 594,00 19.18 I'm using sprintf() like this: %-12s %-8s -%-10s -%20s %8.2f Using: php-5.3.23-nts-Win32-VC9-x86 Strings in PHP are basically arrays of bytes (not characters). They cannot work natively with multibyte encodings (such as UTF-8). For details see:
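The misalignment follows directly from padding being computed on bytes. A minimal sketch in Python 3 that contrasts byte-based and character-based padding (both helper names are hypothetical, and the byte-based one only imitates how a byte-oriented formatter behaves; it is not PHP's implementation):

```python
def pad_right_bytes(text: str, width: int) -> bytes:
    """Pad to `width` counting UTF-8 bytes, as a byte-oriented formatter would."""
    raw = text.encode("utf-8")
    return raw + b" " * max(0, width - len(raw))


def pad_right_chars(text: str, width: int) -> str:
    """Pad to `width` counting code points instead of bytes."""
    return text + " " * max(0, width - len(text))


plain = "Ref. Leif"            # 9 characters, 9 bytes
accented = "R\u00e4ttat xx"    # "Rättat xx": 9 characters, 10 bytes

# Byte-based padding: the accented field ends up one column short on screen.
byte_plain = pad_right_bytes(plain, 10).decode("utf-8")
byte_accented = pad_right_bytes(accented, 10).decode("utf-8")
print(len(byte_plain), len(byte_accented))

# Character-based padding: both fields occupy the same number of columns.
char_plain = pad_right_chars(plain, 10)
char_accented = pad_right_chars(accented, 10)
print(len(char_plain), len(char_accented))
```

Counting code points is itself a simplification: combining marks and double-width CJK characters would need a display-width calculation rather than a plain length.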

How to handle multibyte string in Python

為{幸葍}努か submitted on 2019-12-01 05:57:13
There are multibyte string functions in PHP to handle multibyte strings (e.g. CJK scripts). For example, I want to count how many letters are in a multibyte string by using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in this string): japanese = "桜の花びらたち" print japanese print len(japanese) # returns 21 instead of 7 Is there any package or function like mb_strlen in PHP? Use Unicode strings: # Encoding: UTF-8 japanese = u"桜の花びらたち" print japanese print len(japanese) Note the u in front of the string. To convert a bytestring into Unicode, use decode: "桜の花びらたち".decode(
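The snippets in this entry are Python 2. As a rough sketch of the same distinction in Python 3 (an assumption about the reader's environment, not part of the original answer), string literals are already Unicode, and bytes only appear after an explicit encode:

```python
# In Python 3, str holds code points, so no u prefix is needed.
japanese = "桜の花びらたち"

# Encoding produces a bytes object; UTF-8 is chosen here as an assumption.
encoded = japanese.encode("utf-8")

char_count = len(japanese)   # counts code points
byte_count = len(encoded)    # counts bytes

# Decoding with the matching codec recovers the original text.
decoded = encoded.decode("utf-8")

print(char_count, byte_count, decoded == japanese)
```

With this string the two lengths differ by a factor of three, because each of these characters occupies three bytes in UTF-8.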