Multibyte trim in PHP?

小鲜肉 2020-11-28 08:19

Apparently there's no mb_trim in the mb_* family, so I'm trying to implement one of my own.

I recently found this regex in a comment in php.net:

8 answers
  •  無奈伤痛
    2020-11-28 09:15

    (Ported from a duplicate Q on trim struggles with NBSP.) The following notes are valid as of PHP 7.2+. Mileage may vary with earlier versions (please report in comments).

    PHP's trim ignores non-breaking spaces. It only trims whitespace found in the basic ASCII range. For reference, the source code for trim reads as follows (i.e. trim has no undocumented features):

    (c == ' ' || c == '\n' || c == '\r' || c == '\t' || c == '\v' || c == '\0')
    

    Of the above, aside from the ordinary space (ASCII 32), these are all ASCII control characters: LF (10: \n), CR (13: \r), HT (9: \t), VT (11: \v), NUL (0: \0). (Note that in PHP, escape sequences are only interpreted inside double quotes: "\n", "\t", etc. Inside single quotes they are kept as a literal backslash followed by the letter.)
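
    As a quick illustration of the two points above, here is a minimal sketch (the variable name $padded is just for this example, and the "\u{...}" escape assumes PHP 7.0+):

    ```php
    <?php
    // "\u{00A0}" is a non-breaking space, encoded as the bytes 0xC2 0xA0 in UTF-8.
    $padded = "\u{00A0}hello\u{00A0}";

    // Neither byte is in trim()'s default character list, so nothing is removed.
    var_dump(trim($padded) === $padded); // bool(true)

    // The ASCII whitespace characters listed above are stripped as usual.
    var_dump(trim(" \t\n\r\v\0hello\0\v\r\n\t ")); // string(5) "hello"

    // Single quotes keep the backslash: '\n' is two characters, "\n" is one.
    var_dump(strlen('\n'), strlen("\n")); // int(2) int(1)
    ```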

    The following are simple implementations of the three flavors of trim (ltrim, rtrim, trim), using preg_replace, that work with Unicode strings:

    preg_replace('~^\s+~u', '', $string) // == ltrim
    preg_replace('~\s+$~u', '', $string) // == rtrim
    preg_replace('~^\s+|\s+$~us', '', $string) // == trim
    

    Feel free to wrap them into your own mb_*trim functions.
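
    One possible way to wrap them is sketched below. The my_mb_* names are hypothetical (chosen so they cannot clash with anything in mbstring), and the fallback to the original string when preg_replace fails is an assumption about what you would want on invalid input, not part of the patterns above:

    ```php
    <?php
    // Hypothetical helper names; these are not functions provided by mbstring.

    function my_mb_ltrim(string $string): string
    {
        // preg_replace() returns null on failure (e.g. invalid UTF-8 with the u modifier),
        // so fall back to the unmodified input in that case.
        return preg_replace('~^\s+~u', '', $string) ?? $string;
    }

    function my_mb_rtrim(string $string): string
    {
        return preg_replace('~\s+$~u', '', $string) ?? $string;
    }

    function my_mb_trim(string $string): string
    {
        return preg_replace('~^\s+|\s+$~u', '', $string) ?? $string;
    }

    // Usage: an ideographic space in front and a non-breaking space behind.
    echo my_mb_trim("\u{3000}Hello\u{00A0}"), "\n"; // Hello
    ```

    The names are deliberately not mb_trim/mb_ltrim/mb_rtrim: PHP 8.4 and later ship native functions with those names, so redefining them there would be a fatal error.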

    Per the PCRE specification, the \s ("any whitespace") escape sequence, with the u (Unicode) modifier on, matches all of the following space characters:

    The horizontal space characters are:
    
    U+0009     Horizontal tab (HT)
    U+0020     Space
    U+00A0     Non-break space
    U+1680     Ogham space mark
    U+180E     Mongolian vowel separator
    U+2000     En quad
    U+2001     Em quad
    U+2002     En space
    U+2003     Em space
    U+2004     Three-per-em space
    U+2005     Four-per-em space
    U+2006     Six-per-em space
    U+2007     Figure space
    U+2008     Punctuation space
    U+2009     Thin space
    U+200A     Hair space
    U+202F     Narrow no-break space
    U+205F     Medium mathematical space
    U+3000     Ideographic space
    
    The vertical space characters are:
    
    U+000A     Linefeed (LF)
    U+000B     Vertical tab (VT)
    U+000C     Form feed (FF)
    U+000D     Carriage return (CR)
    U+0085     Next line (NEL)
    U+2028     Line separator
    U+2029     Paragraph separator
    

    In a test iteration of preg_replace with the u Unicode flag against all of the listed spaces, they are all trimmed as expected, following the PCRE spec. To target only the horizontal spaces above, use \h; to target only the vertical spaces, use \v.
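
    A check along those lines can be sketched as follows. This is not the original test, just a minimal reconstruction; the helper name untrimmed_by is hypothetical, and the code points are copied from the two lists above using PHP 7.0+ "\u{...}" escapes:

    ```php
    <?php
    $horizontal = [
        'U+0009' => "\u{0009}", 'U+0020' => "\u{0020}", 'U+00A0' => "\u{00A0}",
        'U+1680' => "\u{1680}", 'U+180E' => "\u{180E}", 'U+2000' => "\u{2000}",
        'U+2001' => "\u{2001}", 'U+2002' => "\u{2002}", 'U+2003' => "\u{2003}",
        'U+2004' => "\u{2004}", 'U+2005' => "\u{2005}", 'U+2006' => "\u{2006}",
        'U+2007' => "\u{2007}", 'U+2008' => "\u{2008}", 'U+2009' => "\u{2009}",
        'U+200A' => "\u{200A}", 'U+202F' => "\u{202F}", 'U+205F' => "\u{205F}",
        'U+3000' => "\u{3000}",
    ];
    $vertical = [
        'U+000A' => "\u{000A}", 'U+000B' => "\u{000B}", 'U+000C' => "\u{000C}",
        'U+000D' => "\u{000D}", 'U+0085' => "\u{0085}", 'U+2028' => "\u{2028}",
        'U+2029' => "\u{2029}",
    ];

    // Hypothetical helper: returns the labels of the characters that are NOT
    // removed from both ends of "<char>x<char>" by the given pattern.
    function untrimmed_by(string $pattern, array $chars): array
    {
        $survivors = [];
        foreach ($chars as $label => $char) {
            if (preg_replace($pattern, '', $char . 'x' . $char) !== 'x') {
                $survivors[] = $label;
            }
        }
        return $survivors;
    }

    // Each call prints the code points the pattern failed to trim.
    print_r(untrimmed_by('~^\s+|\s+$~u', $horizontal + $vertical));
    print_r(untrimmed_by('~^\h+|\h+$~u', $horizontal));
    print_r(untrimmed_by('~^\v+|\v+$~u', $vertical));
    ```

    If the behavior described above holds on your PHP/PCRE build, all three arrays should come back empty.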

    The use of \p{Z} seen in some answers will fail on some counts; specifically, with every ASCII whitespace character other than the plain space, and shockingly, also with the Mongolian vowel separator. Kublai Khan would be furious. Here's the list of misses with \p{Z}: U+0009 Horizontal tab (HT), U+000A Linefeed (LF), U+000B Vertical tab (VT), U+000C Form feed (FF), U+000D Carriage return (CR), U+0085 Next line (NEL), and U+180E Mongolian vowel separator.

    As to why that happens, the above PCRE specification also notes: "\s any character that matches \p{Z} or \h or \v". That is, \s is a superset of \p{Z}. So simply use \s in place of \p{Z}. It's more comprehensive, and its meaning is more immediately obvious to someone reading your code who may not remember the shorthand for every character type.
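
    To see the difference between the two classes on your own setup, a small comparison like the following may help. It is only a sketch: matches_class is a hypothetical helper, and the sample covers a handful of the characters discussed above plus the plain space and NBSP for contrast:

    ```php
    <?php
    // Hypothetical helper: does the single character $char match the given
    // PCRE class (e.g. '\s' or '\p{Z}') with the u modifier on?
    function matches_class(string $class, string $char): bool
    {
        return preg_match('~\A' . $class . '\z~u', $char) === 1;
    }

    $candidates = [
        'U+0009 HT'    => "\u{0009}",
        'U+000A LF'    => "\u{000A}",
        'U+000C FF'    => "\u{000C}",
        'U+000D CR'    => "\u{000D}",
        'U+0085 NEL'   => "\u{0085}",
        'U+180E MVS'   => "\u{180E}",
        'U+0020 Space' => "\u{0020}",
        'U+00A0 NBSP'  => "\u{00A0}",
    ];

    // Print one row per character: whether \p{Z} and \s each match it.
    foreach ($candidates as $label => $char) {
        printf(
            "%-13s \\p{Z}: %-3s  \\s: %s\n",
            $label,
            matches_class('\p{Z}', $char) ? 'yes' : 'no',
            matches_class('\s', $char) ? 'yes' : 'no'
        );
    }
    ```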
