Split string on non-alphanumerics in PHP? Is it possible with PHP's native functions?

Posted by 泪湿孤枕 on 2019-12-13 20:05:13

Question


I was trying to split a string on non-alphanumeric characters or, simply put, I want to split it into words. The approach that immediately came to mind is to use regular expressions.

Example:
$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);

But I see two problems with this approach.

  1. It is not a native PHP function, and is totally dependent on the PCRE library running on the server.
  2. An equally important problem: what if a word contains punctuation?
    Example:
    $string = 'U.S.A-men\'s-vote';
    $splitArr = preg_split('/[^a-z0-9]/i', $string);

    Now this will split the string as [{U}{S}{A}{men}{s}{vote}]
    But I want it as [{U.S.A}{men's}{vote}]

So my questions are:

  • How can we split the string according to words?
  • Is it possible to do this with a native PHP function, or in some other way that does not depend on PCRE?

Regards


Answer 1:


Sounds like a case for str_word_count(), using the oft-forgotten value 1 or 2 for the second argument and a third argument that lists hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word parts) as part of a word. Follow it with an array_walk() to trim those characters from the beginning and end of the resulting array values, so that you only keep them when they're actually embedded in the "word".
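A minimal sketch of that idea follows. The helper name split_words() and the sample sentence are assumptions for illustration, not part of the answer. As far as I know, str_word_count() counts only letters as word characters by default, so digits would also have to be listed in the third argument if they matter.

```php
<?php
// Hypothetical helper: split $text into words, keeping the characters in
// $wordChars only when they sit inside a word.
function split_words(string $text, string $wordChars = ".'-"): array
{
    // Format 1 returns the words as a plain list; the third argument names
    // extra characters that count as part of a word.
    $words = str_word_count($text, 1, $wordChars);

    // Trim those characters from both ends of every word.
    array_walk($words, function (&$word) use ($wordChars) {
        $word = trim($word, $wordChars);
    });

    // A token made only of such characters (a lone "-") is now empty: drop it.
    return array_values(array_filter($words, function ($word) {
        return $word !== '';
    }));
}

print_r(split_words("Mr. O'Neil visited the U.S.A. - twice."));
// Words: Mr, O'Neil, visited, the, U.S.A, twice
```

Because hyphens are treated as word parts here, the question's "U.S.A-men's-vote" would come back as a single word; replacing '-' with a space beforehand is one way around that.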




Answer 2:


Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.

Then, if you want to exclude punctuation characters from your splitting delimiters, you need to add them to your negated character class:

preg_split('/[^a-z0-9.\']+/i', $string);
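Applied to the question's sample string, that pattern can be tried as follows (a small sketch; the string is written with double quotes so the apostrophe needs no escaping):

```php
<?php
// The question's sample string.
$string = "U.S.A-men's-vote";

// Split on runs of anything that is not a letter, digit, dot or apostrophe.
$splitArr = preg_split('/[^a-z0-9.\']+/i', $string);

print_r($splitArr);
// Elements: U.S.A, men's, vote
```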

If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:

preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
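To see the difference between the two patterns, here is a small comparison on a made-up sentence (the sentence and the variable names are assumptions, not part of the answer):

```php
<?php
// Hypothetical sample: one dot ends a sentence, the others belong to "U.S.A".
$string = "I live in the U.S.A. It's nice.";

// Dots are never delimiters.
$simple = preg_split('/[^a-z0-9.\']+/i', $string);
// Elements: I, live, in, the, U.S.A., It's, nice.

// A dot followed by whitespace is a delimiter; other dots are kept.
$contextual = preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
// Elements: I, live, in, the, U.S.A, It's, nice.

print_r($simple);
print_r($contextual);
```

The final "nice." keeps its dot in both cases, because no whitespace follows it.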



Answer 3:


As per my comment, you might want to try (add as many separators as needed)

$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
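A short sketch of what the PREG_SPLIT_NO_EMPTY flag changes, using a made-up sample string (an assumption for illustration):

```php
<?php
// Hypothetical sample that ends with a separator character.
$string  = "Hello, world! It's fine; really?";
$pattern = '/[\s,!\?;:-]+|[\.]\s+/';

// Without the flag, the trailing "?" leaves an empty string at the end.
$withEmpty = preg_split($pattern, $string);

// With the flag, empty pieces are dropped.
$splitArr = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($splitArr);
// Elements: Hello, world, It's, fine, really
```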

You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).

So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling

they 're 'just friends'. Or that's what they say.

whereas a stray "'re", or a sequence of words in which the first is left-quoted and the last is right-quoted (the first not being a known suffix such as 's, 're, 'll, 'd ...), can be handled at application level.
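One possible shape for that application-level step is sketched below. The suffix list, the constant KNOWN_SUFFIXES and the function unquote_words() are all hypothetical, and the rule is deliberately naive: it keeps known suffixes as they are and strips quote marks from the ends of every other token.

```php
<?php
// Hypothetical list of known contraction suffixes; extend as needed.
const KNOWN_SUFFIXES = ["'s", "'re", "'ll", "'d"];

// Hypothetical helper: remove quote marks that wrap words, but leave
// detached suffixes such as "'re" untouched.
function unquote_words(array $tokens): array
{
    $out = [];
    foreach ($tokens as $token) {
        if (in_array(strtolower($token), KNOWN_SUFFIXES, true)) {
            $out[] = $token;          // a known suffix, keep its apostrophe
            continue;
        }
        $word = trim($token, "'\"");  // strip leading/trailing quote marks
        if ($word !== '') {
            $out[] = $word;
        }
    }
    return $out;
}

$tokens = preg_split(
    '/[\s,!\?;:-]+|[\.]\s+/',
    "they 're 'just friends'. Or that's what they say.",
    -1,
    PREG_SPLIT_NO_EMPTY
);
$words = unquote_words($tokens);

print_r($words);
// Elements: they, 're, just, friends, Or, that's, what, they, say.
```

A rule this simple would also strip the apostrophe from a plural possessive such as "dogs'", so real text would need more care.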




Answer 4:


This is not a PHP problem, but a logical one.

Words can be joined by a -. Abbreviations can look like short sentences.

You can match your example directly by writing a solution that fits only this particular phrase, but you can't get a solution for all possible phrases. That would require content recognition based on neural computing.



Source: https://stackoverflow.com/questions/13047610/split-string-on-non-alphanumerics-in-php-is-it-possible-with-phps-native-funct
