问题
I was trying to split a string on non-alphanumeric characters or simple put I want to split words. The approach that immediately came to my mind is to use regular expressions.
Example:$string = 'php_php-php php';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
But there are two problems that I see with this approach.
- It is not a native php function, and is totally dependent on the PCRE Library running on server.
- An equally important problem is that what if I have punctuation in a word
Example:$string = 'U.S.A-men's-vote';
$splitArr = preg_split('/[^a-z0-9]/i', $string);
Now this will spilt the string as[{U}{S}{A}{men}{s}{vote}]
But I want it as[{U.S.A}{men's}{vote}]
So my question is that:
- How can we split them according to words?
- Is there a possibility to do it with php native function or in some other way where we are not dependent?
Regards
回答1:
Sounds like a case for str_word_count() using the oft forgotten 1 or 2 value for the second argument, and with a 3rd argument to include hyphens, full stops and apostrophes (or whatever other characters you wish to treat as word-parts) as part of a word; followed by an array_walk() to trim those characters from the beginning or end of the resultant array values, so you only include them when they're actually embedded in the "word"
回答2:
Either you have PHP installed (then you also have PCRE), or you don't. So your first point is a non-issue.
Then, if you want to exclude punctuation from your splitting delimiters, you need to add them to your character class:
preg_split('/[^a-z0-9.\']+/i', $string);
If you want to treat punctuation characters differently depending on context (say, make a dot only be a delimiter if followed by whitespace), you can do that, too:
preg_split('/\.\s+|[^a-z0-9.\']+/i', $string);
回答3:
As per my comment, you might want to try (add as many separators as needed)
$splitArr = preg_split('/[\s,!\?;:-]+|[\.]\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
You'd then have to handle the case of a "quoted" word (it's not so easy to do in a regular expression, because 'is" "this' quoted? And how?).
So I think it's best to keep ' and " within words (so that "it's" is a single word, and "they 'll" is two words) and then deal with those cases separately. For example a regexp would have some trouble in correctly handling
they 're 'just friends'. Or that's what they say.
while having "'re" and a sequence of words of which the first is left-quoted and the last is right-quoted, the first not being a known sequence ('s, 're, 'll, 'd ...) may be handled at application level.
回答4:
This is not a php-problem, but a logical one.
Words could be concatenated by a -. Abbrevations could look like short sentences.
You can match your example directly by creating a solution that fits only on this particular phrase. But you cant get a solution for all possible phrases. That would require a neuronal-computing based content-recognition.
来源:https://stackoverflow.com/questions/13047610/split-string-on-non-alphanumerics-in-php-is-it-possible-with-phps-native-funct