I would like to split a text into single words using PHP. Do you have any idea how to achieve this?
My approach:
function tokenizer($text) {
$tex
I would first make the string to lower-case before splitting it up. That would make the i modifier and the array processing afterwards unnecessary. Additionally I would use the \W shorthand for non-word characters and add a + multiplier.
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
Edit Use the Unicode character properties instead of \W as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specific than \W.