Is there any way to make a regex that ignores accents?
For example:
preg_replace("/$word/i", "$word", $str);
<?php
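// Fallback for PHP versions older than 5.1, which lack htmlspecialchars_decode()
if (!function_exists('htmlspecialchars_decode')) {
    function htmlspecialchars_decode($text) {
        // '&amp;' is decoded last so that already-decoded text is not decoded twice
        return str_replace(array('&lt;','&gt;','&quot;','&amp;'),array('<','>','"','&'),$text);
    }
}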
// Expects ISO-8859-1 input and returns it with accents and ligatures replaced by plain letters
function removeMarkings($text)
{
    // Name the encoding explicitly: since PHP 5.4 the default is UTF-8,
    // and ISO-8859-1 input would then be rejected as invalid
    $text=htmlentities($text,ENT_COMPAT,'ISO-8859-1');
    // components (key+value = entity name, replace with key)
    $table1=array(
        'a'=>'grave|acute|circ|tilde|uml|ring',
        'ae'=>'lig', // one byte becomes two letters, which shifts later offsets
        'c'=>'cedil',
        'e'=>'grave|acute|circ|uml',
        'i'=>'grave|acute|circ|uml',
        'n'=>'tilde',
        'o'=>'grave|acute|circ|tilde|uml|slash',
        's'=>'zlig', // maybe szlig=>ss would be more accurate?
        'u'=>'grave|acute|circ|uml',
        'y'=>'acute'
    );
    // direct (key = entity, replace with value)
    $table2=array(
        '&ETH;'=>'D', // not sure about these character replacements
        '&eth;'=>'d', // is an ð pronounced like a 'd'?
        '&THORN;'=>'B', // is a þ pronounced like a 'b'?
        '&thorn;'=>'b' // don't think so, but the symbols looked like a d,b so...
    );
    // e.g. &Aacute; matches /&(a)(acute);/i and is replaced by the captured letter, keeping its case
    foreach ($table1 as $k=>$v) $text=preg_replace("/&($k)($v);/i",'\1',$text);
    $text=str_replace(array_keys($table2),$table2,$text);
    return htmlspecialchars_decode($text);
}
$text="Here are two words, one written normally and one with accents: java and jává. I searched with java and it found both occurrences (highlighted in this sentence): java and jává<br/>";
$find="java"; // The word to highlight; this search term should match both java and jává
// Convert to ISO-8859-1 so each accented letter is one byte and the offsets found in the
// stripped text line up with the original (utf8_decode() is deprecated as of PHP 8.2).
$text=utf8_decode($text);
$find=removeMarkings(utf8_decode($find)); $len=strlen($find);
preg_match_all('/\b'.preg_quote($find,'/').'\b/i', removeMarkings($text), $matches, PREG_OFFSET_CAPTURE);
$start=0; $newtext="";
foreach ($matches[0] as $m) {
    $pos=$m[1]; // byte offset of this match
    $newtext.=substr($text,$start,$pos-$start); // text before the match
    $newtext.="<b>".substr($text,$pos,$len)."</b>"; // the original (accented) word, highlighted
    $start=$pos+$len;
}
$newtext.=substr($text,$start); // remainder after the last match
echo "<blockquote>",$newtext,"</blockquote>"; // note: $newtext is still ISO-8859-1 here
?>
I think something like this will help you. I got it from a forum, so just take a look.
I don't think there is such a way. That would be locale-dependent, and you would probably want the "/u" switch first to enable UTF-8 in pattern strings.
I would probably do something like this.
function prepare($pattern)
{
    // Each class includes the plain letter so that unaccented text still matches
    $replacements = array("a" => "[aáàäâ]",
                          "e" => "[eéèëê]" /* ... */);
    return str_replace(array_keys($replacements), $replacements, $pattern);
}
preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);
In your case, the index was different because, unless you used the mbstring functions,
you were probably counting bytes in a UTF-8 string, and UTF-8 uses more than one byte for each accented character.
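To make the byte-versus-character point concrete, here is a small sketch. It assumes PHP 7 or later (for the "\u{...}" escapes) and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// "jává" written with explicit code points so the file encoding does not matter
$word = "j\u{E1}v\u{E1}";

$bytes = strlen($word);              // strlen() counts bytes
$chars = mb_strlen($word, 'UTF-8');  // mb_strlen() counts characters

// Position of the "v": byte offset versus character offset
$byteOffset = strpos($word, "v");
$charOffset = mb_strpos($word, "v", 0, 'UTF-8');

echo $bytes, " bytes, ", $chars, " characters\n";          // 6 bytes, 4 characters
echo "v at byte ", $byteOffset, ", character ", $charOffset, "\n"; // byte 3, character 2
```

So an offset taken from the accent-stripped string cannot be reused on the UTF-8 original unless both are measured the same way.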
Set an appropriate locale (fr_FR, for example) with setlocale() and use the strcoll() function to compare the strings. Whether accented and unaccented letters compare as equal depends on that locale's collation rules.
Regex isn't the tool for you here.
What you're looking for is the strtr() function, which replaces specified characters in a string.
In your example, Jávã, you could use a strtr() call like this:
$replacements = array('á'=>'a', 'ã'=>'a');
$output = strtr("Jávã",$replacements);
$output will now contain Java.
Of course, you'll need a bigger $replacements array to deal with all the characters you want to work with. See the strtr() manual page for some examples of how people are using it.
Note that there isn't a simple blanket list of characters, because firstly it would be huge, and secondly, the same starting character may need to be translated differently in different contexts or languages.
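As a rough sketch of how that could be used for an accent-insensitive search: fold both strings with strtr() and then compare them. The helper names foldAccents() and containsWord() are made up for this example, the map is deliberately incomplete, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: replace accented letters with their plain equivalents.
// The map only covers a few letters; extend it for the languages you need.
function foldAccents($text)
{
    static $map = array(
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a',
        'Á' => 'A', 'À' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
    );
    // With an array argument, strtr() replaces each key with its value in one pass
    return strtr($text, $map);
}

// Hypothetical helper: case-insensitive, accent-insensitive substring test
function containsWord($haystack, $needle)
{
    return stripos(foldAccents($haystack), foldAccents($needle)) !== false;
}

var_dump(foldAccents("Jávã"));                 // string(4) "Java"
var_dump(containsWord("I like Jávã", "java")); // bool(true)
```

Folding both sides keeps the comparison simple, but the folded string has different byte offsets from the original, so this suits yes/no matching better than highlighting.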
Hope that helps.
Java and Jávã are different words, and regex has no native support for ignoring accents. You can, however, list in your regex every accented and unaccented spelling that you want to replace.
For example: preg_replace("/java|Jávã|jáva|javã/i", "<b>$word</b>", $str);
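Rather than typing every combination by hand, the pattern could also be generated from the plain word. This is only a sketch: accentInsensitivePattern() is a made-up name, the letter table is incomplete, the search word is assumed to be plain ASCII, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: turn each plain vowel into a character class of its variants
function accentInsensitivePattern($word)
{
    $classes = array(
        'a' => '[aáàâãä]',
        'e' => '[eéèêë]',
        'i' => '[iíìîï]',
        'o' => '[oóòôõö]',
        'u' => '[uúùûü]',
    );
    // Escape regex metacharacters first, then expand the letters
    $quoted = preg_quote(strtolower($word), '/');
    // /i ignores case, /u makes PCRE treat pattern and subject as UTF-8
    return '/' . strtr($quoted, $classes) . '/iu';
}

$str = "java and Jávã";
// $0 is the matched text, so the original accents and case are kept in the output
$highlighted = preg_replace(accentInsensitivePattern("java"), '<b>$0</b>', $str);
echo $highlighted, "\n"; // <b>java</b> and <b>Jávã</b>
```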
Good luck!