How do I remove Arabic diacritics dynamically? I'm designing an ebook ("chm") that has multiple HTML pages containing Arabic text, but sometimes the search engine needs to highlight, so
Try this
Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين
http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/
The code is C#, not JavaScript, though. I was still trying to figure out how to achieve this in JavaScript.
EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate characters (their own code points), so they can be removed quite easily.
var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;
function isCharTashkeel(letter)
{
    if (typeof letter == "undefined" || letter == null)
        return false;
    var code = letter.charCodeAt(0);
    //1600 - tatweel (kashida), 1648 - superscript alif
    //1611..1631 - U+064B (fathatan) through U+065F; covers shadda (1617), sukoon (1618) and madd (1619): ~
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631)); //tashkeel
}
function stripTashkeel(input)
{
    var output = "";
    //todo: if performance matters, consider pushing letters into an array and calling join("")
    for (var i = 0; i < input.length; i++)
    {
        var letter = input.charAt(i);
        if (!isCharTashkeel(letter)) //keep everything that is not tashkeel
            output += letter;
    }
    return output;
}
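The same kind of filtering can probably be written as a single regular expression replace. This is only a minimal sketch: the name stripTashkeelRegex is hypothetical, and the character class (tatweel U+0640, the range U+064B to U+065F, and superscript alif U+0670) is my assumption about which marks should count as tashkeel, so adjust it to your text.

```javascript
// Assumed set of marks to drop: U+0640 (tatweel), U+064B-U+065F (harakat,
// shadda, sukoon, madd, ...) and U+0670 (superscript alif).
var TASHKEEL_PATTERN = /[\u0640\u064B-\u065F\u0670]/g;

// Hypothetical helper: returns the input with every matching mark removed.
function stripTashkeelRegex(input) {
    if (typeof input !== "string") {
        return "";
    }
    return input.replace(TASHKEEL_PATTERN, "");
}

// The first word of the sample text, written with escapes so the marks are visible:
// alif, lam + sukoon, hah + fatha, meem + sukoon, dal + damma
var sample = "\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F";
console.log(stripTashkeelRegex(sample)); // prints الحمد
```

A regex keeps the list of stripped code points in one place, which may be easier to maintain than numeric comparisons spread over a helper function.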
Edit: Here is another way to do it using BuckData http://qurandev.github.com/
Advantages:
- Buck uses less bandwidth.
- In JavaScript, you can search through the entire Buck Quran text in one shot.
- It is intuitive compared to Arabic search.
- Converting Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
- You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
- Regex plus Buck text can lead to awesome optimizations. All the searches can be run locally: http://qurandev.appspot.com
How was the data generated? It is just a one-to-one mapping using: http://corpus.quran.com/java/buckwalter.jsp
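For the highlighting case from the question, ignoring tashkeel can also be done on the Arabic text directly, without transliteration. The sketch below is an illustration under assumptions, not part of BuckData or any library: the function names, the choice of a `<mark>` wrapper, and the set of code points treated as tashkeel (U+0640, U+064B to U+065F, U+0670) are all mine. It builds a pattern from the search term that tolerates optional marks after every letter, so the original, fully vowelled text stays intact in the page.

```javascript
// Assumed tashkeel set, as a regex character class in string form.
var MARKS = "[\\u0640\\u064B-\\u065F\\u0670]";

// Escape characters that have a special meaning inside a regular expression.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: turns a query into a regex in which every letter
// may be followed by any number of tashkeel marks.
function buildLoosePattern(query) {
    var bare = query.replace(new RegExp(MARKS, "g"), "");
    if (bare.length === 0) {
        return null; // nothing left to search for
    }
    var parts = [];
    for (var i = 0; i < bare.length; i++) {
        parts.push(escapeRegex(bare.charAt(i)) + MARKS + "*");
    }
    return new RegExp(parts.join(""), "g");
}

// Hypothetical helper: wraps every loose match in <mark>...</mark>.
function highlightIgnoringTashkeel(text, query) {
    var pattern = buildLoosePattern(query);
    if (pattern === null) {
        return text;
    }
    return text.replace(pattern, function (match) {
        return "<mark>" + match + "</mark>";
    });
}
```

Calling `highlightIgnoringTashkeel(pageText, "الحمد")` on vowelled text would wrap the vowelled form of the word, marks included. Note that this works on plain text only; running it over raw HTML could match inside tags or attributes, so in a real page it would need to be applied per text node.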