diacritics | 易学教程

Regex to ignore accents? PHP

Is there any way to make a regex that ignores accents? For example: preg_replace("/$word/i", "<b>$word</b>", $str); The "i" flag makes the regex case-insensitive, but is there any way to match, for example, java with Jávã? I tried making a copy of $str, converting it to an unaccented string, and finding the index of every occurrence. But the indexes in the two strings seem to differ, even though the only change is the removed accents. (I did some research, but all I could find is how to remove accents from a string.)

I don't think there is such a way. That would be locale-dependent and you
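The index mismatch described in the question can be sidestepped by remembering, for every character of the accent-free copy, which position of the original it came from. The following is a minimal Python sketch of that idea, not a PHP solution; `fold_with_index_map` and `highlight` are hypothetical names, and it assumes that stripping Unicode combining marks after NFD decomposition is an acceptable definition of "ignoring accents".

```python
import unicodedata


def fold_with_index_map(text):
    """Return (folded, index_map): an accent-free, lowercase copy of text
    plus, for each folded character, its index in the original text."""
    folded = []
    index_map = []
    for i, ch in enumerate(text):
        # Decompose one original character at a time so positions stay known.
        for piece in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(piece):
                continue  # drop the accent, keep the base letter
            folded.append(piece)
            index_map.append(i)
    return "".join(folded), index_map


def highlight(text, word, open_tag="<b>", close_tag="</b>"):
    """Wrap every accent- and case-insensitive occurrence of word in tags."""
    folded, index_map = fold_with_index_map(text)
    needle, _ = fold_with_index_map(word)
    if not needle:
        return text
    out = []
    last = 0
    pos = folded.find(needle)
    while pos != -1:
        start = index_map[pos]
        end = index_map[pos + len(needle) - 1] + 1
        # Keep combining marks that trail the last matched base letter.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        out.append(text[last:start])
        out.append(open_tag + text[start:end] + close_tag)
        last = end
        pos = folded.find(needle, pos + len(needle))
    out.append(text[last:])
    return "".join(out)
```

Because the match is found in the folded copy but the slice is taken from the original, the accents survive in the highlighted output.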

normalizing accented characters in MySQL queries

I'd like to be able to run queries that normalize accented characters, so that, for example, é, è, and ê are all treated as 'e' in queries using '=' and 'like'. I have a row with the username field set to 'rené', and I'd like to be able to match it with both 'rene' and 'rené'. I'm attempting to do this with the 'collate' clause in MySQL 5.0.8. I get the following error:

    mysql> select * from User where username = 'rené' collate utf8_general_ci;
    ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

FWIW, my table was created with:

    CREATE TABLE `User` ( `id`

Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)

There is a very similar question already. One of the solutions uses code like this:

    string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s

This works wonders, until you notice it also removes spaces, dots, dashes, and who knows what else. I'm not really sure how the first code works, but could it be made to strip only accents? Or at the very least be given a list of chars to preserve? My knowledge of regexps is small, but I tried (to no avail):

    /[^\-x00-\x7F]/n # So it would leave the dash alone

I'm about to do something like this:

    string.mb_chars.normalize(:kd).gsub('-', '__DASH__'
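One way to strip only the accents is to decompose the string and then drop characters by Unicode category instead of by byte range. The sketch below shows this in Python rather than Ruby; `strip_accents` is a hypothetical name. It also illustrates one possible reading of why the quoted pattern eats punctuation: without a backslash before `x00`, the class lists the literal characters `x` and `0` followed by the range `0` to `\x7F`, so everything below `0` (space, dot, dash) falls outside it. That reading is checked here against Python's regex engine only, and is assumed rather than verified for Ruby.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks (category Mn) after NFKD decomposition,
    leaving spaces, punctuation and everything else untouched."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The class as quoted in the question: 'x', '0', then the range '0'..'\x7F'.
as_quoted = re.sub(r"[^x00-\x7F]", "", "a-b c.d")

# With the backslash restored, the range covers all of ASCII.
with_backslash = re.sub(r"[^\x00-\x7F]", "", "a-b c.d")
```

Filtering on the `Mn` category keeps dashes and dots because they are never combining marks, so no placeholder substitution is needed.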

How to replace unicode characters by ascii characters in Python (perl script given)?

I am trying to learn Python and couldn't figure out how to translate the following Perl script to Python:

    #!/usr/bin/perl -w
    use open qw(:std :utf8);
    while(<>) {
        s/\x{00E4}/ae/;
        s/\x{00F6}/oe/;
        s/\x{00FC}/ue/;
        print;
    }

The script just changes Unicode umlauts to alternative ASCII output. (So the complete output is in ASCII.) I would be grateful for any hints. Thanks!

Use the fileinput module to loop over standard input or a list of files, decode the lines you read from UTF-8 to unicode objects, then map any Unicode characters you desire with the translate method. translit.py would look like this:
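As a sketch of the approach that answer describes (not the original translit.py listing, which is not reproduced here), a Python 3 version might look like the following. The names `UMLAUTS`, `transliterate` and `main` are hypothetical, and the input is assumed to be UTF-8.

```python
import fileinput
import sys

# Translation table: each umlaut maps to a two-letter ASCII replacement.
UMLAUTS = str.maketrans({
    "\u00e4": "ae",  # a with diaeresis
    "\u00f6": "oe",  # o with diaeresis
    "\u00fc": "ue",  # u with diaeresis
})


def transliterate(line):
    """Replace every mapped umlaut in line."""
    return line.translate(UMLAUTS)


def main(paths=None):
    """Read the given files (or sys.argv / stdin when paths is None)
    and write the transliterated lines to standard output."""
    reader = fileinput.input(files=paths,
                             openhook=fileinput.hook_encoded("utf-8"))
    for line in reader:
        sys.stdout.write(transliterate(line))
```

Calling `main()` from a script entry point would loop over input in the same way as the Perl `while(<>)`. One behavioural difference: `translate` replaces every occurrence on a line, whereas the Perl substitutions have no `/g` flag and replace only the first.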

Test if string contains only letters (a-z + é ü ö ê å ø etc..)

I want to match a string to make sure it contains only letters. I've got this and it works just fine: var onlyLetters = /^[a-zA-Z]*$/.test(myString); But since I speak another language too, I need to allow all letters, not just A-Z, for example: é ü ö ê å ø. Does anyone know if there is a global 'alpha' term that includes all letters to use with RegExp? Or even better, does anyone have some kind of solution? Thanks a lot.

EDIT: Just realized that you might also want to allow '-' and ' ' in case of a double name like 'Mary-Ann' or 'Mary Ann'.

I don’t know the actual reason for doing this, but if
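For comparison, here is how the same check could be sketched in Python, where `str.isalpha()` is already Unicode-aware. `only_letters` is a hypothetical name. The sketch assumes that a single space or hyphen between name parts is allowed and that an empty string should be rejected, which differs from the `*` quantifier in the question.

```python
import re
import unicodedata


def only_letters(name):
    """True if name consists of letter-only parts separated by single
    spaces or hyphens, e.g. 'Mary-Ann' or 'Mary Ann'."""
    # Compose first so that 'e' + combining accent counts as one letter.
    composed = unicodedata.normalize("NFC", name)
    parts = re.split(r"[ -]", composed)
    # isalpha() is False for empty strings, which rejects '', 'a--b', '-a'.
    return all(part.isalpha() for part in parts)
```

The NFC step matters because a decomposed accent is a combining mark, not a letter, so `isalpha()` would otherwise reject visually identical input.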

Replacing diacritics in Javascript

How can I replace diacritics (ă, ş, ţ, etc.) with their "normal" form (a, s, t) in JavaScript?

If you want to do it entirely on the client side, I think your only option is some kind of lookup table. Here's a starting point, written by a chap called Olavi Ivask on his blog ...

    function replaceDiacritics(s) {
        var s;
        var diacritics = [
            /[\300-\306]/g, /[\340-\346]/g, // A, a
            /[\310-\313]/g, /[\350-\353]/g, // E, e
            /[\314-\317]/g, /[\354-\357]/g, // I, i
            /[\322-\330]/g, /[\362-\370]/g, // O, o
            /[\331-\334]/g, /[\371-\374]/g, // U, u
            /[\321]/g, /[\361]/g, // N, n
            /[\307]/g, /[\347]/g, // C, c
        ];
        var
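The lookup-table idea is not specific to JavaScript. As an illustration, here is a small Python sketch of an explicit table covering only the Romanian letters from the question; `ROMANIAN_TABLE` and `replace_diacritics` are hypothetical names, and the table is deliberately incomplete.

```python
# Explicit lookup table: accented code point -> plain ASCII letter.
ROMANIAN_TABLE = str.maketrans({
    "\u0103": "a", "\u0102": "A",  # a with breve
    "\u00e2": "a", "\u00c2": "A",  # a with circumflex
    "\u00ee": "i", "\u00ce": "I",  # i with circumflex
    "\u015f": "s", "\u015e": "S",  # s with cedilla
    "\u0219": "s", "\u0218": "S",  # s with comma below
    "\u0163": "t", "\u0162": "T",  # t with cedilla
    "\u021b": "t", "\u021a": "T",  # t with comma below
})


def replace_diacritics(text):
    """Map every character found in the table; leave all others as-is."""
    return text.translate(ROMANIAN_TABLE)
```

A table like this only handles the characters someone thought to list, which is the main trade-off against normalization-based approaches.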

Remove accents from String

Is there any way to remove accents from a String in Android, which (to my knowledge) doesn't have java.text.Normalizer? E.g. "éàù" becomes "eau". I'd like to avoid parsing the String to check each character if possible!

java.text.Normalizer is there in Android (on latest versions anyway). You can use it. EDIT: For reference, here is how to use Normalizer:

    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    string = string.replaceAll("[^\\p{ASCII}]", "");

(pasted from the link in comments below)

I've adjusted Rabi's solution to my needs, I hope it helps someone: private static Map
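The same two-step technique (decompose to NFD, then discard everything outside ASCII) can be sketched in Python for comparison; `remove_accents` is a hypothetical name. One caveat is that letters with no decomposition, such as ø, are dropped entirely instead of being mapped to a base letter.

```python
import unicodedata


def remove_accents(text):
    """NFD-decompose text, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFD", text)
    # errors="ignore" silently discards anything that is not ASCII,
    # which includes the combining marks produced by the decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For the example in the question, `remove_accents("éàù")` gives `"eau"`, while `"søster"` becomes `"sster"` because ø has no base-letter decomposition.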

How to ignore acute accent in a javascript regex match?

I need to match a word like 'César' with a regex like this: /^cesar/i. Is there an option like /i to configure the regex so it ignores acute accents? Or is the only solution to use a regex like this: /^césar/i?

The standard ECMAScript regex isn't ready for Unicode (see http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode). So you have to use an external regex library. I used this one (with the Unicode plugin) in the past: http://xregexp.com/ In your case, you may have to escape the char é as \u00E9 and define a range encompassing e, é, ê, etc. EDIT: I just saw the comment
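An alternative to listing every accented variant in a character class is to decompose the text and let each letter of the pattern be followed by any number of combining marks. The following Python sketch shows that idea; it is not the XRegExp approach from the answer, and `accent_insensitive_pattern` and `starts_with` are hypothetical names. It assumes the accents of interest fall in the Combining Diacritical Marks block (U+0300 to U+036F).

```python
import re
import unicodedata

# Zero or more code points from the Combining Diacritical Marks block.
COMBINING = "[\u0300-\u036f]*"


def accent_insensitive_pattern(word):
    """Build a regex source in which every base letter of word may be
    followed by combining marks."""
    base_letters = [ch for ch in unicodedata.normalize("NFD", word)
                    if not unicodedata.combining(ch)]
    return "".join(re.escape(ch) + COMBINING for ch in base_letters)


def starts_with(word, text):
    """Case- and accent-insensitive equivalent of /^word/i."""
    regex = re.compile("^" + accent_insensitive_pattern(word), re.IGNORECASE)
    # The text must be decomposed too, so accents become separate marks.
    return regex.match(unicodedata.normalize("NFD", text)) is not None
```

Decomposing both sides means the pattern works whether the accent appears in the search word, in the text, or in neither.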

Python and character normalization

Hello, I retrieve UTF-8 text data from a foreign source which contains special chars such as u"ıöüç", and I want to normalize them to English, such as "ıöüç" -> "iouc". What would be the best way to achieve this?

I recommend using the Unidecode module:

    >>> from unidecode import unidecode
    >>> unidecode(u'ıöüç')
    'iouc'

Note how you feed it a unicode string and it outputs a byte string. The output is guaranteed to be ASCII.

It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to

Removing diacritics in Silverlight (String.Normalize issue)

Question: I created a function that transforms diacritic characters into non-diacritic characters (based on this post). Here’s the code:

    Public Function RemoveDiacritics(ByVal searchInString As String) As String
        Dim returnValue As String = ""
        Dim formD As String = searchInString.Normalize(System.Text.NormalizationForm.FormD)
        Dim unicodeCategory As System.Globalization.UnicodeCategory = Nothing
        Dim stringBuilder As New System.Text.StringBuilder()
        For formScan As Integer = 0 To formD.Length - 1