diacritics

Is there a way to use NSString stringByFoldingWithOptions to unfold the single French 'œ' character into 'oe'?

匆匆过客 submitted on 2019-11-29 11:32:56
For a diacritics-agnostic full-text search feature, I use the following code to convert accented characters like é or Ö into their lowercase non-accented forms e and o:
    [[inputString stringByFoldingWithOptions: NSCaseInsensitiveSearch + NSDiacriticInsensitiveSearch + NSWidthInsensitiveSearch locale: [NSLocale currentLocale]] lowercaseString];
This works. However, I found no way to convert special characters whose base form consists of multiple characters, like the French œ (as in "sœur") or the German ß (as in "Fluß"). I would like to convert them into oe and ss respectively. I found no flag for
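The multi-character expansions asked about here can be sketched in Python rather than Objective-C. This is only an illustration of the idea, not the Foundation API: the `EXPANSIONS` table and the `fold_for_search` name are hypothetical, and the table covers just a few ligatures. It assumes that œ and æ have no Unicode decomposition (so they need an explicit mapping), while ß is expanded to ss by `str.casefold()`.

```python
import unicodedata

# Hypothetical lookup table for letters that Unicode normalization leaves intact.
EXPANSIONS = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def fold_for_search(text):
    """Return a lowercase, accent-free key suitable for loose text matching."""
    # Expand ligatures first, because NFKD does not split them.
    expanded = text.translate(EXPANSIONS)
    # NFKD separates base letters from combining marks and folds width variants.
    decomposed = unicodedata.normalize("NFKD", expanded)
    # Drop the combining marks (accents, umlauts, cedillas, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also turns the German sharp s into "ss".
    return stripped.casefold()
```

Both the indexed text and the search term would be passed through the same function before comparison, so that "sœur" and "soeur" produce the same key.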

MongoDB diacriticInSensitive search not showing all accented (words with diacritic mark) rows as expected and vice-versa

时光毁灭记忆、已成空白 submitted on 2019-11-29 11:28:22
I have a document collection with the following structure: uid, name. It has this index:
    db.Collection.createIndex({name: "text"})
It contains the following data:
    1, iphone
    2, iphóne
    3, iphonë
    4, iphónë
When I do a text search for iphone, I get only two records, which is unexpected. Actual output:
    1, iphone
    2, iphóne
If I search for iphonë
    db.Collection.find( { $text: { $search: "iphonë"} } );
I get:
    3, iphonë
    4, iphónë
But I am actually expecting the following output: db.Collection.find( { $text: { $search: "iphone"} } ); db.Collection.find( { $text: { $search:

Python: Convert Unicode to ASCII without errors for CSV file

二次信任 submitted on 2019-11-29 10:43:57
I've been reading all the questions regarding conversion from Unicode to CSV in Python here on Stack Overflow and I'm still lost. Every time, I receive a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)".
    buffer=cStringIO.StringIO()
    writer=csv.writer(buffer, csv.excel)
    cr.execute(query, query_param)
    while (1):
        row = cr.fetchone()
        writer.writerow([s.encode('ascii','ignore') for s in row])
The value of row is (56, u"LIMPIADOR BA\xd1O 1'5 L"), where the value of \xd1 in the database is ñ, an n with a diacritical tilde used in Spanish. At first I
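One way to write such rows without encoding errors is sketched below. It assumes Python 3 (the question's code is Python 2), and the helper names `to_ascii` and `rows_to_csv` are hypothetical. Instead of dropping ñ entirely, it decomposes the character first so the base letter n survives, and it converts non-string fields such as the integer 56 to text before encoding.

```python
import csv
import io
import unicodedata


def to_ascii(value):
    """Return value as ASCII-only text, dropping diacritics instead of raising."""
    text = value if isinstance(value, str) else str(value)
    # NFKD splits "ñ" into "n" plus a combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining tilde is not ASCII, so "ignore" discards it and keeps "n".
    return decomposed.encode("ascii", "ignore").decode("ascii")


def rows_to_csv(rows):
    """Serialize rows to CSV text using the Excel dialect."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, csv.excel)
    for row in rows:
        writer.writerow([to_ascii(field) for field in row])
    return buffer.getvalue()


sample_rows = [(56, "LIMPIADOR BA\xd1O 1'5 L")]
csv_text = rows_to_csv(sample_rows)
```

If the consumer of the file can read UTF-8, writing the text unchanged with an explicit encoding would preserve the ñ; the ASCII folding above is only for cases where plain ASCII is a hard requirement.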

How can I make a regular expression which takes accented characters into account?

狂风中的少年 submitted on 2019-11-29 10:40:40
I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that: "A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W"." (from "AS3 RegExp to match words with boundry type characters in them"). And since \w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
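The effect described here can be reproduced in Python, whose `re` module treats `\w` as Unicode-aware by default but falls back to the ASCII-only definition `[a-zA-Z0-9_]` under `re.ASCII`. This sketch is in Python rather than JavaScript, and the function name `two_letter_words` is hypothetical; it only illustrates why an ASCII-only `\w` finds a spurious two-letter word inside "côte".

```python
import re

# Two word characters with a word boundary on each side.
TWO_LETTER_WORD = r"\b\w{2}\b"


def two_letter_words(text, ascii_only=False):
    """Find two-letter words, optionally with the ASCII-only meaning of \\w."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(TWO_LETTER_WORD, text, flags)


sample = "la c\u00f4te"  # "la côte", with ô as a single code point

unicode_matches = two_letter_words(sample)
ascii_matches = two_letter_words(sample, ascii_only=True)
```

With the ASCII-only definition, ô counts as a non-word character, so a boundary appears between ô and t and the fragment "te" is reported as a word. With the Unicode-aware definition, only "la" matches.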

accent insensitive regex

删除回忆录丶 submitted on 2019-11-29 10:19:51
My code:
    jQuery.fn.extend({
      highlight: function(search){
        var regex = new RegExp('(<[^>]*>)|('+ search.replace(/[.+]i/,"$0") +')','ig');
        return this.html(this.html().replace(regex, function(a, b, c){
          return (a.charAt(0) == '<') ? a : '<strong class="highlight">' + c + '</strong>';
        }));
      }
    });
I want to highlight letters with accents, i.e. $('body').highlight("cao"); should highlight: [ção] OR [çÃo] OR [cáo] OR expre[cão]tion OR [Cáo]tion. How can I do that?
The sole correct way to do this is to first run it through Unicode Normalization Form D, canonical decomposition. You then strip out any
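The decomposition approach mentioned in the answer can be sketched as follows, in Python rather than jQuery. It is a simplified illustration under stated assumptions: the `highlight` function here works on plain text only (it does not skip HTML tags as the question's code does), and it treats the range U+0300 to U+036F as the set of combining marks, which covers common Latin accents but not every script.

```python
import re
import unicodedata

# Combining Diacritical Marks block; an assumption that suits Latin text.
MARKS = "[\u0300-\u036f]*"


def highlight(text, search):
    """Wrap accent-insensitive, case-insensitive matches of search in <strong>."""
    # Decompose so that "á" becomes "a" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Reduce the search term to its base letters.
    base = "".join(
        ch for ch in unicodedata.normalize("NFD", search)
        if not unicodedata.combining(ch)
    )
    # Allow any number of combining marks after every base letter.
    pattern = "".join(re.escape(ch) + MARKS for ch in base)
    marked = re.sub(
        pattern,
        lambda m: '<strong class="highlight">' + m.group(0) + "</strong>",
        decomposed,
        flags=re.IGNORECASE,
    )
    # Recompose so the output uses the usual precomposed characters.
    return unicodedata.normalize("NFC", marked)
```

Because the match runs against the decomposed text and the original marks are kept inside the match, the highlighted output still shows the accents of the source text.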

MySQL REGEXP query - accent insensitive search

点点圈 submitted on 2019-11-29 09:42:15
I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, so similar wines may be entered with or without accents). The basic query looks like this:
    SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'
which will return entries with 'Faugères' in the title, but not 'Faugeres'.
    SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'
does the opposite. I had thought something like:
    SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'
might do the trick, but this only returns the

WPF WebBrowser and special characters like german “umlaute”

戏子无情 submitted on 2019-11-29 08:49:29
I use the WPF WebBrowser control in my app. I have a file (mht) which contains German umlauts (ä ö ü). Now, I load this file with .Navigate(path), but the problem is that these characters are not shown correctly. How can I solve this? Best regards, Thomas
Gavin Jones
This is very quirky. My solution was to put an explicit meta tag in my HTML file, "My Page.html":
    <meta http-equiv='Content-Type' content='text/html;charset=UTF-8'>
Then, using the standard Web Browser .NET control, I created a URI object first:
    webBrowser1.Url = new Uri("My Page.html");
Then draw the page using the refresh

Code to strip diacritical marks using ICU

我的梦境 submitted on 2019-11-29 07:30:28
Can somebody please provide some sample code to strip diacritical marks (i.e., replace characters having accents, umlauts, etc., with their unaccented, unumlauted, etc., character equivalents, e.g., every accented é would become a plain ASCII e) from a UnicodeString using the ICU library in C++? E.g.:
    UnicodeString strip_diacritics( UnicodeString const &s ) {
      UnicodeString result;
      // ...
      return result;
    }
Assume that s has already been normalized. Thanks.
ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC : decompose, remove diacritics, recompose. The
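The three steps of the rule NFD; [:M:] Remove; NFC can be illustrated with Python's standard `unicodedata` module. This is an analogue of the rule, not the ICU C++ API the question asks about: it assumes that filtering on general categories beginning with "M" (Mn, Mc, Me) corresponds to the `[:M:]` set.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose, remove all mark characters, then recompose."""
    # Step 1 (NFD): split each accented letter into base letter plus marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 ([:M:] Remove): drop every character whose category is a mark.
    without_marks = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Step 3 (NFC): recompose whatever can still be combined.
    return unicodedata.normalize("NFC", without_marks)
```

Letters that carry no decomposition, such as ø or ß, pass through unchanged, since there is no separate mark to remove.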

Should all accented characters use html entities?

試著忘記壹切 submitted on 2019-11-29 05:55:35
I am working with a large number of HTML files that are mostly encoded as UTF-8. There are accented characters galore, as many are in French. I have been converting them to HTML entities as I go, but I noticed that even in IE5.5 (according to IE Tester) the non-converted accented characters are displaying properly. Should I be concerned with character display and convert them all to HTML entities just to be on the safe side?
If the files are UTF-8 encoded, you should set the Content-Type header to be text/html; charset=UTF-8 and have an equivalent meta tag on the page: <meta http-equiv="Content

PHP convert foreign characters with accents

末鹿安然 submitted on 2019-11-29 02:39:46
Hi, I'm trying to compare some text to the text in a database. In the database, any text with an accent is encoded as in HTML (i.e. &eacute;). When I compare the database text to my string, it doesn't match, because my string just shows é. When I use the PHP function htmlentities to encode the string first, the é turns into &Atilde;&copy;. Weird? Using htmlspecialchars doesn't encode the é at all. How would you suggest I compare é to &eacute;, as well as all the other accented characters?
You need to send in the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1. Change it
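The charset mismatch the answer points to can be reproduced in Python. This sketch is not PHP and does not call htmlentities; it only shows that the two UTF-8 bytes of é, when read as ISO-8859-1, become the two characters Ã and ©. The `same_text` helper is hypothetical and takes a different route from the answer: it decodes the entities in the database value instead of encoding the plain string.

```python
import html

accented = "\u00e9"  # é

# UTF-8 stores é as two bytes; reading them as ISO-8859-1 yields two characters.
mis_decoded = accented.encode("utf-8").decode("iso-8859-1")


def same_text(database_value, plain_value):
    """Compare entity-encoded database text with a plain Unicode string."""
    # html.unescape turns named references such as &eacute; back into characters.
    return html.unescape(database_value) == plain_value
```

Comparing in the decoded form avoids depending on which characters the encoding function chooses to turn into entities.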