diacritics

Regex - match a character and all its diacritic variations (aka accent-insensitive)

断了今生、忘了曾经 submitted on 2019-11-30 20:51:48
I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is: re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é") but that is not a general solution. If I use Unicode categories like \pL, I can't reduce the match to a specific character, in this case e. A workaround to achieve the desired goal would be to use unidecode to get rid of all diacritics first, and then just match against the regular e: re.match(r"^e$", unidecode("é")) Or, in this simplified case: unidecode("é") == "e" Another solution which doesn't
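The workaround described above (strip the diacritics first, then match the plain letter) can also be sketched with only the standard library. This is a minimal sketch, assuming that decomposing to NFD and dropping combining marks is acceptable in place of unidecode; the helper names strip_marks and matches_base are made up for illustration.

```python
import re
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose text to NFD and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def matches_base(pattern: str, text: str) -> bool:
    """Apply a pattern written for unaccented letters to the stripped text."""
    return re.match(pattern, strip_marks(text)) is not None


print(matches_base(r"^e$", "\u00e9"))        # precomposed e with acute
print(matches_base(r"^e$", "\u0119\u030b"))  # e with ogonek plus a combining double acute
print(matches_base(r"^e$", "a"))
```

Unlike unidecode, this only removes marks that Unicode itself treats as separable, so letters without a decomposition are left unchanged.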

Character encoding for French Accents

大憨熊 submitted on 2019-11-30 17:38:27
I'm developing my first website for a French client and I'm having massive issues with accents being displayed as "?". After googling it for days, I thought I understood, but the issues persist. To simplify it, I'll explain just the email headers (the message contains French accents): $headers = 'MIME-Version: 1.0' . "\r\n"; $headers .= 'Content-type: text/html; charset=iso-8859-1' . "\r\n"; I've tried using charset UTF-8 and iso-8859-1, but I still get this type of email: Merci pour votre intérêt pour les tee shirts. Can anyone help? I'm having these issues with MySQL, HTML, PHP everywhere
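The garbled sample in the question has the shape that appears when UTF-8 bytes are read as ISO-8859-1. The following is only a sketch of that mechanism in Python, not a diagnosis of the poster's PHP setup; the variable names are illustrative.

```python
text = "int\u00e9r\u00eat"  # the word "intérêt"

# Encode as UTF-8, then decode with the wrong charset to reproduce the garbling.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Decoding with the charset that was actually used restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

If this is what happens, the declared charset and the actual byte encoding disagree somewhere along the path (source file, database connection, or header), and they need to be made consistent at every step.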

Why doesn't Đ get flattened to D when Removing Accents/Diacritics

半世苍凉 submitted on 2019-11-30 17:36:07
I'm using this method to remove accents from my strings: static string RemoveAccents(string input) { string normalized = input.Normalize(NormalizationForm.FormKD); StringBuilder builder = new StringBuilder(); foreach (char c in normalized) { if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark) { builder.Append(c); } } return builder.ToString(); } but this method leaves đ as đ and doesn't change it to d, even though d is its base char. You can try it with this input string: "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ" What's so special about the letter đ? The answer for why it doesn't work is that
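The behaviour in the question can be reproduced outside .NET. In the Unicode character data, đ (U+0111) has no decomposition mapping, so normalization has nothing to split off. The sketch below shows this in Python and adds a small fallback table; the table FALLBACK and the function remove_accents are assumptions made for illustration, not part of the original method.

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode decomposition.
FALLBACK = str.maketrans({
    "\u0111": "d", "\u0110": "D",  # d with stroke
    "\u00f8": "o", "\u00d8": "O",  # o with stroke
    "\u0142": "l", "\u0141": "L",  # l with stroke
})


def remove_accents(text: str) -> str:
    """Strip combining marks after NFKD, then apply the fallback table."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(FALLBACK)


print(repr(unicodedata.decomposition("\u00e9")))  # e with acute has a mapping
print(repr(unicodedata.decomposition("\u0111")))  # d with stroke has none
print(remove_accents("\u0111\u00e9\u00f8"))
```

The fallback table is deliberately short; which stroked or ligature letters to fold, and to what, is a design decision for the application.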

How to deal with accented characters in iOS SQLite?

≡放荡痞女 submitted on 2019-11-30 14:48:56
Question: I need to perform SELECT queries that are insensitive to case and accents. For demo purposes, I create a table like this: create table table ( column text collate nocase ); insert into table values ('A'); insert into table values ('a'); insert into table values ('Á'); insert into table values ('á'); create index table_cloumn_Index on table (column collate nocase); Then, I get these results when executing the following queries: SELECT * FROM table WHERE column LIKE 'a'; > A > a SELECT * FROM
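One way to get comparisons that ignore both case and accents in SQLite is a custom collation. The sketch below uses Python's sqlite3 module rather than iOS code, purely to illustrate the idea; the collation name NOACCENT, the table words, and the helper fold are hypothetical. It compares with = because, as far as I know, LIKE does not consult collations.

```python
import sqlite3
import unicodedata


def fold(text: str) -> str:
    """Drop combining marks after NFD, then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()


def noaccent(a: str, b: str) -> int:
    """Collation callback: negative, zero, or positive like a comparator."""
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.create_collation("NOACCENT", noaccent)  # must be registered per connection
conn.execute("CREATE TABLE words (word TEXT COLLATE NOACCENT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("A",), ("a",), ("\u00c1",), ("\u00e1",), ("b",)],
)
rows = [r[0] for r in conn.execute("SELECT word FROM words WHERE word = 'a'")]
print(rows)
```

On iOS the equivalent hook would presumably be the C-level sqlite3_create_collation function, with the folding done by the platform's string APIs.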

Python regex \w doesn't match combining diacritics?

只谈情不闲聊 submitted on 2019-11-30 13:05:51
I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not if there is a Latin character with combining diacritics. >>> re.match("a\w\w\wz", u"aoooz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> print u"ao\u00F3oz" aoóoz >>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE) >>> print u"aoo\u0301oz" aóooz (Looks like the SO markdown processor is having trouble with the combining diacritics in the above, but there
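The session above is Python 2. A Python 3 sketch of the same experiment follows, assuming that composing the string to NFC before matching is acceptable. A combining mark such as U+0301 is in category Mn, which the standard re module does not count as a word character, while the precomposed ó is a letter.

```python
import re
import unicodedata

pattern = re.compile(r"a\w\w\wz")

decomposed = "aoo\u0301oz"  # 'o' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # mark folded into U+00F3

print(len(decomposed), pattern.match(decomposed))
print(len(composed), pattern.match(composed))
```

NFC only helps when a precomposed code point exists for the letter and mark pair; sequences without one stay decomposed and still fail to match.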

Failing to write german 'umlauts' (äöü) from console to text file with java

懵懂的女人 submitted on 2019-11-30 09:29:40
Question: Currently I'm desperately trying to write German umlauts, read from the console, into a UTF-8 encoded text file on Windows 7. Here is the code to set up the scanner: Scanner scanner = new Scanner(System.in, "UTF8"); Here is the code to read the string: String s = scanner.nextLine(); Here is the code to write into a file: OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(this.targetFile), "UTF8"); osw.write(s); Unfortunately, instead of, for example, "überraschung", the so written

Regex - match a character and all its diacritic variations (aka accent-insensitive)

狂风中的少年 submitted on 2019-11-30 05:16:32
Question: I am trying to match a character and all its possible diacritic variations (aka accent-insensitive) with a regular expression. What I could do of course is: re.match(r"^[eēéěèȅêęëėẹẽĕȇȩę̋ḕḗḙḛḝė̄]$", "é") but that is not a general solution. If I use Unicode categories like \pL, I can't reduce the match to a specific character, in this case e. Answer 1: A workaround to achieve the desired goal would be to use unidecode to get rid of all diacritics first, and then just match against the regular e re

Character encoding for French Accents

自闭症网瘾萝莉.ら submitted on 2019-11-30 01:38:51
Question: I'm developing my first website for a French client and I'm having massive issues with accents being displayed as "?". After googling it for days, I thought I understood, but the issues persist. To simplify it, I'll explain just the email headers (the message contains French accents): $headers = 'MIME-Version: 1.0' . "\r\n"; $headers .= 'Content-type: text/html; charset=iso-8859-1' . "\r\n"; I've tried using charset UTF-8 and iso-8859-1, but I still get this type of email: Merci pour votre

Python regex \w doesn't match combining diacritics?

两盒软妹~` submitted on 2019-11-29 18:49:20
Question: I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not if there is a Latin character with combining diacritics. >>> re.match("a\w\w\wz", u"aoooz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> print u"ao\u00F3oz" aoóoz >>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE) >>> print u"aoo\u0301oz" aóooz (Looks

Custom HTTP header value - trying to pass umlaut characters

青春壹個敷衍的年華 submitted on 2019-11-29 16:40:08
I am using Node.js and Express.js 3.x. As one of our authorization headers we are passing in the username. Some of our usernames contain umlaut characters: ü, ö, ä and the like. For usernames with just 'normal' characters, all works fine. But when a jörg tries to make a request, the server doesn't recognize the umlaut character in the header. Trying to simulate the problem, I: Created some tests that set the username header with the umlaut character. These tests pass; they are able to pass in the umlaut correctly. Used 'postman' and 'advanced rest client' Chrome extensions and made the
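A common way to sidestep non-ASCII bytes in header values is to percent-encode the value on the client and decode it on the server, so that only ASCII travels in the header. The sketch below shows that round trip in Python; it is not the poster's Node.js code, and the header name X-Username and both helper functions are hypothetical.

```python
from urllib.parse import quote, unquote

HEADER_NAME = "X-Username"  # hypothetical header name


def encode_header_value(value: str) -> str:
    """Percent-encode the UTF-8 bytes so the header value is pure ASCII."""
    return quote(value, safe="")


def decode_header_value(raw: str) -> str:
    """Reverse the percent-encoding on the receiving side."""
    return unquote(raw)


headers = {HEADER_NAME: encode_header_value("j\u00f6rg")}
print(headers)
print(decode_header_value(headers[HEADER_NAME]))
```

Both ends have to agree on this convention; a server that does not decode the value would see the literal percent-escaped string.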