PHP: Unicode accentuated char and diacritics

Submitted by 扶醉桌前 on 2021-02-15 11:44:15

Question


On our website, some Mac users have trouble when they copy and paste text from PDF files into a textarea (handled by TinyMCE). All accented characters are corrupted and become, for example, e? for é, i? for î, etc. I cannot reproduce this problem on a Windows computer.

When I wrote the content of the textarea to a file (before inserting it into the database), I discovered that the pasted character is visually different from a regular é (in Vim, see below).

Visual example of the problem

Indeed:

// the corrupted é (first line of the screenshot); $char holds the pasted character
echo bin2hex($char); // outputs 65cc81

// regular é
echo bin2hex('é');   // outputs c3a9

After a lot of searching, here is where I am: it seems that Mac OS copies accented Unicode characters as a combination of two characters: in our example, e + ́. So far, I haven't found any way to replace the corrupted é with the real one, to avoid e? in the database.

And I'm a little desperate.


Answer 1:


The process of normalizing the representation to one form or the other is called, well, normalization. In PHP there's the Normalizer class for that; sending all input through it is a good idea:

$input = Normalizer::normalize($input);

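You likely want to normalize to form C (canonical decomposition followed by canonical composition), which is also what Normalizer::normalize() uses when no form is given.

Should that class not be available on your system (it is part of the intl extension), there's the Patchwork UTF-8 library.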




Answer 2:


This just adds to what @deceze already answered. Unicode has multiple ways to specify the same (in the sense of equivalence) character.

You have a common example here:

65cc81

Those are two Unicode code points in UTF-8 encoding. 65 is e LATIN SMALL LETTER E (U+0065) and cc81 is ́ COMBINING ACUTE ACCENT (U+0301) (it cannot be displayed on its own by your browser, so I used the HTML entity).

In Unicode this is called a combining sequence. For some reason, however, your database does not support it, probably because the encoding of the column is not UTF-8 or the database connection has trouble with it.

It is canonically equivalent to

c3a9

That is a single Unicode code point in UTF-8 encoding. c3a9 is é LATIN SMALL LETTER E WITH ACUTE (U+00E9). It looks like your database has no problem dealing with it, probably because the database connection successfully re-encodes it to Latin-1 / ISO-8859-1. The combining accent U+0301, by contrast, has no Latin-1 equivalent, which would explain why it ends up as a ?.

So two ways of handling the data come to mind. It is either a problem in the re-encoding of the data or a problem storing the data.

If you want to turn the decomposed Unicode code point sequences into their composed single code points, use the Normalizer outlined in deceze's answer.

You can also allow UTF-8 to be stored in the database, and then you should not have a problem either.

Additionally, you should probably normalize anyway so that sorting and comparing data in the database or your program works better. As you can see, the binary sequences differ, which can cause problems for anything that compares on the binary level.

And sure, you save some traffic :)
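
The byte-level difference described above can be reproduced outside PHP as well. The following is a minimal sketch in JavaScript (the language of the TinyMCE side of this problem), not the code from the question; utf8Hex is a hypothetical helper name introduced only for illustration, and the sketch assumes a runtime that provides TextEncoder and String.prototype.normalize.

```javascript
// Hypothetical helper (not from any library): hex dump of a string's UTF-8 bytes,
// comparable to PHP's bin2hex() on a UTF-8 string.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str))
    .map((byte) => byte.toString(16).padStart(2, '0'))
    .join('');
}

const decomposed = 'e\u0301'; // e + COMBINING ACUTE ACCENT, as pasted from the PDF
const precomposed = '\u00e9'; // é as a single code point

console.log(utf8Hex(decomposed));  // 65cc81
console.log(utf8Hex(precomposed)); // c3a9

// Both strings render the same, but they differ on the binary level...
console.log(decomposed === precomposed); // false

// ...until they are normalized to the same form (NFC here).
const normalized = decomposed.normalize('NFC');
console.log(normalized === precomposed); // true
console.log(utf8Hex(normalized));        // c3a9
```

The same comparison would fail in a database or in any byte-wise string comparison unless one side is normalized first.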




Answer 3:


There is a TinyMCE configuration parameter which lets you define a function to process the pasted contents before they are inserted into the editor: paste_preprocess

Using that function you can replace the special characters with the desired form:

tinyMCE.init({
        ...
        paste_preprocess : function(pl, o) {
            // o.content is the HTML string from the clipboard
            // example: replace every U+2020 (dagger); without the g flag only the first match is replaced
            o.content = o.content.replace(/\u2020/g, 'x');
        },
        paste_postprocess : function(pl, o) {
            ...
        },
        ...
});
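
Instead of replacing individual characters, the callback could normalize the whole pasted string. What follows is only a sketch under assumptions: normalizePastedContent is a hypothetical helper name, the object passed in is assumed to carry the clipboard HTML in its content property as in the snippet above, and the browser is assumed to support String.prototype.normalize (older browsers do not).

```javascript
// Hypothetical helper (not part of TinyMCE): converts the pasted HTML to NFC so that
// sequences such as e + U+0301 become the single code point é (U+00E9).
function normalizePastedContent(o) {
  if (typeof o.content === 'string' && typeof o.content.normalize === 'function') {
    o.content = o.content.normalize('NFC');
  }
  return o;
}

// Wiring it into the option shown above (other settings omitted):
// tinyMCE.init({
//     paste_preprocess : function(pl, o) { normalizePastedContent(o); }
// });

// Standalone check with a stand-in for the object TinyMCE passes to the callback.
const pasted = { content: '<p>cre\u0301e\u0301</p>' };
normalizePastedContent(pasted);
console.log(pasted.content === '<p>cr\u00e9\u00e9</p>'); // true
```

This only covers text that goes through the editor's paste handling, so normalizing again on the server, as the other answers suggest, is still worthwhile.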


Source: https://stackoverflow.com/questions/13586060/php-unicode-accentuated-char-and-diacritics
