Strange behaviour of mb_detect_order() in PHP

梦毁少年i 2021-01-05 10:31

I would like to detect the encoding of some text using PHP. For that purpose I use the mb_detect_encoding() function.

The problem is that the function returns different re

4 answers
  •  谎友^ (OP)
    2021-01-05 10:56

    mb_detect_encoding() takes the first charset entry in your mb_detect_order() list and walks through your input $html, checking whether the bytes form valid characters in that charset. If the whole string is valid, it returns that charset's name; if any character fails, it moves on to the next charset in mb_detect_order() and tries again.
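
    The loop described above can be approximated in userland. This is only a sketch of the idea, not PHP's internal implementation: first_valid_encoding() is a hypothetical helper name, and it relies on mb_check_encoding() to validate the whole string against one charset at a time.

```php
<?php
// Hypothetical helper: returns the first candidate charset in which the
// whole input is a valid byte sequence, or false if none of them fits.
function first_valid_encoding(string $input, array $candidates)
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() validates the entire string against one charset.
        if (mb_check_encoding($input, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

$html = "caf\xC3\xA9"; // "café" encoded as UTF-8

// ASCII is rejected because of the bytes above 0x7F, so the next entry is tried.
echo first_valid_encoding($html, ['ASCII', 'UTF-8', 'ISO-8859-1']), "\n"; // prints "UTF-8"
```

    Newer PHP releases may also weigh how plausible each candidate is, so the built-in function is not guaranteed to behave exactly like this loop.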

    Wikipedia's list of charsets is a good place to see the characters that make up each charset.

    Because the byte values used by these charsets overlap (the bytes 0xC3 0xA9, for example, are valid in both 'UTF-8' and 'EUC-JP'), a sequence can count as a match even though it represents a totally different character in each charset. So unless the input contains a byte sequence that is valid in one charset but not in another, mb_detect_encoding() can't tell which of the charsets is wrong, and it returns the first charset from your list that could be valid.
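
    To see the overlap in practice, the sketch below checks one byte pair against several charsets. The bytes 0xC3 0xA9 were chosen for illustration, and valid_encodings() is a hypothetical helper, not part of mbstring. It lists every candidate that accepts the input, which makes it clear why the position in the list decides the outcome when more than one charset fits.

```php
<?php
// Hypothetical helper: lists every candidate charset in which $bytes is valid,
// keeping the order of the candidate list.
function valid_encodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        function ($encoding) use ($bytes) {
            return mb_check_encoding($bytes, $encoding);
        }
    ));
}

// "é" in UTF-8, a two-byte kanji in EUC-JP, and "Ã©" in ISO-8859-1.
$bytes = "\xC3\xA9";
print_r(valid_encodings($bytes, ['UTF-8', 'EUC-JP', 'ISO-8859-1']));

// When several charsets accept the input, the first one listed wins.
echo valid_encodings($bytes, ['UTF-8', 'EUC-JP'])[0], "\n"; // prints "UTF-8"
echo valid_encodings($bytes, ['EUC-JP', 'UTF-8'])[0], "\n"; // prints "EUC-JP"

// A lone 0xE9 byte falls into a "gap" of UTF-8, so only ISO-8859-1 remains.
print_r(valid_encodings("\xE9", ['UTF-8', 'ISO-8859-1']));
```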

    As far as I'm aware, there is no surefire way of identifying a charset. PHP's "best guess" method can be helped if you have a reasonable idea of which charsets you are likely to encounter, and order your list accordingly based on the gaps (invalid byte sequences) in each charset. The best solution is to "know" the charset. If you are scraping your HTML from another page, look for the charset identifier in the header of that page.
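
    Reading the declared charset could look roughly like the sketch below. declared_charset() is a hypothetical helper; it assumes you already hold the value of the Content-Type response header (or null) and the raw HTML, and it uses simple regular expressions rather than a full HTML parser.

```php
<?php
// Hypothetical helper: returns the charset a page declares about itself,
// checking the Content-Type header value first and a <meta> tag second.
function declared_charset(?string $contentTypeHeader, string $html): ?string
{
    $pattern = '/charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if ($contentTypeHeader !== null && preg_match($pattern, $contentTypeHeader, $m)) {
        return strtoupper($m[1]);
    }
    // Covers <meta charset="..."> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared: fall back to detection
}

echo declared_charset('text/html; charset=Shift_JIS', ''), "\n";   // prints "SHIFT_JIS"
echo declared_charset(null, '<meta charset="utf-8">'), "\n";       // prints "UTF-8"
```

    A declared charset can still be wrong, so it may be worth confirming it with mb_check_encoding() before converting.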

    If you really want to be clever, you can try to identify the language in which the HTML is written, perhaps using trigrams or other n-grams, as described in this article on PHP/ir.
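
    As a toy illustration of the trigram idea (not the method from the linked article, whose details are not reproduced here), the sketch below counts how many character trigrams of a text also occur in a short sample of each language. The function names and the two sample sentences are made up for this example; a real profile would be built from far more text.

```php
<?php
// Counts overlapping character trigrams in a lowercase copy of $text.
function trigram_counts(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    $length = mb_strlen($text, 'UTF-8');
    $counts = [];
    for ($i = 0; $i + 3 <= $length; $i++) {
        $gram = mb_substr($text, $i, 3, 'UTF-8');
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

// Scores $text against each language sample by counting the trigrams they
// share, and returns the name of the best-scoring language.
function guess_language(string $text, array $samples): string
{
    $textGrams = trigram_counts($text);
    $bestLanguage = '';
    $bestScore = -1;
    foreach ($samples as $language => $sample) {
        $profile = trigram_counts($sample);
        $score = 0;
        foreach ($textGrams as $gram => $count) {
            if (isset($profile[$gram])) {
                $score += $count;
            }
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestLanguage = $language;
        }
    }
    return $bestLanguage;
}

// Tiny made-up samples, far too small for real use.
$samples = [
    'english' => 'the quick brown fox jumps over the lazy dog and then runs into the forest',
    'german'  => 'der schnelle braune fuchs springt in den wald und der hund folgt ihm dann langsam',
];

echo guess_language('the dog runs over the hill', $samples), "\n";   // prints "english"
echo guess_language('der hund springt in den wald', $samples), "\n"; // prints "german"
```

    Once the language is known, the list of plausible charsets usually shrinks to a handful, which makes the ordering problem much easier.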
