Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the fee

24 answers
  •  甜味超标
    2020-11-22 03:20

    You first have to detect which encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML declaration (the <?xml ... ?> line at the very start of the document). If that’s missing too, use UTF-8 as defined in the specification.
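
    The lookup order described above can be sketched as a small function. This is only an illustration: detectFeedEncoding() is a hypothetical name, and it assumes you already have the Content-Type header value and the response body as separate strings.

```php
<?php
// Sketch only: detectFeedEncoding() is a hypothetical helper, not a library function.
function detectFeedEncoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header value, if present.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^\s";]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute of the XML declaration at the start of the body.
    if (preg_match('/^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. Neither is present: assume UTF-8.
    return 'utf-8';
}

// Example calls, one per fallback step.
echo detectFeedEncoding('text/xml; charset=ISO-8859-1', '<rss/>'), "\n";
echo detectFeedEncoding('text/xml', '<?xml version="1.0" encoding="windows-1252"?><rss/>'), "\n";
echo detectFeedEncoding(null, '<rss/>'), "\n";
```

    The name is returned in lower case so that it can be compared against a list of accepted encodings without worrying about how the feed spelled it.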


    Edit: Here is what I would probably do:

    I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to split it into header and body. The header should then contain the Content-Type header field, which holds the MIME type and (hopefully) the charset parameter with the encoding too. If not, we’ll check the XML declaration for the encoding attribute and take the encoding from there. If that’s also missing, the XML specification says to assume UTF-8.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
    
    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $header = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    $response = curl_exec($curl);
    if (!$response) {
        // error fetching the response
    } else {
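        // Split the raw response into header and body at the first blank line.
        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        // The body is needed below even when the charset comes from the header.
        $body = substr($response, $offset + 4);
        if (!$header || !preg_match('/^Content-Type:\s*([^;\s]+)(?:;\s*charset=([^\s;]+))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // The charset group is absent from $match when the parameter is missing.
            $encoding = isset($match[2]) ? strtolower(trim($match[2], '"\'')) : null;
        }
        if (!$encoding) {
            // No charset in the HTTP header: fall back to the XML declaration.
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = strtolower(trim($match[1], '"\''));
            }
        }
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
            // $encoding is lower case here, so compare against a lower-cased list.
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
            }
        }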
        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }
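
    One thing to watch in the code above: after mb_convert_encoding() the body is UTF-8, but its XML declaration may still name the old encoding. As far as I know, the parser behind simplexml_load_string() trusts that declaration, which can garble the text a second time. A minimal sketch of a helper that converts the body and keeps the declaration in sync follows; convertFeedToUtf8() is a hypothetical name, and it assumes the mbstring extension knows the source encoding.

```php
<?php
// Sketch only: convertFeedToUtf8() is a hypothetical helper name.
function convertFeedToUtf8(string $body, string $encoding): string
{
    // Convert only when the source is not already UTF-8.
    if (strcasecmp($encoding, 'utf-8') !== 0) {
        $body = mb_convert_encoding($body, 'UTF-8', $encoding);
    }
    // Rewrite the encoding attribute of the XML declaration, if there is one,
    // so that it matches the bytes that are now in $body.
    $fixed = preg_replace(
        '/^(<\?xml\s[^>]*?\bencoding\s*=\s*)(["\'])[^"\']*\2/i',
        '${1}${2}UTF-8${2}',
        $body,
        1
    );
    // preg_replace() returns null on failure; fall back to the converted body.
    return $fixed ?? $body;
}
```

    With a helper like this, the conversion step before simplexml_load_string() becomes a single call such as $body = convertFeedToUtf8($body, $encoding);.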
    
