Remove non-ASCII characters from string

遥遥无期 2020-11-28 03:39

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a standard (non-extended) ASCII character?

8 Answers
  •  天涯浪人
    2020-11-28 04:19

    Somewhat related: we had a web application that had to send data to a legacy system that could only handle the first 128 characters of the ASCII character set.

    The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, while leaving alone anything that could not be translated.

    Normally I would just transliterate with iconv (the same call that is left commented out near the end of the function below), but that replaces everything that can't be translated with a question mark (?).
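
    For reference, a minimal sketch of that iconv-based approach. The function name transliterateToAscii is hypothetical (it is not from the original answer), and what iconv does with characters it cannot transliterate depends on the underlying iconv implementation and the current locale, so no output is shown here.

```php
<?php
// Hypothetical helper, for illustration only.
function transliterateToAscii(string $text): string
{
    if (!function_exists('iconv')) {
        return $text; // iconv extension not available: leave the input untouched
    }

    // //TRANSLIT asks iconv to approximate characters that have no ASCII form.
    // Characters it cannot approximate are handled in an implementation-defined
    // way (glibc typically substitutes '?').
    $converted = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);

    // iconv() returns false on failure; fall back to the original text.
    return $converted === false ? $text : $converted;
}

echo transliterateToAscii("Crème brûlée"), PHP_EOL;
```

    Because the result varies between systems, it is worth checking the output on the target platform before relying on it.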

    So we ended up doing the following. See the end of the function for a (commented-out) PHP regex that simply strips out non-ASCII characters.

    =", $text);
        $text = preg_replace("/[‗≈≡]/u", "=", $text);
    
    
        // Exciting combinations    
        $text = str_replace(array("ы", "Ы"), "bl", $text);   // match either letter, not only the two-character sequence
        $text = str_replace("℅", "c/o", $text);
        $text = str_replace("₧", "Pts", $text);
        $text = str_replace("™", "tm", $text);
        $text = str_replace("№", "No", $text);        
        $text = str_replace("Ч", "4", $text);                
        $text = str_replace("‰", "%", $text);
        $text = preg_replace("/[∙•]/u", "*", $text);
        $text = str_replace("‹", "<", $text);
        $text = str_replace("›", ">", $text);
        $text = str_replace("‼", "!!", $text);
        $text = str_replace("⁄", "/", $text);
        $text = str_replace("∕", "/", $text);
        $text = str_replace("⅞", "7/8", $text);
        $text = str_replace("⅝", "5/8", $text);
        $text = str_replace("⅜", "3/8", $text);
        $text = str_replace("⅛", "1/8", $text);        
        $text = preg_replace("/[Љљ]/u", "Ab", $text);
        $text = preg_replace("/[Юю]/u", "IO", $text);
        $text = str_replace("\u{FB01}", "fi", $text);   // Latin Small Ligature FI
        $text = str_replace("\u{FB02}", "fl", $text);   // Latin Small Ligature FL
        $text = preg_replace("/[зЗ]/u", "3", $text); 
        $text = str_replace("£", "(pounds)", $text);
        $text = str_replace("₤", "(lira)", $text);
        $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
        $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);
    
    
        //2) Translation CP1252.
        $trans = get_html_translation_table(HTML_ENTITIES);
        $trans['f'] = 'ƒ';    // Latin Small Letter F With Hook
        $trans['-'] = array(
            '…',     // Horizontal Ellipsis
            '˜',      // Small Tilde
            '–'       // Dash
            );
        $trans["+"] = '†';    // Dagger
        $trans['#'] = '‡';    // Double Dagger         
        $trans['M'] = '‰';    // Per Mille Sign
        $trans['S'] = 'Š';    // Latin Capital Letter S With Caron        
        $trans['OE'] = 'Œ';    // Latin Capital Ligature OE
        $trans["'"] = array(
            '‘',  // Left Single Quotation Mark
            '’',  // Right Single Quotation Mark
            '›', // Single Right-Pointing Angle Quotation Mark
            '‚',  // Single Low-9 Quotation Mark
            'ˆ',   // Modifier Letter Circumflex Accent
            '‹'  // Single Left-Pointing Angle Quotation Mark
            );
    
        $trans['"'] = array(
            '“',  // Left Double Quotation Mark
            '”',  // Right Double Quotation Mark
            '„',  // Double Low-9 Quotation Mark
            );
    
        $trans['*'] = '•';    // Bullet
        $trans['n'] = '–';    // En Dash
        $trans['m'] = '—';    // Em Dash        
        $trans['tm'] = '™';    // Trade Mark Sign
        $trans['s'] = 'š';    // Latin Small Letter S With Caron
        $trans['oe'] = 'œ';    // Latin Small Ligature OE
        $trans['Y'] = 'Ÿ';    // Latin Capital Letter Y With Diaeresis
        $trans['euro'] = '€';    // euro currency symbol
        ksort($trans);
    
        foreach ($trans as $k => $v) {
            $text = str_replace($v, $k, $text);
        }
    
        // 3) remove HTML tags
        $text = strip_tags($text);

        // 4) decode remaining entities: &amp; => &, &quot; => "
        $text = html_entity_decode($text);

        // transliterate
        // if (function_exists('iconv')) {
        //     $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        // }

        // remove non ascii characters
        // $text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

        return $text;
    }
    ?>
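
    If stripping is all that is needed, as the question asks, the commented-out regex idea can be reduced to a sketch like the one below. The helper name stripNonAscii is an assumption for illustration. Unlike the commented-out pattern above, this variant keeps tab, CR and LF, and also removes DEL (0x7F).

```php
<?php
// Hypothetical helper, for illustration only.
function stripNonAscii(string $text): string
{
    // Keep printable ASCII (0x20-0x7E) plus tab, CR and LF; drop every other byte.
    // There is no /u modifier, so the pattern works byte-wise and each byte of a
    // multi-byte UTF-8 sequence is removed individually.
    $result = preg_replace('/[^\x20-\x7E\t\r\n]/', '', $text);

    // preg_replace() returns null on error; fall back to the original text.
    return $result ?? $text;
}

// "Â" (U+00C2) followed by a no-break space (U+00A0), as in the question.
echo stripNonAscii("10\u{00C2}\u{00A0}USD"), PHP_EOL; // prints "10USD"
```

    This simply drops characters rather than approximating them, so "é" disappears instead of becoming "e"; the longer function above is for cases where that loss matters.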
