How to skip invalid characters in XML file using PHP

再見小時候 2020-12-01 05:51

I'm trying to parse an XML file using PHP, but I get an error message:

parser error : Char 0x0 out of allowed range in

I think it'

7 answers
  • 2020-12-01 06:03

    Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

  • 2020-12-01 06:06

    Not a PHP solution, but it works:

    Download Notepad++ https://notepad-plus-plus.org/

    Open your .xml file in Notepad++

    From the main menu, open Search -> Replace and set Search Mode to Extended.

    Then set Find what to \x00 and leave Replace with empty.

    Then click Replace All.

    Rob
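
    For readers who want the same replacement inside PHP, here is a minimal sketch. The helper name remove_nul_bytes is made up for illustration, and the sketch assumes the file is in a single-byte-compatible encoding such as UTF-8; in UTF-16 or UTF-32 data, 0x00 bytes are part of ordinary characters and must not be stripped this way.

```php
<?php
// Hypothetical helper (not from the thread): drop every NUL byte (0x00)
// from raw XML text before handing it to the parser.
function remove_nul_bytes(string $xml): string
{
    return str_replace("\x00", '', $xml);
}

// Sample input with two stray NUL bytes.
$raw = "<root>\x00<item>a\x00b</item></root>";
$clean = remove_nul_bytes($raw);

$doc = simplexml_load_string($clean);
echo $doc->item, PHP_EOL; // prints "ab"
```

    Unlike the character-by-character functions in the other answers, this removes the byte instead of replacing it with a space.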

  • 2020-12-01 06:13

    My problem was the "&" character (hex 0x26), so I changed the function to also replace every character below 0x28:

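    function stripInvalidXml($value)
    {
        $ret = "";
        // Cast first and compare with "": empty() would also discard the string "0".
        $value = (string) $value;
        if ($value === "")
        {
            return $ret;
        }
    
        $length = strlen($value);
        for ($i = 0; $i < $length; $i++)
        {
            // ord() yields one byte (0-255), so the ranges above 0xFF never match;
            // bytes of multi-byte UTF-8 sequences pass through unchanged.
            $current = ord($value[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                // Lower bound raised from 0x20 to 0x28: this replaces "&" (0x26),
                // but also ! " # $ % and ' with a space.
                (($current >= 0x28) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF)))
            {
                $ret .= chr($current);
            }
            else
            {
                $ret .= " ";
            }
        }
        return $ret;
    }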
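
    An alternative to deleting "&" is to escape it so the text survives intact. The following is a sketch only, not the answerer's method: escape_xml_text is a hypothetical helper name, and it assumes the input is raw text that has not been escaped already (otherwise an existing &amp; would be escaped a second time).

```php
<?php
// Hypothetical helper (not from the thread): escape text for use as
// XML element content or attribute values.
function escape_xml_text(string $text): string
{
    // ENT_XML1 selects XML 1.0 escaping rules; ENT_QUOTES also escapes
    // single and double quotes so the result is safe inside attributes.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$xml = '<name>' . escape_xml_text('Tom & Jerry') . '</name>';
echo $xml, PHP_EOL; // prints "<name>Tom &amp; Jerry</name>"

// The parser turns the entity back into the original character.
$doc = simplexml_load_string($xml);
echo (string) $doc, PHP_EOL; // prints "Tom & Jerry"
```

    Escaping only handles markup characters such as & and <. Control characters like 0x0 still have to be removed separately.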
    
  • 2020-12-01 06:17

    Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.

    And you also need to clear the invalid characters:

    /**
     * Replaces bytes that are not valid XML characters with spaces.
     *
     * @access public
     * @param string $value
     * @return string
     */
    function stripInvalidXml($value)
    {
        $ret = "";
        // Cast first and compare with "": empty() would also discard the string "0".
        $value = (string) $value;
        if ($value === "")
        {
            return $ret;
        }
    
        $length = strlen($value);
        for ($i = 0; $i < $length; $i++)
        {
            // ord() yields one byte (0-255), so the ranges above 0xFF never match;
            // bytes of multi-byte UTF-8 sequences pass through unchanged.
            $current = ord($value[$i]);
            if (($current == 0x9) ||
                ($current == 0xA) ||
                ($current == 0xD) ||
                (($current >= 0x20) && ($current <= 0xD7FF)) ||
                (($current >= 0xE000) && ($current <= 0xFFFD)) ||
                (($current >= 0x10000) && ($current <= 0x10FFFF)))
            {
                $ret .= chr($current);
            }
            else
            {
                $ret .= " ";
            }
        }
        return $ret;
    }
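
    The CDATA suggestion at the start of this answer can be sketched as follows. The helper name wrap_cdata is hypothetical. A CDATA section protects markup characters such as < and &, but the sequence "]]>" would close it early, so the sketch splits that sequence across two sections. CDATA does not lift the XML character range, which is why the stripping step above is still needed.

```php
<?php
// Hypothetical helper (not from the thread): wrap text in a CDATA section.
function wrap_cdata(string $text): string
{
    // "]]>" would end the section early, so close the section after "]]"
    // and reopen a new one before ">".
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);
    return '<![CDATA[' . $safe . ']]>';
}

$xml = '<note>' . wrap_cdata('1 < 2 && x ]]> y') . '</note>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// textContent concatenates the adjacent CDATA sections.
echo $dom->documentElement->textContent, PHP_EOL; // prints "1 < 2 && x ]]> y"
```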
    
  • 2020-12-01 06:19

    I decided to test all Unicode code points (0-1114111) to make sure things work as they should. Using preg_replace() on its own returns NULL because of errors when it is run against all of those values, so this is the solution I've come up with.

    $utf_8_range = range(0, 1114111);
    $output = ords_to_utfstring($utf_8_range);
    $sanitized = sanitize_for_xml($output);
    
    
    /**
     * Removes characters that are not valid in XML 1.0.
     *
     * @access public
     * @param string $input
     * @return string
     */
    function sanitize_for_xml($input) {
      // Convert input to UTF-8, dropping characters that cannot be converted.
      $old_setting = mb_substitute_character();
      mb_substitute_character('none');
      $input = mb_convert_encoding($input, 'UTF-8', 'auto');
      mb_substitute_character($old_setting);
    
      // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
      // The last range keeps supplementary characters (U+10000 to U+10FFFF),
      // which are valid XML and are also kept by the fallback below.
      $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
      if (is_null($output)) {
        // Convert to ints, then convert the ints back into a string, scrubbing
        // characters that are not valid XML along the way.
        $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
      }
      return $output;
    }
    
    /**
     * Given a UTF-8 string, output an array of ordinal values.
     *
     * @param string $input
     *   UTF-8 string.
     * @param string $encoding
     *   Defaults to UTF-8.
     *
     * @return array
     *   Array of ordinal values representing the input string.
     */
    function utfstring_to_ords($input, $encoding = 'UTF-8'){
      // Turn a string of unicode characters into UCS-4BE, which is a Unicode
      // encoding that stores each character as a 4 byte integer. This accounts for
      // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
      // big-endian order. The reason for this encoding is that each character is a
      // fixed size, making iterating over the string simpler.
      $input = mb_convert_encoding($input, "UCS-4BE", $encoding);
    
      // Visit each unicode character.
      $ords = array();
      // Compute the length once instead of on every loop iteration.
      $length = mb_strlen($input, "UCS-4BE");
      for ($i = 0; $i < $length; $i++) {
        // Now we have 4 bytes. Find their total numeric value.
        $s2 = mb_substr($input, $i, 1, "UCS-4BE");
        $val = unpack("N", $s2);
        $ords[] = $val[1];
      }
      return $ords;
    }
    
    /**
     * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
     *
     * @param array $ords
     *   Array of integers representing Unicode characters.
     * @param bool $scrub_XML
     *   Set to TRUE to remove non valid XML characters.
     *
     * @return string
     *   UTF-8 String.
     */
    function ords_to_utfstring($ords, $scrub_XML = FALSE) {
      $output = '';
      foreach ($ords as $ord) {
        // < 0: Negative numbers.
        // 0xD800 - 0xDFFF (55296 - 57343): Surrogate range.
        // 0xFEFF (65279): BOM (byte order mark).
        // > 0x10FFFF (1114111): Out of range.
        if (   $ord < 0
            || ($ord >= 0xD800 && $ord <= 0xDFFF)
            || $ord == 0xFEFF
            || $ord > 0x10ffff) {
          // Skip values that cannot be encoded as UTF-8, plus the BOM.
          continue;
        }
        // 9: Anything Below 9.
        // 11: Vertical Tab.
        // 12: Form Feed.
        // 14-31: Unprintable control codes.
        // 65534, 65535: Unicode noncharacters.
        elseif ($scrub_XML && (
                   $ord < 0x9
                || $ord == 0xB
                || $ord == 0xC
                || ($ord > 0xD && $ord < 0x20)
                || $ord == 0xFFFE
                || $ord == 0xFFFF
                )) {
          // Skip non valid XML values.
          continue;
        }
        // 127: 1 Byte char.
        elseif ( $ord <= 0x007f) {
          $output .= chr($ord);
          continue;
        }
        // 2047: 2 Byte char.
        elseif ($ord <= 0x07ff) {
          $output .= chr(0xc0 | ($ord >> 6));
          $output .= chr(0x80 | ($ord & 0x003f));
          continue;
        }
        // 65535: 3 Byte char.
        elseif ($ord <= 0xffff) {
          $output .= chr(0xe0 | ($ord >> 12));
          $output .= chr(0x80 | (($ord >> 6) & 0x003f));
          $output .= chr(0x80 | ($ord & 0x003f));
          continue;
        }
        // 1114111: 4 Byte char.
        elseif ($ord <= 0x10ffff) {
          $output .= chr(0xf0 | ($ord >> 18));
          $output .= chr(0x80 | (($ord >> 12) & 0x3f));
          $output .= chr(0x80 | (($ord >> 6) & 0x3f));
          $output .= chr(0x80 | ($ord & 0x3f));
          continue;
        }
      }
      return $output;
    }
    

    And to do this on a simple object or array:

    // Recursive sanitize_for_xml.
    function recursive_sanitize_for_xml(&$input){
      if (is_null($input) || is_bool($input) || is_numeric($input)) {
        return;
      }
      if (!is_array($input) && !is_object($input)) {
        $input = sanitize_for_xml($input);
      }
      else {
        foreach ($input as &$value) {
          recursive_sanitize_for_xml($value);
        }
      }
    }
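
    This answer mentions that preg_replace() can return NULL. One common cause, shown in the sketch below, is a subject string that is not well-formed UTF-8 while the /u modifier is in use; whether that was the exact cause in the answerer's test is an assumption. The function name strip_invalid_xml_chars and the sample strings are made up for illustration.

```php
<?php
// Hypothetical wrapper (not from the thread) around the character class
// for XML 1.0, including the supplementary range.
function strip_invalid_xml_chars(string $input): ?string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
    // With /u, preg_replace() returns null when $input is not valid UTF-8.
    return preg_replace($pattern, '', $input);
}

$valid = "ok\x01";        // well-formed UTF-8 containing one control character
$broken = "bad\xC3\x28";  // 0xC3 must be followed by a continuation byte

var_dump(strip_invalid_xml_chars($valid));            // string(2) "ok"
var_dump(strip_invalid_xml_chars($broken));           // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```

    Checking for NULL, as sanitize_for_xml() does, is what makes the slower fallback path possible.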
    
  • 2020-12-01 06:22

    If you have control over the data, ensure that it is encoded correctly, i.e. in the encoding that you promised in the XML declaration. For example, if you have:

    <?xml version="1.0" encoding="UTF-8"?>
    

    then you'll need to ensure your data is in UTF-8.
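
    A quick way to test that promise in PHP is mb_check_encoding(). The sketch below assumes the mbstring extension is loaded and that the mismatched data is ISO-8859-1; the helper name is_valid_utf8 and the sample document are made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): is the payload well-formed UTF-8?
function is_valid_utf8(string $data): bool
{
    return mb_check_encoding($data, 'UTF-8');
}

// Declares UTF-8 but contains "é" as the single ISO-8859-1 byte 0xE9.
$latin1 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><n>caf\xE9</n>";

// Assumption: the real source encoding is known to be ISO-8859-1.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
```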

    If you don't have control over the data, yell at those who do.

    You can use a tool like xmllint to check which part(s) of the data are not valid.
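
    If xmllint is not at hand, PHP's bundled libxml functions can report similar positions. This is a rough sketch rather than a replacement for xmllint; xml_parse_errors is a hypothetical helper name and the sample document is invented.

```php
<?php
// Hypothetical helper (not from the thread): collect libxml's parse
// errors as "line:column message" strings.
function xml_parse_errors(string $xml): array
{
    // Buffer errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('%d:%d %s', $error->line, $error->column, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// The unescaped "&" on line 2 is not well-formed XML.
foreach (xml_parse_errors("<root>\n<item>a & b</item>\n</root>") as $message) {
    echo $message, PHP_EOL;
}
```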
