I have some content that is generated by the Drupal CMS that contains strings like:
\"... \\n Proficient knowledge of \\x3cstrong\\x3emedical\\x3c/strong\\x3
Ok, this will do it:
/**
* Converts all UTF-8 Units ( \xXX ) back into ascii characters.
*
* @param string $input String which includes some UTF-8 units
* @return string
*/
function convertUTF8Units($input) {
    $part = "";
    $output = $input;
    $len = strlen($input)-4;
    for($i=0; $i<=$len; $i++) {
        $part = substr($input, $i, 4);
        if ((substr($part, 0, 2) === "\\x")) {
            $raw = hex2bin( $part );
            $raw = trim($raw);
            $pattern = "/\\".$part."/";
            $output = preg_replace($pattern, $raw, $output);
        }
    }
    return $output;
}
/**
* Function to convert a hex code back to ascii string. Taken from
* http://devcorner.georgievi.net/pages/programming/php/hex2bin-php.
*
* @param string $hex_string String of format: \xXX
* @return string
*/
define('HEX2BIN_WS', " \t\n\r");
function hex2bin($hex_string) {
    $pos = 0;
    $result = '';
    while ($pos < strlen($hex_string)) {
        if (strpos(HEX2BIN_WS, $hex_string[$pos]) !== FALSE) {
            $pos++;
        }
        else {
            $code = hexdec(substr($hex_string, $pos, 2));
            $pos = $pos + 2;
            $result .= chr($code);
        }
    }
    return $result;
}
I'm a little fuzzy on exactly what I'm converting to what, though; all I'm sure about is that it passes all the JSON validators now. While pursuing this, UTF-8, UTF-8 units, binary somethings, hex values and ASCII characters have all come up. I can't actually articulate the difference between them, nor can I definitively say what the input, conversions, or output of these functions are.
Can anyone walk me through what my code is doing? :P
What about:
echo iconv('ASCII', 'UTF-8', "Proficient knowledge of \x3cstrong\x3emedical\x3c/strong\x3e terminology");
// returns Proficient knowledge of <strong>medical</strong> terminology
$jsonString = "... \n Yes \n \n \n The \x3cstrong\x3eMedical\x3c/strong\x3e Assistant performs patient screening care under the direction of the \x3cstrong\x3eMedical\x3c/strong\x3e Director/On-site provider including, but not limited to, EKG’s. ...";
$jsonString = str_replace(array('’'), array("'"), $jsonString);
echo iconv('ASCII', 'UTF8//IGNORE//TRANSLIT', nl2br($jsonString));
// returns ... <br>Yes <br><br><br>The <strong>Medical</strong> Assistant performs patient screening care under the direction of the <strong>Medical</strong> Director/On-site provider including, but not limited to, EKG's. ...
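One thing worth checking with the snippets above: inside a double-quoted PHP literal, the parser itself turns \x3c into < before iconv() ever runs, so the conversion may not be what is doing the work. A minimal sketch of the difference, assuming the CMS output arrives with literal backslashes (modelled here with a single-quoted literal) and that the iconv extension is available; the variable names are only for illustration:

```php
<?php
// Double quotes: PHP resolves \x3c and \x3e while parsing the literal.
$parsed = "\x3cstrong\x3e";

// Single quotes: the backslash sequences stay as literal characters,
// which is assumed to be how the string arrives from the CMS.
$literal = '\x3cstrong\x3e';

var_dump($parsed === '<strong>');   // bool(true)
var_dump(strlen($literal));         // int(14)

// Pure ASCII input passes through iconv() unchanged, escapes included.
var_dump(iconv('ASCII', 'UTF-8', $literal) === $literal); // bool(true)
```

If the second case matches your data, the escapes still have to be decoded explicitly.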
\x usually represents hexadecimal, while \u is for Unicode. Your question has nothing to do with Unicode or Unicode code points.
It is safe to use chr() because \xFF is 255 at most, and chr() accepts any value from 0 to 255 (a single byte).
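To make the chr() point concrete, here is a small sketch of what happens to a single \xXX escape: hexdec() turns the two hex digits into an integer, and chr() turns that integer into one byte. The sample digits are the ones from the question.

```php
<?php
// "3c" and "3e" are the hex digits from \x3c and \x3e in the question.
var_dump(hexdec('3c'));        // int(60)
var_dump(chr(hexdec('3c')));   // string(1) "<"
var_dump(chr(hexdec('3e')));   // string(1) ">"

// The largest two-digit hex value still fits in one byte.
var_dump(hexdec('ff'));               // int(255)
var_dump(strlen(chr(hexdec('ff'))));  // int(1)
```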
function weird_answer_to_weird_question($string)
{
    return preg_replace_callback('#\\\\x([[:xdigit:]]{2})#ism', function($matches)
    {
        return chr(hexdec($matches[1]));
    },
    $string);
}
Output:
"... \n Proficient knowledge of medical terminology; typing skills at 40 wpm. Excellent communication and ... which involves access to sensitive and/or confidential medical information. Must demonstrate leadership skills in decision making and ..."
P.S.
You must also do a $string = str_replace('\n', "\n", $string); or similar because json_encode() will double encode that. Thanks to @netcoder for pointing it out.
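Putting the two steps together might look like the sketch below. It assumes the input holds literal backslash sequences (modelled with a single-quoted string) and uses a hypothetical helper name, decode_escapes(); the sample string is a made-up, shortened variant of the one in the question.

```php
<?php
// Hypothetical helper: decode literal \xXX sequences, then literal \n.
function decode_escapes($string)
{
    $string = preg_replace_callback(
        '#\\\\x([[:xdigit:]]{2})#',
        function ($matches) {
            // Two hex digits -> integer -> single byte.
            return chr(hexdec($matches[1]));
        },
        $string
    );
    // Single quotes: '\n' is backslash + n; "\n" is a real newline.
    return str_replace('\n', "\n", $string);
}

$raw = 'Line one \n \x3cstrong\x3emedical\x3c/strong\x3e';
$decoded = decode_escapes($raw);

var_dump($decoded === "Line one \n <strong>medical</strong>"); // bool(true)

// json_encode() now sees a real newline and escapes it only once.
echo json_encode($decoded), "\n";
```

Doing the \n replacement after the \xXX pass keeps each step simple; whether other escapes (such as \t) need the same treatment depends on what the CMS actually emits.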