Recover ArrayBuffer from xhr.responseText

痴心易碎 提交于 2020-01-04 06:04:14

问题


I need to get an array buffer from an http request sending me a base64 answer. For this request, I can't use XMLHttpRequest.responseType="arraybuffer".

The response I get from this request is read through xhr.responseText. Hence it's encoded as a DOMString. I'm trying to get it back as an array buffer.

I've tried to go back to the base64 from the DOMString using btoa(mysString) or window.btoa(unescape(encodeURIComponent(str))) but the first option just fails, whereas the second option doesn't give the same base64. Example of the first few characters from each base64:

Incoming : UEsDBBQACAgIACp750oAAAAAAAAAAAAAAAALAAAAX3JlbHMvLnJlbH

After the second processing: UEsDBBQACAgIAO+/ve+/ve+/vUoAAAAAAAAAAAAAAAALAAAAX3JlbHMvLnJlbH

As you can see a part of it is similar, but some parts are way off. What am I missing to get it right?


回答1:


I have got same issue too.

The solution (I ran at Chrome(68.0.3440.84))

let url = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='

let iso_8859_15_table = { 338: 188, 339: 189, 352: 166, 353: 168, 376: 190, 381: 180, 382: 184, 8364: 164 }

function iso_8859_15_to_uint8array(iso_8859_15_str) {
    let buf = new ArrayBuffer(iso_8859_15_str.length);
    let bufView = new Uint8Array(buf);
    for (let i = 0, strLen = iso_8859_15_str.length; i < strLen; i++) {
        let octet = iso_8859_15_str.charCodeAt(i);
        if (iso_8859_15_table.hasOwnProperty(octet))
            octet = iso_8859_15_table[octet]
        bufView[i] = octet;
        if(octet < 0 || 255 < octet)
            console.error(`invalid data error`)
    }
    return bufView
}

req = new XMLHttpRequest();
req.overrideMimeType('text/plain; charset=ISO-8859-15');
req.onload = () => {
    console.log(`Uint8Array : `)
    var uint8array = iso_8859_15_to_uint8array(req.responseText)
    console.log(uint8array)
}
req.open("get", url);
req.send();

Below is explanation what I learned to solve it.

Explanation

Why some parts are way off?

because TextDecoder cause data loss (Your case is utf-8).

For example, let's talk about UTF-8

  • variable width character encoding for Unicode.

  • It has rules(This will become problem.) for reasons such as variable length characteristics and ASCII compatibility, etc.

  • so, decoder may replace a non-conforming characters to replacement character such as U+003F(?, Question mark) or U+FFFD(�, Unicode replacement character).

  • in utf-8 case, 0~127 of values are stable, 128~255 of values are unstable. 128~255 will converted to U+FFFD

Are other Text Decoders safe except UTF-8?

No. In most cases, not safe from rules.

UTF-8 is also unrecoverable. (128~255 are set to U+FFFD)

If the binary data and the decoded result can be corresponded to one-to-one, they can be recovered.

How to solve it?

  1. Finds recoverable Text Decoders.
  2. Force MIME type to recoverable charset of the incoming data. xhr_object.overrideMimeType('text/plain; charset=ISO-8859-15')
  3. Recover binary data from string with recover table when received.

Finds recoverable Text Decoders.

To recover, avoid the situation when decoded results' are duplicated.

The following code is a simple example, so there may be missing recoverable text decoders because it only consider Uint8Array.

let bufferView = new Uint8Array(256);
for (let i = 0; i < 256; i++)
    bufferView[i] = i;

let recoverable = []
let decoding = ['utf-8', 'ibm866', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-8i', 'iso-8859-10', 'iso-8859-13', 'iso-8859-14', 'iso-8859-15', 'iso-8859-16', 'koi8-r', 'koi8-u', 'macintosh', 'windows-874', 'windows-1250', 'windows-1251', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'x-mac-cyrillic', 'gbk', 'gb18030', 'hz-gb-2312', 'big5', 'euc-jp', 'iso-2022-jp', 'shift-jis', 'euc-kr', 'iso-2022-kr', 'utf-16be', 'utf-16le', 'x-user-defined', 'ISO-2022-CN', 'ISO-2022-CN-ext']
for (let dec of decoding) {
    try {
        let decodedText = new TextDecoder(dec).decode(bufferView);
        let loss = 0
        let recoverTable = {}
        let unrecoverable = 0
        for (let i = 0; i < decodedText.length; i++) {
            let charCode = decodedText.charCodeAt(i)
            if (charCode != i)
                loss++

            if (!recoverTable[charCode])
                recoverTable[charCode] = i
            else
                unrecoverable++
        }
        let tableCnt = 0
        for (let props in recoverTable) {
            tableCnt++
        }
        if (tableCnt == 256 && unrecoverable == 0){
            recoverable.push(dec)
            setTimeout(()=>{
                console.log(`[${dec}] : err(${loss}/${decodedText.length}, ${Math.round(loss / decodedText.length * 100)}%) alive(${tableCnt}) unrecoverable(${unrecoverable})`)
            },10)
        }
        else {
            console.log(`!! [${dec}] : err(${loss}/${decodedText.length}, ${Math.round(loss / decodedText.length * 100)}%) alive(${tableCnt}) unrecoverable(${unrecoverable})`)
        }
    } catch (e) {
        console.log(`!! [${dec}] : not supported.`)
    }
}

setTimeout(()=>{
    console.log(`recoverable Charset : ${recoverable}`)
}, 10)

In my console, this return

recoverable Charset : ibm866,iso-8859-2,iso-8859-4,iso-8859-5,iso-8859-10,iso-8859-13,iso-8859-14,iso-8859-15,iso-8859-16,koi8-r,koi8-u,macintosh,windows-1250,windows-1251,windows-1252,windows-1254,windows-1256,windows-1258,x-mac-cyrillic,x-user-defined

And I used iso-8859-15 at beginning of this answer. (It has Smallest table size.)


Additional test) Comparison between UTF-8's and ISO-8859-15's result

Check U+FFFD is really disappeared when using ISO-8859-15.

function requestAjax(url, charset) {
    let req = new XMLHttpRequest();
    if (charset)
        req.overrideMimeType(`text/plain; charset=${charset}`);
    else
        charset = 'utf-8';
    req.open('get', url);
    req.onload = () => {
        console.log(`==========\n${charset}`)
        console.log(`${req.responseText.split('', 50)}\n==========`);
        console.log('\n')
    }
    req.send();
}

var url = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==';
requestAjax(url, 'ISO-8859-15');
requestAjax(url);

Bottom line

  • Recover binary data to, from string needs some additional job.
    • Find recoverable text encoder/decoder.
    • Make a recover table
    • Recover with the table.
    • (You can refer to the very top of code.)
  • For use this trick, force MIME type of incoming data to desired charset.


来源:https://stackoverflow.com/questions/44974003/recover-arraybuffer-from-xhr-responsetext

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!