Reading a text file from the client's disk, on the client, that exceeds the maximum size of a single string in JavaScript

Submitted by 痴心易碎 on 2020-01-21 19:11:08

Question


I'd like to reverse the following steps performed on the client in JavaScript, but am having trouble with the blob.

In an indexedDB database, over an open cursor on an object store index:

  1. Extracted the data object from the database.
  2. Converted the object to a string with JSON.stringify.
  3. Made a new blob { type: 'text/csv' } of the JSON string.
  4. Wrote the blob to an array.
  5. Moved the cursor down one and repeated from step 1.

After the transaction completed successfully, a new blob of the same type was made from the array of blobs.

The reason for doing it this way is that the concatenation of the JSON strings exceeded the maximum permitted size for a single string, so I couldn't concatenate first and make one blob from that large string. However, the array of blobs could be made into a single blob of greater size, approximately 350MB, and downloaded to the client's disk.
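For context, a minimal sketch of that export step is below; the open database handle db, the store name 'qst', the index name, and the handler structure are assumptions for illustration, not my actual code.

   const blobs = [];
   const tx = db.transaction( 'qst', 'readonly' ); // 'qst' store name assumed

   tx.objectStore( 'qst' ).index( 'ndx' ).openCursor().onsuccess = ( event ) =>
     {
       const cursor = event.target.result;
       if ( cursor )
         {
           // Steps 1-4: stringify the record and wrap the string in its own small blob.
           blobs.push( new Blob( [ JSON.stringify( cursor.value ) ], { type: 'text/csv' } ) );
           cursor.continue(); // Step 5: advance the cursor and repeat.
         }
     };

   tx.oncomplete = () =>
     {
       // One blob made from the array of blobs; the oversized concatenated string is never built.
       const merged = new Blob( blobs, { type: 'text/csv' } );
       const url = URL.createObjectURL( merged ); // can then be downloaded to the client's disk
     };

Because the Blob constructor accepts an array of blobs as parts, the large string never has to exist in memory.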

To reverse this process, I thought I could read the blob in, slice it into the component blobs, and then read each blob as a string; but I can't figure out how to do it.

If the file is read as text with the FileReader, the result is one large block of text that cannot be written to a single variable, because it exceeds the maximum string size and throws an allocation size overflow error.

It appeared that reading the file as an array buffer would be an approach allowing for slicing the blob into pieces, but there seems to be an encoding issue of some kind.

Is there a way to reverse the original process as is, or an encoding step that can be added that will allow the array buffer to be converted back into the original strings?

I tried reading over some questions that appeared to be related but, at this point, I don't understand the encoding issues they were discussing. It seems that it is rather complicated to recover a string.
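For illustration only, one possible encoding step would be to decode the array buffer in pieces with TextDecoder using its stream option, which keeps a multi-byte character that happens to straddle two pieces from being corrupted. In this sketch, buffer is assumed to be the ArrayBuffer returned by FileReader.readAsArrayBuffer; none of it is from my actual code.

   const decoder = new TextDecoder( 'utf-8' );
   const strings = [];
   const piece_size = 1024 * 1024; // decode 1MB at a time

   for ( let offset = 0; offset < buffer.byteLength; offset += piece_size )
     {
       const piece = new Uint8Array( buffer, offset, Math.min( piece_size, buffer.byteLength - offset ) );
       // { stream: true } buffers an incomplete multi-byte sequence until the next call.
       strings.push( decoder.decode( piece, { stream: true } ) );
     }
   strings.push( decoder.decode() ); // flush any bytes still buffered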

Thank you for any guidance you can provide.

Additional information after employing the accepted answer

There's certainly nothing special about my code posted below, but I figured I'd share it for those who may be as new to this as I am. It is the accepted answer integrated into the async function used to read the blobs, parse them, and write them to the database.

This method uses very little memory. It is too bad there isn't a way to do the same when writing the data to disk: when writing the database to disk, memory usage increases as the large blob is generated, and the memory is released shortly after the download completes. Using this method to upload the file from the local disk appears to work without ever loading the entire blob into memory before slicing; it is as if the file is read from the disk in slices, so it is very efficient in terms of memory usage.

In my specific case there is still work to be done, because using this approach to write the 50,000 JSON strings totalling 350MB back to the database is rather slow, taking about 7:30 to complete.

Right now each individual string is separately sliced, read as text, and written to the database in a single transaction. Whether slicing the blob into larger pieces comprised of a set of JSON strings, reading them as text in a block, and then writing them to the database in a single transaction will perform more quickly, while still not using a large amount of memory, is something I will need to experiment with and a topic for a separate question.

If I use the alternative loop, which determines the number of JSON strings needed to fill the size const c, slices a blob of that size, reads it as text, and splits it up to parse each individual JSON string, the time to complete is about 1:30 for c = 250,000 through 1,000,000. It appears that parsing a large number of JSON strings slows things down regardless: large blob slices don't translate to large amounts of text being parsed as a single block, and each of the 50,000 strings needs to be parsed individually.

   try
     {
       let i, l, b, result, map, p;
       const c = 1000000;

       // First get the file map from the front of the blob/file.
       // Read the first ten characters to get the length of the map JSON string.
       b = new Blob( [ f.slice(0,10) ], { type: 'text/csv' } );
       result = await read_file( b );
       l = parseInt(result.value);

       // Read the map string and parse it into an array of objects.
       b = new Blob( [ f.slice( 10, 10 + l) ], { type: 'text/csv' } );
       result = await read_file( b );
       map = JSON.parse(result.value);

       l = map.length;
       p = 10 + result.value.length;

       // Using this loop takes about 7:30 to complete.
       for ( i = 1; i < l; i++ )
         {
           b = new Blob( [ f.slice( p, p + map[i].l ) ], { type: 'text/csv' } );
           result = await read_file( b ); // FileReader wrapped in a promise.
           result = await write_qst( JSON.parse( result.value ) ); // Database transaction wrapped in a promise.
           p = p + map[i].l;
           $("#msg").text( result );
         }; // next i

       $("#msg").text( "Successfully wrote all data to the database." );

       i = l = b = result = map = p = null;
     }
   catch(e)
     {
       alert( "error " + e );
     }
   finally
     {
       f = null;
     }



/*
  // Alternative loop that completes in about 1:30 versus 7:30 for above loop.

       for ( i = 1; i < l; i++ )
         {
           let status = false,
               k, j, n = 0, x = 0,
               L = map[i].l,
               a_parse = [];

           if ( L < c ) status = true;
           while ( status )
             {
               if ( i+1 < l && L + map[i+1].l <= c )
                 {
                   L = L + map[i+1].l;
                   i = i + 1;
                   n = n + 1;
                 }
               else
                 {
                   status = false;
                 };
             }; // loop while

           b = new Blob( [ f.slice( p, p + L ) ], { type: 'text/csv' } );
           result = await read_file( b );
           j = i - n;
           for ( k = j; k <= i; k++ )
             {
                a_parse.push( JSON.parse( result.value.substring( x, x + map[k].l ) ) );
                x = x + map[k].l;
             }; // next k
           result = await write_qst_grp( a_parse, i + ' of ' + l );
           p = p + L;
           $("#msg").text( result );
         }; // next i
*/



/*
// I was using this loop when I thought the concern might be that the individual JSON strings were too large,
// but then realized the issue in my case is the opposite: having 50,000 JSON strings of smaller size.

       for ( i = 1; i < l; i++ )
         {
           let x,
               m = map[i].l,
               str = [];
           while ( m > 0 )
             {
               x = Math.min( m, c );
               m = m - c;
               b = new Blob( [ f.slice( p, p + x ) ], { type: 'text/csv' } );
               result = await read_file( b );
               str.push( result.value );
               p = p + x;
             }; // loop while

            result = await write_qst( JSON.parse( str.join("") ) );
            $("#msg").text( result );
            str = null;
         }; // next i
*/
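For anyone as new to this as I am, the read_file and write_qst helpers referenced above are just a FileReader and a database transaction wrapped in promises. Minimal sketches of what they could look like follow; the { value: reader.result } shape, the db handle, the 'qst' store name, and the resolved message strings are assumptions based on how the results are used above, not literal copies of my code.

   function read_file( blob )
     {
       return new Promise( ( resolve, reject ) =>
         {
           const reader = new FileReader();
           reader.onload = () => resolve( { value: reader.result } ); // text of the slice
           reader.onerror = () => reject( reader.error );
           reader.readAsText( blob );
         } );
     }

   function write_qst( obj )
     {
       return new Promise( ( resolve, reject ) =>
         {
           const tx = db.transaction( 'qst', 'readwrite' ); // store name assumed
           tx.objectStore( 'qst' ).put( obj );
           tx.oncomplete = () => resolve( 'Wrote one object to the database.' ); // string shown in #msg
           tx.onerror = () => reject( tx.error );
         } );
     }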

Answer 1:


Funnily enough, you already said in your question what should be done: slice your Blob.

The Blob interface does have a .slice() method. But to use it, you need to keep track of the positions where your merging occurred (these could be stored in another field of your db, or even in a header of your file):

function readChunks({blob, chunk_size}) {
  console.log('full Blob size', blob.size);
  const strings = [];  
  const reader = new FileReader();
  var cursor = 0;
  reader.onload = onsingleprocessed;
  
  readNext();
  
  function readNext() {
    // here is the magic
    const nextChunk = blob.slice(cursor, (cursor + chunk_size));
    cursor += chunk_size;
    reader.readAsText(nextChunk);
  }
  function onsingleprocessed() {
    strings.push(reader.result);
    if(cursor < blob.size) readNext();
    else {
      console.log('read %s chunks', strings.length);
      console.log('excerpt content of the first chunk',
        strings[0].substring(0, 30));
    }
  }
}



// we will do the demo in a Worker so we don't kill the visitor's page
function worker_script() {
  self.onmessage = e => {
    const blobs = [];
    const chunk_size = 1024*1024; // 1MB per chunk
    for(let i=0; i<500; i++) {
      let arr = new Uint8Array(chunk_size);
      arr.fill(97); // only 'a'
      blobs.push(new Blob([arr], {type:'text/plain'}));
    }
    const merged = new Blob(blobs, {type: 'text/plain'});
    self.postMessage({blob: merged, chunk_size: chunk_size});
  }
}
const worker_url = URL.createObjectURL(
  new Blob([`(${worker_script.toString()})()`],
    {type: 'application/javascript'}
  )
);
const worker = new Worker(worker_url);
worker.onmessage = e => readChunks(e.data);
worker.postMessage('do it');


Source: https://stackoverflow.com/questions/50981075/reading-a-text-file-from-the-client-and-on-the-client-that-exceeds-the-maximum-s
