Blob charset for CSV file

Submitted 2020-11-29 09:58:33

Question


I would like to create a CSV file using blob. The file should be encoded in ANSI, but it doesn't work.

var blob = new Blob(["\ufeff", csvFile], { type: 'text/csv;charset=windows-1252;' });

The file is always created with UTF-8 encoding.


Answer 1:


Passing a USVString (i.e. a JavaScript string) to the Blob constructor will automatically encode it to UTF-8 in the Blob's data.

The type option is only used by resource fetchers; it serves to mimic the Content-Type header of an HTTP response.
For instance, if you fetch or serve that Blob through a blob: URL, this type value will be used; similarly, if you call a FileReader's readAsText( blob ) method without the optional encoding parameter, the charset= info there may be used.

But this type option doesn't change the content of the Blob's data at all.

(async ()=> {

  const data = "é";
  const no_type = new Blob( [ data ] );
  const csv_windows1252 = new Blob( [ data ], { type: "text/csv;charset=Windows-1252" } );
  const image_png = new Blob( [ data ], { type: "image/png" } );

  // read as ArrayBuffer to see the exact binary content
  console.log( "no_type:", await hexDump( no_type ) ); // C3A9
  console.log( "csv_windows1252:", await hexDump( csv_windows1252 ) ); // C3A9
  console.log( "image_png:", await hexDump( image_png ) ); // C3A9

})();

async function hexDump( blob ) {
  const buf = await blob.arrayBuffer();
  const view = new Uint8Array( buf );
  const arr = [ ... view ];
  return arr.map( (val) => val.toString( 16 ).padStart( 2, "0" ) )
    .join( "" ).toUpperCase();
}

As you can see in this snippet, no matter the type option, all these Blobs hold exactly the same bytes, C3 A9, which is the UTF-8 representation of the é (U+00E9) character.
In ANSI (Windows-1252), this character is represented by the single byte 0xE9, so if our Blob held our text encoded in ANSI, it would contain that byte instead.

One way to see it is to use a TextDecoder and try to decode both Blobs using both encodings:

const UTF8Content = new Uint8Array( [ 0xC3, 0xA9 ] );
const ANSIContent = new Uint8Array( [ 0xE9 ] );

const UTF8Decoder = new TextDecoder( "utf-8" );
const ANSIDecoder = new TextDecoder( "windows-1252" );

console.log( "UTF8-content decoded as UTF8",
  UTF8Decoder.decode( UTF8Content )
); // é
console.log( "UTF8-content decoded as ANSI",
  ANSIDecoder.decode( UTF8Content )
); // Ã©
console.log( "ANSI-content decoded as UTF8",
  UTF8Decoder.decode( ANSIContent )
); // �
console.log( "ANSI-content decoded as ANSI",
  ANSIDecoder.decode( ANSIContent )
); // é

So for what you wish, you'd need to generate your Blob from a TypedArray that already contains your data encoded in ANSI.
There used to be an option in the TextEncoder API to encode from USVStrings to arbitrary encodings, but it has been removed from the specs and from browsers.

So the easiest is to use a library to perform the conversion. Here, I'll use the inexorabletash/text-encoding polyfill, loaded via the <script> tags below:

const text = "é";
const data = new TextEncoder( "windows-1252", {
  NONSTANDARD_allowLegacyEncoding: true
} ).encode( text ); // now `data` is a Uint8Array

const blob = new Blob( [ data ], { type: "text/csv" } ); // here you have your ANSI Blob

// Just to be sure
hexDump( blob ).then( console.log ); // E9

async function hexDump( blob ) {
  const buf = await blob.arrayBuffer();
  const view = new Uint8Array( buf );
  const arr = [ ...view ];
  return arr.map( (val) => val.toString( 16 ) )
    .join( "" ).toUpperCase();
}
<script>
  // we need to force installation of the library
  // by removing the built-in API
  window.TextEncoder = null;
</script>
<script src="https://cdn.jsdelivr.net/gh/inexorabletash/text-encoding/lib/encoding-indexes.js"></script>
<script src="https://cdn.jsdelivr.net/gh/inexorabletash/text-encoding/lib/encoding.js"></script>

A fiddle with the download link is available, since Stack Snippets aren't allowed to trigger downloads anymore.
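If pulling in a polyfill is not an option, Windows-1252 is simple enough to hand-roll: it matches the Unicode code points for 0x00-0x7F and 0xA0-0xFF, and only the 0x80-0x9F range needs a lookup table. A minimal sketch (the `encodeWindows1252` name is mine, not a standard API):

```javascript
// Code points whose Windows-1252 byte differs from their Unicode value
// (the 0x80-0x9F range: euro sign, curly quotes, dashes, etc.).
const CP1252_EXTRA = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84,
  0x2026: 0x85, 0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88,
  0x2030: 0x89, 0x0160: 0x8A, 0x2039: 0x8B, 0x0152: 0x8C,
  0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92, 0x201C: 0x93,
  0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B,
  0x0153: 0x9C, 0x017E: 0x9E, 0x0178: 0x9F
};

function encodeWindows1252( str ) {
  const out = new Uint8Array( str.length );
  let i = 0;
  for ( const ch of str ) {
    const cp = ch.codePointAt( 0 );
    if ( cp <= 0x7F || ( cp >= 0xA0 && cp <= 0xFF ) ) {
      out[ i++ ] = cp; // same value in Unicode and Windows-1252
    } else if ( cp in CP1252_EXTRA ) {
      out[ i++ ] = CP1252_EXTRA[ cp ];
    } else {
      throw new RangeError( `U+${ cp.toString( 16 ) } has no Windows-1252 mapping` );
    }
  }
  return out.subarray( 0, i );
}

const ansiBlob = new Blob( [ encodeWindows1252( "é" ) ], { type: "text/csv" } );
```

Like the polyfill, this throws on characters that have no Windows-1252 mapping, which matches the note below.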


Important note:

ANSI supports only a limited set of characters; some characters that a USVString can hold can't be mapped to ANSI, so make sure your input contains only mappable characters, or the encoder will throw:

const text = "😱"; // can't be mapped to ANSI
const data = new TextEncoder( "windows-1252", {
  NONSTANDARD_allowLegacyEncoding: true
} ).encode( text ); // throws
<script>
  window.TextEncoder = null;
</script>
<script src="https://cdn.jsdelivr.net/gh/inexorabletash/text-encoding/lib/encoding-indexes.js"></script>
<script src="https://cdn.jsdelivr.net/gh/inexorabletash/text-encoding/lib/encoding.js"></script>

P.S.: the \uFEFF character you prepended to the Blob's data in your code is a BOM (byte-order mark). It only helps a reader know the intended endianness of UTF-16 encoded text; it doesn't re-encode the data that follows in any way, and it doesn't help at all with non-UTF-16 files.
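You can verify this yourself: the prepended U+FEFF gets UTF-8 encoded like any other character, producing the three-byte UTF-8 BOM EF BB BF rather than any UTF-16 marker:

```javascript
// U+FEFF run through a UTF-8 encoder (which is what the Blob
// constructor does internally) yields the UTF-8 BOM bytes EF BB BF.
const bytes = new TextEncoder().encode( "\uFEFF" + "é" );
const hex = [ ...bytes ].map( (b) => b.toString( 16 ).padStart( 2, "0" ) ).join( " " );
console.log( hex ); // "ef bb bf c3 a9"
```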



Source: https://stackoverflow.com/questions/56092285/blob-charset-for-csv-file
