MalformedInputException when trying to read entire file

问题

I have a 132 kb file (you can't really say it's big) and I'm trying to read it from the Scala REPL, but I can't read past 2048 char because it gives me a java.nio.charset.MalformedInputException exception

These are the steps I take:

val it = scala.io.Source.fromFile("docs/categorizer/usig_calles.json") // this is ok
it.take(2048).mkString // this is ok too
it.take(1).mkString // BANG!

java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)

Any idea what could be wrong?

Apparently the problem was that the file was not UTF encoded

I saved it as UTF and everything works, I just issue mkString on the iterator and it retrieves the whole contents of the file

The strange thing is that the error only aroused passing the first 2048 chars...

回答1:

Cannot be certain without the file, but the documentation on the exception indicates it is thrown "when an input byte sequence is not legal for given charset, or an input character sequence is not a legal sixteen-bit Unicode sequence." (MalformedInputException javadoc)

I suspect that at 2049 is the first character encountered that is not valid with whatever the default JVM character encoding is in you environment. Consider explicitly stating the character encoding of the file using one of the overloads to fromFile.

If the application will be cross platform, you should know that the default character encoding on the JVM does vary by platform, so if you operate with a specific encoding you either want to explicitly set it as a command line parameter when launching of your application, or specify it at each call using the appropriate overload.

回答2:

Any time you call take twice on the same iterator, all bets are off. Iterators are inherently imperative, and mixing them with functional idioms is dicey at best. Most of the iterators you come across in the standard library tend to be fairly well-behaved in this respect, but once you've used take, or drop, or filter, etc., you're in undefined-behavior land, and in principle anything could happen.

From the docs:

It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext ...

def take(n: Int): Iterator[A] ...

Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.

So it's probably not worth trying to track down exactly what went wrong here.

回答3:

If you just wish to convert the bytes to plain Latin data:

// File:
io.Source.fromFile(file)(io.Codec.ISO8859).mkString

// InputStream:
io.Source.fromInputStream(System.io)(io.Codec.ISO8859).mkString

来源：https://stackoverflow.com/questions/13327536/malformedinputexception-when-trying-to-read-entire-file

标签

file

scala

file-io