How to read a text file with mixed encodings in Scala or Java?

扶醉桌前 提交于 2019-11-28 16:19:43

This is how I managed to do it with java:

    FileInputStream input;
    String result = null;
    try {
        input = new FileInputStream(new File("invalid.txt"));
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader( reader );
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while( line != null ) {
            sb.append( line );
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch( IOException e ) {
        e.printStackTrace();
    }

    System.out.println(result);

The invalid file is created with bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

Which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.

With .REPLACE you see the standard unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"

Without specifying .onMalformedInput, you get

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)

The solution for scala's Source (based on @Esailija answer):

def toSource(inputStream:InputStream): scala.io.BufferedSource = {
    import java.nio.charset.Charset
    import java.nio.charset.CodingErrorAction
    val decoder = Charset.forName("UTF-8").newDecoder()
    decoder.onMalformedInput(CodingErrorAction.IGNORE)
    scala.io.Source.fromInputStream(inputStream)(decoder)
}

Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder:

val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
Source.fromFile(filename)(decoder).getLines().toList

The problem with ignoring invalid bytes is then deciding when they're valid again. Note that UTF-8 allows variable-length byte encodings for characters, so if a byte is invalid, you need to understand which byte to start reading from to get a valid stream of characters again.

In short, I don't think you'll find a library which can 'correct' as it reads. I think a much more productive approach is to try and clean that data up first.

Harry Pehkonen

I'm switching to a different codec if one fails.

In order to implement the pattern, I got inspiration from this other stackoverflow question.

I use a default List of codecs, and recursively go through them. If they all fail, I print out the scary bits:

private val defaultCodecs = List(
  io.Codec("UTF-8"),
  io.Codec("ISO-8859-1")
)

def listLines(file: java.io.File, codecs:Iterable[io.Codec] = defaultCodecs): Iterable[String] = {
  val codec = codecs.head
  val fileHandle = scala.io.Source.fromFile(file)(codec)
  try {
    val txtArray = fileHandle.getLines().toList
    txtArray
  } catch {
    case ex: Exception => {
      if (codecs.tail.isEmpty) {
        println("Exception:  " + ex)
        println("Skipping file:  " + file.getPath)
        List()
      } else {
        listLines(file, codecs.tail)
      }
    }
  } finally {
    fileHandle.close()
  }
}

I'm just learning Scala, so the code may not be optimal.

A simple solution would be to interpret your data stream as ASCII, ignore all non-text characters. However, you would lose even valid encoded UTF8-characters. Don't know if that is acceptable for you.

EDIT: If you know in advance which columns are valid UTF-8, you could write your own CSV parser that can be configured which strategy to use on what column.

Use ISO-8859-1 as the encoder; this will just give you byte values packed into a string. This is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse the line as a block.)

Once you have the individual fields as separate strings, you can try

new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")

to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).

Edit: you will have to use java.nio.charset.Charset.CharsetDecoder if you want to detect errors. Mapping to UTF-8 this way will just give you 0xFFFF in your string when there's an error.

val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder

// By default will throw a MalformedInputException if encoding fails
decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!