How to read a text file with mixed encodings in Scala or Java?

Backend · 7 answers · 1741 views
Asked by 日久生厌, 2020-12-07 16:00

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file: it is mostly UTF-8, but some of the field values are in different encodings, so there is no single encoding in which the whole file is valid. I need to parse it anyway.

7 Answers
  • 2020-12-07 16:07

    Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder:

    import java.nio.charset.CodingErrorAction
    import scala.io.{Codec, Source}

    val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
    Source.fromFile(filename)(decoder).getLines().toList
    
  • 2020-12-07 16:10

    This is how I managed to do it with Java:

        // Imports needed:
        // import java.io.*;
        // import java.nio.charset.*;
        String result = null;
        try {
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
            decoder.onMalformedInput(CodingErrorAction.IGNORE);
            // try-with-resources closes the reader (and the underlying stream)
            // even if an exception is thrown
            try (BufferedReader bufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream("invalid.txt"), decoder))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = bufferedReader.readLine()) != null) {
                    sb.append(line);
                }
                result = sb.toString();
            }
        } catch (IOException e) { // FileNotFoundException is a subclass of IOException
            e.printStackTrace();
        }

        System.out.println(result);

    The invalid file is created with bytes:

    0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94
    

    Which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.

    With .REPLACE you see the standard Unicode replacement character (U+FFFD) being used:

    //"h�ellö� wö�rld�"
    

    With .IGNORE, you see the invalid bytes ignored:

    //"hellö wörld"
    

    Without specifying .onMalformedInput, you get

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
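    All three behaviours can be reproduced with a short self-contained program (the class name `MalformedDemo` is mine) that decodes the byte sequence above with each error action:

    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class MalformedDemo {
        // "hellö wörld" in UTF-8 with 4 invalid bytes (0x80, 0xFE, 0x9C, 0x94) mixed in
        static final byte[] BYTES = {0x68, (byte) 0x80, 0x65, 0x6C, 0x6C,
                (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20, 0x77,
                (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94};

        static String decode(CodingErrorAction action) throws CharacterCodingException {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .decode(ByteBuffer.wrap(BYTES))
                    .toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(decode(CodingErrorAction.IGNORE));  // hellö wörld
            System.out.println(decode(CodingErrorAction.REPLACE)); // h�ellö� wö�rld�
            try {
                decode(CodingErrorAction.REPORT); // REPORT is the default
            } catch (CharacterCodingException e) {
                System.out.println(e); // MalformedInputException: Input length = 1
            }
        }
    }
    ```
    
    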
    
  • 2020-12-07 16:11

    A simple solution would be to interpret your data stream as ASCII and ignore all non-ASCII characters. However, you would then lose even validly encoded UTF-8 characters. I don't know whether that is acceptable for you.

    EDIT: If you know in advance which columns are valid UTF-8, you could write your own CSV parser that can be configured with a decoding strategy per column.
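    A hedged sketch of that idea (the class name, column layout, and naive comma split are mine, and the line is assumed to have been read as ISO-8859-1 so each char preserves one original byte): decode each field with its own charset.

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class PerColumnCsv {
        // Decode one raw CSV line, applying a per-column charset.
        // rawLine must have been read as ISO-8859-1 so each char holds one original byte.
        static List<String> decodeFields(String rawLine, Charset[] columnCharsets) {
            String[] rawFields = rawLine.split(",", -1); // naive split: no quoting support
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < rawFields.length; i++) {
                byte[] bytes = rawFields[i].getBytes(StandardCharsets.ISO_8859_1);
                Charset cs = i < columnCharsets.length ? columnCharsets[i]
                                                       : StandardCharsets.UTF_8;
                fields.add(new String(bytes, cs));
            }
            return fields;
        }

        public static void main(String[] args) {
            // Say column 0 is UTF-8 and column 1 is Latin-1:
            String raw = new String(
                    new byte[]{0x68, (byte) 0xC3, (byte) 0xB6, 0x2C, (byte) 0xF6},
                    StandardCharsets.ISO_8859_1);
            System.out.println(decodeFields(raw,
                    new Charset[]{StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1}));
            // [hö, ö]
        }
    }
    ```
    
    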

  • 2020-12-07 16:13

    I switch to a different codec if one fails.

    To implement the pattern, I took inspiration from this other Stack Overflow question.

    I use a default List of codecs and recursively work through them. If they all fail, I print out the scary bits:

    private val defaultCodecs = List(
      io.Codec("UTF-8"),
      io.Codec("ISO-8859-1")
    )
    
    def listLines(file: java.io.File, codecs:Iterable[io.Codec] = defaultCodecs): Iterable[String] = {
      val codec = codecs.head
      val fileHandle = scala.io.Source.fromFile(file)(codec)
      try {
        val txtArray = fileHandle.getLines().toList
        txtArray
      } catch {
        case ex: Exception => {
          if (codecs.tail.isEmpty) {
            println("Exception:  " + ex)
            println("Skipping file:  " + file.getPath)
            List()
          } else {
            listLines(file, codecs.tail)
          }
        }
      } finally {
        fileHandle.close()
      }
    }
    

    I'm just learning Scala, so the code may not be optimal.

  • 2020-12-07 16:16

    The solution for Scala's Source (based on @Esailija's answer):

    def toSource(inputStream:InputStream): scala.io.BufferedSource = {
        import java.nio.charset.Charset
        import java.nio.charset.CodingErrorAction
        val decoder = Charset.forName("UTF-8").newDecoder()
        decoder.onMalformedInput(CodingErrorAction.IGNORE)
        scala.io.Source.fromInputStream(inputStream)(decoder)
    }
    
  • 2020-12-07 16:22

    Use ISO-8859-1 as the encoding; this will just give you byte values packed into a string, one char per byte. That is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to parse each line as a block.)

    Once you have the individual fields as separate strings, you can try

    new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")
    

    to generate the string with the proper encoding (use the appropriate encoding name per field, if you know it).

    Edit: you will have to use java.nio.charset.CharsetDecoder if you want to detect errors. Mapping to UTF-8 this way will just give you U+FFFD (the replacement character) in your string when there's an error.

    val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder
    
    // By default will throw a MalformedInputException if encoding fails
    decoder.decode( java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1")) ).toString
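    A minimal self-contained sketch of both points (the class name and sample bytes are mine): the ISO-8859-1 round-trip recovers the correct text, while a plain CharsetDecoder reports bad bytes instead of silently replacing them.

    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.StandardCharsets;

    public class RoundTrip {
        public static void main(String[] args) {
            byte[] raw = {0x68, (byte) 0xC3, (byte) 0xB6, 0x21}; // "hö!" in UTF-8
            // Byte-preserving read: every byte maps to exactly one char.
            String latin1 = new String(raw, StandardCharsets.ISO_8859_1); // mojibake: "hÃ¶!"
            // Re-decode the recovered bytes as UTF-8.
            String utf8 = new String(latin1.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
            System.out.println(utf8); // hö!

            // To detect (rather than silently replace) bad bytes, use a CharsetDecoder,
            // which throws by default:
            try {
                StandardCharsets.UTF_8.newDecoder()
                        .decode(ByteBuffer.wrap(new byte[]{(byte) 0x80}));
            } catch (CharacterCodingException e) {
                System.out.println("malformed"); // prints "malformed"
            }
        }
    }
    ```
    
    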
    