Apache Beam 2.7.0 crashes when UTF-8 decoding French characters

Submitted by 旧巷老猫 on 2020-01-30 05:24:40

Question


I am trying to load a CSV file from a Google Cloud Platform bucket into Datastore. The file contains French characters (accents), and the pipeline fails with a decoding error.

After trying to convert from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually.

The OS I am using has "ascii" as its default encoding, so I manually changed it to "utf-8" in "Anaconda3/envs/py27/lib/site.py":

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")

I've tried locally with a test file, by printing and then writing a string with accents into a file, and it worked!

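import codecs

# Python 2 byte string; assumes this source file is saved as UTF-8
string = 'naïve café'
test_decode = codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)

with open('./test.txt', 'w') as outfile:
    # Writing unicode here relies on the utf-8 default encoding set in site.py
    outfile.write(test_decode)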

But no luck with apache_beam...

Then I tried manually editing "/usr/lib/python2.7/encodings/utf_8.py", replacing "strict" with "ignore" in the call to codecs.utf_8_decode:

def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)

but I realized that apache_beam does not use this file, or at least does not take my changes into account.

Any ideas on how to deal with this?

Please find the error message below:

Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
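
The last line of the traceback can be reproduced without Beam. This is a minimal sketch, assuming the CSV was saved as Latin-1 (ISO-8859-1) rather than UTF-8; the sample text and the helper name `try_utf8` are made up for illustration. In Latin-1, "é" is the single byte 0xe9, which a UTF-8 decoder reads as the start of a multi-byte sequence and then rejects because the next byte is not a continuation byte.

```python
# Hypothetical sample: "café crème" encoded as Latin-1, not UTF-8.
latin1_bytes = u"caf\u00e9 cr\u00e8me".encode("latin-1")


def try_utf8(raw):
    """Return (text, None) on success, or (None, error) if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_utf8(latin1_bytes)
if error is not None:
    # Slice instead of indexing so the result is a bytes object on Python 2 and 3.
    bad_byte = latin1_bytes[error.start:error.start + 1]
    print("decode failed at offset %d on byte %r: %s"
          % (error.start, bad_byte, error.reason))
```

If the real file behaves the same way, the problem is likely the encoding of the file itself rather than the Python default encoding on the machine that submits the job.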

Answer 1:


Try writing a CustomCoder class that ignores errors while decoding (note that "ignore" silently drops any bytes that cannot be decoded):

from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True

Then read and write the files, passing coder=CustomCoder():

from apache_beam.io import ReadFromText, WriteToText

lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())

# More processing code here...

output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
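
One caveat with "ignore": accented characters stored as Latin-1 bytes would vanish from the output rather than being converted. A hedged alternative, sketched below, tries UTF-8 first and falls back to Latin-1. The helper name `decode_with_fallback` is hypothetical, and the Latin-1 fallback is an assumption about the input file, not something the question confirms. The body of `CustomCoder.decode` could call a helper like this.

```python
def decode_with_fallback(raw, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to another encoding on failure.

    The fallback encoding is an assumption about the data. Latin-1 never
    raises, because every byte value maps to a code point.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Hypothetical record containing Latin-1 encoded accents.
record = u"caf\u00e9 cr\u00e8me".encode("latin-1")

lossy = record.decode("utf-8", "ignore")  # undecodable bytes are dropped
recovered = decode_with_fallback(record)  # accented letters are preserved
```

The trade-off is that a fallback guesses the encoding per record, so it is only appropriate when the data is known to be one of the two encodings.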



Answer 2:


This error, "UnicodeDecodeError: 'utf8' codec can't decode byte", means that your CSV file still contains bytes that the decoder does not recognize as valid UTF-8.

The easiest solution is to convert and validate the CSV input file so that it contains no UTF-8 errors before loading it into Datastore. A simple online UTF-8 validator can check this.
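
The same check can also be done locally instead of with an online tool. This is a minimal sketch, assuming the CSV has been downloaded from the bucket to a local path; the function name `find_invalid_utf8_lines` and the sample file are hypothetical.

```python
import io
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset_in_line) for each line that is not valid UTF-8."""
    problems = []
    # Read in binary mode so Python does not attempt any decoding itself.
    with io.open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append((line_number, exc.start))
    return problems


# Demonstration on a throwaway file whose second line is Latin-1 encoded.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
os.write(fd, b"id,name\n1,caf\xe9\n2,ok\n")
os.close(fd)

problems = find_invalid_utf8_lines(sample_path)
os.remove(sample_path)
print(problems)
```

Only the first invalid byte of each line is reported, which is usually enough to locate the offending rows.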

If you need to convert Latin-1 to UTF-8 in Python, you can do it like this:

string.decode('iso-8859-1').encode('utf8')
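
That one-liner is Python 2 syntax and works on a byte string already in memory. For a whole file, a sketch along these lines could transcode the CSV before it is uploaded to the bucket. It assumes the source really is Latin-1 (ISO-8859-1); the function name `transcode_to_utf8` and the file names are hypothetical.

```python
import io
import os
import shutil
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Rewrite src_path as UTF-8, assuming it is encoded in src_encoding."""
    # newline="" disables newline translation so line endings are preserved.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Demonstration with throwaway files.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "latin1.csv")
dst_path = os.path.join(workdir, "utf8.csv")
with io.open(src_path, "wb") as handle:
    handle.write(u"id,name\n1,caf\u00e9\n".encode("iso-8859-1"))

transcode_to_utf8(src_path, dst_path)

with io.open(dst_path, "rb") as handle:
    converted = handle.read()
shutil.rmtree(workdir)
```

Using `io.open` keeps the sketch usable on both Python 2.7 and Python 3, since the built-in `open` in Python 2 has no `encoding` parameter.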


Source: https://stackoverflow.com/questions/52853497/apache-beam-2-7-0-craches-in-utf-8-decoding-french-characters
