Question
I am trying to write a CSV file from a Google Cloud Platform bucket into Datastore. The file contains French characters/accents, and I get an error message about decoding.
After trying to encode and decode from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs), I tried to change things manually...
The OS I am using has "ascii" as its default encoding, so I manually changed it to utf-8 in "Anaconda3/envs/py27/lib/site.py".
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    sys.setdefaultencoding("utf-8")
I've tried locally with a test file, by printing and then writing a string with accents into a file, and it worked!
string='naïve café'
test_decode=codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)
with open('./test.txt', 'w') as outfile:
    outfile.write(test_decode)
But no luck with apache_beam...
Then I tried to manually change "/usr/lib/python2.7/encodings/utf_8.py" and put "ignore" instead of "strict" in codecs.utf_8_decode:
def decode(input, errors='ignore'):
    return codecs.utf_8_decode(input, errors, True)
but I've realized that apache_beam does not use this file, or at least does not take any changes to it into account.
Any ideas how to deal with it?
Please find the error message below:
Traceback (most recent call last):
  File "etablissementsFiness.py", line 146, in <module>
    dataflow(run_locally)
  File "etablissementsFiness.py", line 140, in dataflow
    | 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipeline.py", line 414, in __exit__
    self.run().wait_until_finish()
  File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
    (self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 156, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
    def start(self):
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
    with self.spec.source.reader() as reader:
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
    for value in reader:
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 201, in read_records
    yield self._coder.decode(record)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", line 307, in decode
    return value.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
Answer 1:
Try to write a CustomCoder class and "ignore" any errors while decoding:
from apache_beam.coders.coders import Coder

class CustomCoder(Coder):
    """A custom coder used for reading and writing strings as UTF-8."""

    def encode(self, value):
        return value.encode("utf-8", "replace")

    def decode(self, value):
        return value.decode("utf-8", "ignore")

    def is_deterministic(self):
        return True
Then, read and write the files using coder=CustomCoder():
lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())
# More processing code here...
output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
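One thing to be aware of with "ignore": it silently drops the bytes it cannot decode, so accented letters stored as Latin-1 disappear from the output. The byte 0xe9 in the traceback is "é" in Latin-1, which suggests (but does not prove) that the file is Latin-1 rather than UTF-8. Below is a minimal sketch of a variant that tries strict UTF-8 first and falls back to Latin-1. The name FallbackCoder is hypothetical, and the try/except around the import is only there so the sketch can be run without Beam installed.

```python
try:
    from apache_beam.coders.coders import Coder
except ImportError:
    # Assumption: stand-in base class so the sketch runs without Beam.
    Coder = object


class FallbackCoder(Coder):
    """Hypothetical coder: strict UTF-8 first, Latin-1 as a fallback."""

    def encode(self, value):
        return value.encode("utf-8")

    def decode(self, value):
        try:
            return value.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to a code point, so this cannot fail,
            # but it is only correct if the input really is Latin-1.
            return value.decode("latin-1")

    def is_deterministic(self):
        return True


raw = b"caf\xe9"  # "café" encoded as Latin-1, not valid UTF-8

ignored = raw.decode("utf-8", "ignore")  # the accented letter is dropped
recovered = FallbackCoder().decode(raw)  # the accented letter is kept
```

It would be passed the same way as above, with coder=FallbackCoder(). If the whole file is known to be Latin-1, decoding everything as Latin-1 is simpler than a per-line fallback.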
Answer 2:
This error, "UnicodeDecodeError: 'utf8' codec can't decode byte", means that your CSV file still contains some bytes that the decoder does not recognize as valid UTF-8.
The easiest solution is to convert and validate the CSV input file so that it contains no UTF-8 errors before submitting it to Datastore. A simple online UTF-8 validator can check it.
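The same check can also be done locally instead of with an online tool. Here is a minimal sketch, where first_utf8_error is a hypothetical helper name: it attempts a strict decode and reports the offset of the first offending byte, which is the same "position" the traceback prints.

```python
def first_utf8_error(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


valid = u"caf\xe9 au lait".encode("utf-8")  # proper UTF-8 bytes
invalid = b"caf\xe9 au lait"                # Latin-1 bytes

ok = first_utf8_error(valid)
bad = first_utf8_error(invalid)
```

For a real file, you would read it in binary mode and pass the bytes to the helper.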
If you need to convert latin-1 to UTF-8 in Python, you can do it like this:
string.decode('iso-8859-1').encode('utf8')
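For a whole file rather than a single string, the conversion could look roughly like the sketch below. It assumes the source really is ISO-8859-1; the function name transcode_to_utf8 and the file names are made up for the example. io.open is used so that it behaves the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding="iso-8859-1"):
    """Rewrite src_path as UTF-8, assuming it is encoded as src_encoding."""
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
            io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Small round trip on temporary files (hypothetical sample data).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "latin1.csv")
dst_path = os.path.join(tmp_dir, "utf8.csv")

with io.open(src_path, "wb") as f:
    f.write(b"id;nom\n1;caf\xe9\n")  # "café" stored as Latin-1

transcode_to_utf8(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    converted = f.read()
```

The converted file can then be uploaded to the bucket in place of the original, so the default UTF-8 coder in ReadFromText no longer fails.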
Source: https://stackoverflow.com/questions/52853497/apache-beam-2-7-0-craches-in-utf-8-decoding-french-characters