Transfer and write Parquet with python and pandas got timestamp error

时光怂恿深爱的人放手 提交于 2020-05-11 05:14:05

问题


I tried to concat() two parquet file with pandas in python .
It can work , but when I try to write and save the Data frame to a parquet file ,it display the error :

 ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:

I checked the doc. of pandas, it default the timestamp syntax in ms when write the parquet file.
How can I white the parquet file with used schema after concat?
Here is my code:

import pandas as pd

table1 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')
table2 = pd.read_parquet(path= ('path.parquet'),engine='pyarrow')

table = pd.concat([table1, table2], ignore_index=True) 
table.to_parquet('./file.gzip', compression='gzip')

回答1:


Pandas already forwards unknown kwargs to the underlying parquet-engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work - I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords see the pyarrow docs.




回答2:


I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.

The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.

import pandas as pd

table1 = pd.read_parquet(path=('path1.parquet'))
table2 = pd.read_parquet(path=('path2.parquet'))

table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")

table = pd.concat([table1, table2], ignore_index=True) 
table.to_parquet('./file.gzip', compression='gzip')



回答3:


I experienced a similar problem while using pd.to_parquet, my final workaround was to use the argument engine='fastparquet', but I realize this doesn't help if you need to use PyArrow specifically.

Things I tried which did not work:

  • @DrDeadKnee's workaround of manually casting columns .astype("datetime64[ms]") did not work for me (pandas v. 0.24.2)
  • Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change behaviour.


来源:https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!