dask dataframe read parquet schema difference


The following two column specs disagree:

{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'int64', 'pandas_type': 'int64'}

RateCodeID: int64 


{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'float64', 'pandas_type': 'float64'}

RateCodeID: double

(look carefully!)

I suggest you supply dtypes for these columns upon loading, or use astype to coerce them to floats before writing.
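
For example, a minimal sketch of both options (the CSV paths and output location are placeholders):

    import dask.dataframe as dd

    # Option 1: force a consistent dtype at load time
    df = dd.read_csv('taxi-*.csv', dtype={'RateCodeID': 'float64'})

    # Option 2: coerce before writing so every partition agrees on the schema
    df = df.astype({'RateCodeID': 'float64'})
    df.to_parquet('taxi.parquet')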

This question gets at one of the nastier problems in Pandas and Dask: the nullability, or lack thereof, of data types. Missing data causes trouble especially for data types, such as integers, that have no designated missing-value representation.

Floats and datetimes are not too bad, because they have designated null, or missing-value, placeholders (NaN for floating-point values in numpy and NaT for datetimes in pandas) and are therefore nullable. But even those dtypes have problems in some circumstances.

The problem can arise when you read multiple CSV files (as in your case), pull from a database, or merge a small data frame into a larger one. You can end up with partitions in which some or all values for a given field are missing. For those partitions, both Dask and Pandas will assign the field a dtype that can accommodate the missing-data indicator. In the case of integers, the new dtype will be float. That gets further transformed to double when writing to parquet.
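
You can see this promotion in plain pandas:

    import pandas as pd

    pd.Series([1, 2, 3]).dtype     # dtype('int64')
    pd.Series([1, 2, None]).dtype  # dtype('float64') -- NaN forces the upcast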

Dask will happily list a somewhat misleading dtype for the field. But when you write to parquet, the partitions with missing data get written as something else. In your case, the int64 got written as double in at least one parquet file. Then, when you attempted to read the entire Dask dataframe, you got the ValueError shown above because of the mismatch.
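
Here is a minimal sketch of how that mismatch can end up on disk (the file names are placeholders, and a parquet engine such as pyarrow or fastparquet is assumed to be installed):

    import pandas as pd
    import dask.dataframe as dd

    # The complete partition keeps int64; the partition with a missing
    # value is promoted to float64 and lands in parquet as "double".
    pd.DataFrame({'RateCodeID': [1, 2]}).to_parquet('data.0.parquet')
    pd.DataFrame({'RateCodeID': [3, None]}).to_parquet('data.1.parquet')

    # Reading both files together should then fail with the schema-mismatch
    # ValueError described above.
    df = dd.read_parquet('data.*.parquet')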

Until these problems can be resolved, you need to make sure all of your Dask fields have appropriate data in every row. For example, if you have an int64 field, then NaN values or some other non-integer representation of missing values are not going to work.

Your int64 field may have to be fixed in several steps:

  1. Import Pandas:

    import pandas as pd
    
  2. Convert the field data to float64, coercing missing values to NaN:

    df['myint64'] = df['myint64'].map_partitions(
        pd.to_numeric,
        meta='f8',
        errors='coerce'
    )
    
  3. Select a sentinel value (e.g., -1.0) to substitute for NaN so that the cast to int64 will work:

    df['myint64'] = df['myint64'].where(
        ~df['myint64'].isna(),
        -1.0
    )
    
  4. Cast your field to int64 and persist it all:

    df['myint64'] = df['myint64'].astype('i8')
    df = client.persist(df)
    
  5. Then try the save-and-reread round trip, as sketched below.
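
A sketch of that round trip (the output path is a placeholder; df is the frame from the steps above):

    import dask.dataframe as dd

    df.to_parquet('fixed.parquet')
    df2 = dd.read_parquet('fixed.parquet')
    df2.dtypes  # 'myint64' should now read back as int64 in every partition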

Note: steps 1-2 are useful for fixing float64 fields.

Finally, to fix a datetime field, try this:

    df['mydatetime'] = df['mydatetime'].map_partitions(
        pd.to_datetime,
        meta='M8[ns]',
        infer_datetime_format=True,
        errors='coerce'
    ).persist()