I do the following:
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.
The following two numpy specs disagree
{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'int64', 'pandas_type': 'int64'}
RateCodeID: int64
{'metadata': None, 'field_name': 'RateCodeID', 'name': 'RateCodeID', 'numpy_type': 'float64', 'pandas_type': 'float64'}
RateCodeID: double
(look carefully!)
I suggest you supply dtypes for these columns upon loading, or use astype to coerce them to floats before writing.
This question gets at one of the nastier problems in Pandas and Dask: the nullability, or lack thereof, of data types. Missing data causes trouble especially for data types, such as integers, that have no designation for missing values.
Floats and datetimes are not too bad, because they have designated null (missing value) placeholders (NaN for floating-point values in numpy and NaT for datetimes in pandas) and are therefore nullable. But even those dtypes have problems in some circumstances.
The problem can arise when you read multiple CSV files (as in your case), pull from a database, or merge a small data frame into a larger one. You can end up with partitions in which some or all values for a given field are missing. For those partitions, Dask (like Pandas) will assign the field a dtype that can accommodate the missing-data indicator. In the case of integers, the new dtype will be float. That gets further transformed to double when writing to parquet.
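For illustration, here is a minimal sketch of that upcast in plain Pandas (the same rule Dask applies per partition):

import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)      # int64

# Reindexing introduces rows with no data; int64 has no slot for NaN,
# so Pandas silently upcasts the whole column to float64:
s2 = s.reindex(range(5))
print(s2.dtype)     # float64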
Dask will happily list a somewhat misleading dtype for the field, but when you write to parquet, the partitions with missing data get written as something else. In your case, the int64 got written as double in at least one parquet file. Then, when you attempted to read the entire Dask dataframe back, you got the ValueError shown above because of the mismatch.
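Here is a minimal sketch of how that mismatch can land on disk, using hypothetical file names under an out/ directory (it assumes a parquet engine such as pyarrow or fastparquet is installed; the exact point of failure depends on the engine and version):

import os
import pandas as pd
import dask.dataframe as dd

os.makedirs('out', exist_ok=True)

# File 0: every value is present, so the column is written as int64.
pd.DataFrame({'RateCodeID': [1, 2]}).to_parquet('out/part.0.parquet')

# File 1: a missing value forced float64, which parquet stores as double.
pd.DataFrame({'RateCodeID': [1.0, None]}).to_parquet('out/part.1.parquet')

# The int64/double schema mismatch surfaces here or at compute time:
df = dd.read_parquet('out/part.*.parquet')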
Until these problems can be resolved, you need to make sure all of your Dask fields have appropriate data in every row. For example, if you have an int64 field, then NaN values or some other non-integer representation of missing values are not going to work.
Your int64 field may have to be fixed in several steps:
1. Import Pandas:
import pandas as pd
2. Clean up the field data as float64, coercing missing values to NaN:
df['myint64'] = df['myint64'].map_partitions(
    pd.to_numeric,
    meta='f8',          # tell Dask the result dtype is float64
    errors='coerce'     # turn unparseable values into NaN
)
3. Select a sentinel value (e.g., -1.0) to substitute for NaN so that the cast to int64 will work:
df['myint64'] = df['myint64'].where(
    ~df['myint64'].isna(),    # keep the valid values...
    -1.0                      # ...and fill NaN with the sentinel
)
4. Cast your field to int64 and persist it all:
df['myint64'] = df['myint64'].astype('i8')
df = client.persist(df)
Then try the save and reread round trip.
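For example, a sketch of that round trip with a hypothetical output path:

import dask.dataframe as dd

# Write the repaired dataframe and read it back; every partition
# should now agree on int64 for the fixed field:
df.to_parquet('cleaned_parquet')
df2 = dd.read_parquet('cleaned_parquet')
print(df2['myint64'].dtype)    # int64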
Note: steps 1-2 are useful for fixing float64 fields.
Finally, to fix a datetime field, try this:
df['mydatetime'] = df['mydatetime'].map_partitions(
    pd.to_datetime,
    meta='M8[ns]',                  # datetime64[ns] result dtype
    infer_datetime_format=True,     # speed up parsing of consistent formats
    errors='coerce'                 # unparseable values become NaT
).persist()
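As a quick sanity check (assuming the column is named mydatetime, as above), you can confirm the dtype and count how many rows were coerced to NaT:

print(df['mydatetime'].dtype)                    # datetime64[ns]
print(df['mydatetime'].isna().sum().compute())   # rows coerced to NaT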