Question
I am quite troubled by the behaviour of pandas DataFrames regarding dtype detection.
I use read_sql_query to retrieve data from a database to build a DataFrame, and then dump it into a CSV file.
I don't need any transformation. Just dump it into a CSV file, with date fields formatted as '%d/%m/%Y'.
However:
self.dataframe.to_csv(self.fic,
                      index=False,
                      header=False,
                      sep='|',
                      mode='a',
                      encoding='utf-8',
                      line_terminator='\n',
                      date_format='%d/%m/%Y'
                      )
would fail to transform/format some date fields...
I tried to do it another way:
# Convert every column detected as datetime64 to a formatted string
l = list(self.dataframe.select_dtypes(include=['datetime64']).columns)
for i in l:
    self.dataframe[i] = self.dataframe[i].dt.strftime('%d/%m/%Y')
I was about to be satisfied, but some more tests showed a weird behaviour:
If my SQL query only selects two rows:
requete = 'select * from DOMMAGE_INTERET where doi_id in (176433, 181564)'
Everything works, whatever formatting I use, in the CSV or in the DataFrame.
It detects date fields properly:
df.dtypes
doi_id int64
aff_id int64
pdo_id int64
doi_date_decision datetime64[ns]
doi_date_mod datetime64[ns]
doi_montant float64
doi_reste_a_payer object
doi_appliquer_taux int64
doi_date_update datetime64[ns]
afg_id int64
dtype: object
But when using a different selection:
requete = 'select * from DOMMAGE_INTERET where rownum < 100'
It fails again. And actually, the field types are detected differently:
doi_id int64
aff_id int64
pdo_id int64
doi_date_decision object
doi_date_mod datetime64[ns]
doi_montant float64
doi_reste_a_payer object
doi_appliquer_taux int64
doi_date_update datetime64[ns]
afg_id int64
dtype: object
As you can see, the type of 'doi_date_decision' changes depending on the query selection, but of course this is the same set of data!
Isn't that weird?
Do you have an explanation for this behaviour?
Answer 1:
Your to_csv operation does not convert all specified date fields because, as you mention, not all datetime columns are read in as datetime; some show up as strings (object dtype) in the current DataFrame. This is an unfortunate side effect of reading from external sources: the importing system (this includes Python, SAS, Stata, R, Excel, etc.) usually attempts to infer column types from the first few rows unless the types are explicitly defined.
Fortunately, pandas' read_sql_query() provides a parse_dates parameter. So consider defining the dates during the read operation, as this argument takes a list or dictionary:
df = pd.read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine,
                       parse_dates=['doi_date_decision', 'doi_date_mod', 'doi_date_update'])
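For completeness, a sketch of the dictionary form, which maps a column name to an explicit format string and skips inference entirely (the format string here is an assumption about how the dates are stored; adjust it to your data):

# parse_dates as a dict: {column_name: format string} (assumed format)
df = pd.read_sql_query('select * from DOMMAGE_INTERET where rownum < 100', engine,
                       parse_dates={'doi_date_decision': '%Y-%m-%d %H:%M:%S'})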
Alternatively, convert with pd.to_datetime() just after reading in and before to_csv:
df['doi_date_decision'] = pd.to_datetime(df['doi_date_decision'])
And most RDBMSs store datetimes in YYYY-MM-DD HH:MM:SS format, which aligns with pandas' format.
Answer 2:
It is difficult to dig into your issue without some data samples. However, you probably face one of two cases:
- Some of the rows you select in your second case contain NULL values, which stops pandas from automatically interpreting your column as a datetime.
- Your database uses a different MDY convention, and dates with a day number lower than 13 are parsed as dates while the others aren't and are kept as strings, until you convert them manually to DMY. A small sketch of this second point follows the list.
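To illustrate the MDY/DMY point (hypothetical sample data, not from the original post): pandas defaults to month-first parsing, so DMY strings need an explicit dayfirst=True:

import pandas as pd

s = pd.Series(['02/09/2016', '15/08/2016'])  # DMY strings (hypothetical)
print(pd.to_datetime(s[0]))                  # 2016-02-09: parsed as MDY by default
print(pd.to_datetime(s, dayfirst=True))      # 2016-09-02, 2016-08-15: parsed as DMY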
Answer 3:
Thanks to Boud and Parfait. Their answers are right:
All my tests show that missing date fields can make dtype detection fail.
read_sql_query() has a parameter to define which fields have a date type, I guess to cure this problem.
Sadly, until now I have been using a completely generic treatment to process a hundred tables. Using read_sql_query's parse_dates parameter would imply prior metadata definition work (like a JSON file describing each table).
Actually, I also found out that integers are changed to floats when there is a NaN in the column...
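This is easy to reproduce (a small illustration, not from the original post): NaN is a float value, so pandas upcasts the whole column.

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64: the NaN forces the integer column to float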
If I were reading flat CSV files, I could understand that data types can be hard to detect... but from a database (read_sql_query)! Pandas has SQLAlchemy as a dependency. And SQLAlchemy (and even any lower-level Python database driver, e.g. cx_Oracle, or anything implementing the DB API) has a reflection mechanism to detect data types. Pandas could have used that metadata to preserve data type integrity.
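For what it's worth, a minimal sketch of that idea, assuming a SQLAlchemy engine named engine: reflect the table's columns, collect the date/datetime ones, and feed them to parse_dates, so a generic treatment needs no hand-written metadata file:

import pandas as pd
import sqlalchemy as sa

def date_columns(engine, table_name):
    # Ask the database, via SQLAlchemy reflection, which columns hold dates
    inspector = sa.inspect(engine)
    return [col['name'] for col in inspector.get_columns(table_name)
            if isinstance(col['type'], (sa.Date, sa.DateTime))]

# Table-name casing may need adjusting depending on the dialect
requete = 'select * from DOMMAGE_INTERET where rownum < 100'
df = pd.read_sql_query(requete, engine,
                       parse_dates=date_columns(engine, 'DOMMAGE_INTERET'))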
Source: https://stackoverflow.com/questions/39298989/python-pandas-dtypes-detection-from-sql