Question
I'm trying to read a CSV file with dask, and it raises the error below. However, I want my ARTICLE_ID column to be object (string). Can anyone help me read the data successfully?
The traceback is below:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+--------+----------+
| Column | Found | Expected |
+------------+--------+----------+
| ARTICLE_ID | object | int64 |
+------------+--------+----------+
The following columns also raised exceptions on conversion:
ARTICLE_ID:
ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'ARTICLE_ID': 'object'}
to the call to `read_csv`/`read_table`.
Answer 1:
The message is suggesting that you change your call from
df = dd.read_csv('mylocation.csv', ...)
to
df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})
where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.
Answer 2:
You can use the sample parameter of the read_csv method and assign it an integer giving the number of bytes to use when determining dtypes. For example, I had to set it to 25000000 to correctly infer the types of my data, which has shape (171907, 161).
df = dd.read_csv("game_logs.csv", sample=25000000)
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
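To see why the amount of data used for inference matters, here is a rough pandas-only sketch that mimics inferring dtypes from just the start of a file. It is an illustration of the idea, not dask's actual inference code, and the column names and values are invented.

```python
import io

import pandas as pd

# Hypothetical data: 100 numeric IDs, then one text ID near the end of the file.
csv_text = "ARTICLE_ID,SCORE\n"
csv_text += "".join(f"{i},{i % 7}\n" for i in range(100))
csv_text += "not-a-number,3\n"

# Looking only at the first rows, the column appears to be purely integer.
head = pd.read_csv(io.StringIO(csv_text), nrows=10)
head_is_int = pd.api.types.is_integer_dtype(head["ARTICLE_ID"])

# Looking at all rows, the text value forces a non-integer (string/object) dtype.
full = pd.read_csv(io.StringIO(csv_text))
full_is_int = pd.api.types.is_integer_dtype(full["ARTICLE_ID"])

print(head_is_int, full_is_int)
```

If the offending value sits beyond the part of the file used for inference, it is never seen, so passing dtype explicitly is the safer option when you already know which column should be a string.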
Source: https://stackoverflow.com/questions/52486658/dask-read-csv-mismatched-dtypes-found-in-pd-read-csv-pd-read-table