Dask read_csv— Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

问题

I'm trying to use dask to read csv file, and it gave me an error like below. But the thing is I want my ARTICLE_ID be object(string). Anyone can help me to read data successfully?

Traceback is like below:

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+------------+--------+----------+

| Column     | Found  | Expected |

+------------+--------+----------+

| ARTICLE_ID | object | int64    |

+------------+--------+----------+

The following columns also raised exceptions on conversion:

ARTICLE_ID:


ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'ARTICLE_ID': 'object'}

to the call to `read_csv`/`read_table`.

回答1:

The message is suggesting that your change your call from

df = dd.read_csv('mylocation.csv', ...)

df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})

where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.

回答2:

You can use sample parameter in read_csv method and assign it an integer to indicate the number of bytes to use when determining dtypes. For example, I had to give it 25000000 to correctly infer the types of my data in the shape of (171907, 161).

df = dd.read_csv("game_logs.csv", sample=25000000)

https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv

来源：https://stackoverflow.com/questions/52486658/dask-read-csv-mismatched-dtypes-found-in-pd-read-csv-pd-read-table

标签

python

dataframe

dask