How to read a compressed (gz) CSV file into a dask Dataframe?

橙三吉。 Submitted on 2020-01-03 08:41:34

Question


Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but I get a Unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression = "gz" doesn't work, and I can't find any documentation so far.
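The guess about the compressed bytes can be checked with a small standard-library sketch. The CSV text below is made up and only stands in for the contents of Data.gz; the point is that raw gzip bytes are not valid UTF-8, while the decompressed bytes are.

```python
import gzip

# Made-up sample data standing in for the contents of Data.gz.
csv_text = "id,name\n1,alpha\n2,beta\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
# 0x8b cannot start a UTF-8 character, so decoding the raw bytes fails.
try:
    compressed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decompressing first gives back the original text.
restored = gzip.decompress(compressed).decode("utf-8")
```

So a reader that skips decompression will likely fail on the second byte of the file, before it sees any CSV content.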

With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) If I restrict the number of lines, it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)  # read only the first 100 rows

Answer 1:


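This is actually a long-standing limitation of dask: gzip does not support efficient random access, so dask cannot split a single compressed file into blocks. Load the files with dask.delayed instead: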

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]

df = dd.from_delayed(dfs) # df is a dask dataframe



Answer 2:


The current pandas documentation says:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

Since 'infer' is the default, that would explain why it is working with pandas.

Dask's documentation on the compression argument:

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

That would suggest that it should also infer the compression for at least gz. That it doesn't (and still doesn't in 0.15.3) may be a bug. However, it does work if you pass compression='gzip' explicitly.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')



Answer 3:


Without the file it's difficult to say. What if you set the encoding, like # -*- coding: latin-1 -*-? Or, since read_csv is based on pandas, you could even try dd.read_csv('Data.gz', encoding='utf-8'). Here's the list of Python encodings: https://docs.python.org/3/library/codecs.html#standard-encodings



Source: https://stackoverflow.com/questions/39924518/how-to-read-a-compressed-gz-csv-file-into-a-dask-dataframe
