Parallel excel sheet read from dask

℡╲_俬逩灬. Submitted on 2020-01-22 16:10:06

Question


Hello. All the examples I have come across for using dask so far involve multiple CSV files in a folder being read with dask's read_csv call.

If I am given an xlsx file with multiple tabs, can I use anything in dask to read them in parallel?

P.S. I am using pandas 0.19.2 with python 2.7


Answer 1:


For those using Python 3.6:

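# reading the file using dask
import dask
import dask.dataframe as dd
import pandas as pd

excel_file = 'my_file.xlsx'  # placeholder path to the workbook

# wrap the read in a delayed task; sheet_name=0 selects the first sheet only
parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols=[1, 2, 7])
df = dd.from_delayed(parts)

print(df.head())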

I'm seeing a 50% speed increase on load on a 5th-gen i7 machine with 16 GB of RAM.




Answer 2:


A simple example

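import dask
import dask.dataframe as dd
import pandas as pd

fn = 'my_file.xlsx'
# number_of_sheets and other_options are supplied by the caller
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options)
         for i in range(number_of_sheets)]
df = dd.from_delayed(parts, meta=parts[0].compute())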

This assumes that you provide the "other options" needed to extract the data (which must be uniform across sheets) and that you want to make a single master dataframe out of the set.

Note that I don't know the internals of the Excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory definitely would be.



Source: https://stackoverflow.com/questions/44654906/parallel-excel-sheet-read-from-dask
