问题
Hello All the examples that I came across for using dask thus far has been multiple csv files in a folder being read using dask read_csv call.
if I am provided an xlsx file with multiple tabs, can I use anything in dask to read them parallely?
P.S. I am using pandas 0.19.2 with python 2.7
回答1:
For those using Python 3.6:
#reading the file using dask
import dask
import dask.dataframe as dd
from dask.delayed import delayed
parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols = [1, 2, 7])
df = dd.from_delayed(parts)
print(df.head())
I'm seeing a 50% speed increase on load on a i7, 16GB 5th Gen machine.
回答2:
A simple example
fn = 'my_file.xlsx'
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options)
for i in range(number_of_sheets)]
df = dd.from_delayed(parts, meta=parts[0].compute())
Assuming you provide the "other options" to extract the data (which is uniform across sheets) and you want to make a single master data-frame out of the set.
Note that I don't know the internals of the excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory would definitely be.
来源:https://stackoverflow.com/questions/44654906/parallel-excel-sheet-read-from-dask