How to create Dask DataFrame from a list of urls?

孤者浪人 提交于 2020-01-13 13:12:13

问题


I have a list of the URLs, and I'd love to read them to the dask data frame at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that?

Here is an example:

link = 'http://web.mta.info/developers/'

data = [     'data/nyct/turnstile/turnstile_170128.txt',
                        'data/nyct/turnstile/turnstile_170121.txt',
                        'data/nyct/turnstile/turnstile_170114.txt',
                        'data/nyct/turnstile/turnstile_170107.txt' 
        ]

and what I want is

df = dd.read_csv('XXXX*X')


回答1:


Try using dask.delayed to turn each of your urls into a lazy pandas dataframe and then use dask.dataframe.from_delayed to turn those lazy values into a full dask dataframe

import pandas as pd
import dask
import dask.dataframe as dd

dfs = [dask.delayed(pd.read_csv)(url) for url in urls]

df = dd.from_delayed(dfs)

This will read one of your links immediately in order to figure out metadata (column, dtypes). If you know these dtypes and links ahead of time then you can avoid this by passing a sample empty dataframe to dd.from_delayed(..., meta=sample_df)

See also: http://dask.pydata.org/en/latest/delayed-collections.html



来源:https://stackoverflow.com/questions/43104302/how-to-create-dask-dataframe-from-a-list-of-urls

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!