Loading Cassandra Data into Dask Dataframe

Submitted by 心已入冬 on 2020-01-05 12:58:35

Question


I am trying to load data from a cassandra database into a Dask dataframe. I have tried querying the following with no success:

query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df)) 

TypeError                                 Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))

    TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'

Does anybody know an easy way to load data directly from Cassandra into Dask? The data set is too large to load into pandas first.


Answer 1:


Some problems with your code:

  • the line df = man.session.execute(query) presumably loads the whole result set into memory; Dask is not invoked here and plays no part in it. Someone with knowledge of the Cassandra driver could confirm this.

  • list(df) on a pandas dataframe produces a list of the column names and drops all of the row data

  • dd.DataFrame, if you read the docs, is not constructed like this — it is a low-level constructor that expects a task graph, name, meta, and divisions (hence the TypeError above).

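To see the second point concretely, iterating over a pandas dataframe yields its column labels, not its rows — which is why list(df) loses the data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Iterating a DataFrame yields column names only; the row data is dropped.
print(list(df))  # ['a', 'b']
```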
What you probably want to do is: a) write a function that returns one partition of the data as a pandas dataframe, b) delay this function and call it once per partition value, and c) use dd.from_delayed to build the Dask dataframe from those delayed pieces. E.g., assuming the table has a field partfield which handily has the possible values 1..6 and a similar number of rows for each value:

import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct a Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    rows = session.execute(q)
    # Each delayed call returns one *pandas* dataframe (one partition)
    return pd.DataFrame(list(rows))

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)


Source: https://stackoverflow.com/questions/53123633/loading-cassandra-data-into-dask-dataframe
