“Large data” work flows using pandas

被撕碎了的回忆 · 2020-11-21 07:32

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support.
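The classic out-of-core idiom in plain pandas is chunked reading: pass `chunksize` to `pd.read_csv` to get an iterator of DataFrames and combine partial aggregates yourself. A minimal self-contained sketch (the data and chunk size are illustrative):

```python
import io

import pandas as pd

# In-memory stand-in for a file too large to load at once.
csv_data = io.StringIO("user_id,value\n1,10\n2,20\n1,30\n2,40\n")

# Read in fixed-size chunks and accumulate partial sums/counts,
# so only one chunk is ever held in memory at a time.
totals, counts = {}, {}
for chunk in pd.read_csv(csv_data, chunksize=2):
    grouped = chunk.groupby("user_id")["value"]
    for uid, s in grouped.sum().items():
        totals[uid] = totals.get(uid, 0) + s
    for uid, c in grouped.count().items():
        counts[uid] = counts.get(uid, 0) + c

# Combine the per-chunk partials into final per-user means.
means = {uid: totals[uid] / counts[uid] for uid in totals}
print(means)  # → {1: 20.0, 2: 30.0}
```

The manual bookkeeping is the price of this approach; the dask answer below automates exactly this pattern.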

16 Answers
  •  予麋鹿 (OP) · 2020-11-21 08:22

    As noted by others, after some years an 'out-of-core' pandas equivalent has emerged: dask. Though dask is not a drop-in replacement for pandas and all of its functionality, it stands out for several reasons:

    Dask is a flexible parallel computing library for analytic computing, optimized for dynamic task scheduling of interactive computational workloads. It works with "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments, and it scales from laptops to clusters.

    Dask emphasizes the following virtues:

    • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
    • Native: Enables distributed computing in Pure Python with access to the PyData stack.
    • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
    • Scales up: Runs resiliently on clusters with 1000s of cores
    • Scales down: Trivial to set up and run on a laptop in a single process
    • Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans

    and to add a simple code sample:

    import dask.dataframe as dd
    df = dd.read_csv('2015-*-*.csv')
    df.groupby(df.user_id).value.mean().compute()
    

    replaces some pandas code like this:

    import pandas as pd
    df = pd.read_csv('2015-01-01.csv')
    df.groupby(df.user_id).value.mean()
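For comparison, scaling that single-file pandas snippet to many files by hand means an explicit loop plus `pd.concat` — every file is read eagerly, and the concatenated frame must fit in memory, which is exactly the limitation dask's glob-based `read_csv` removes. A self-contained sketch (the two tiny CSV files are fabricated stand-ins for the real '2015-*-*.csv' inputs):

```python
import glob
import os
import tempfile

import pandas as pd

# Fabricate two small daily files so the sketch runs anywhere.
tmpdir = tempfile.mkdtemp()
pd.DataFrame({"user_id": [1, 2], "value": [10, 20]}).to_csv(
    os.path.join(tmpdir, "2015-01-01.csv"), index=False)
pd.DataFrame({"user_id": [1, 2], "value": [30, 40]}).to_csv(
    os.path.join(tmpdir, "2015-01-02.csv"), index=False)

# Plain-pandas equivalent of dd.read_csv('2015-*-*.csv'):
# read every file eagerly, then concatenate in memory.
frames = [pd.read_csv(fn)
          for fn in sorted(glob.glob(os.path.join(tmpdir, "2015-*.csv")))]
df = pd.concat(frames, ignore_index=True)
result = df.groupby(df.user_id).value.mean()
print(result)  # per-user means across both files
```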
    

    and, especially noteworthy, it provides a general infrastructure for submitting custom tasks through an interface modeled on concurrent.futures:

    from dask.distributed import Client

    # connect to a running dask scheduler ('scheduler:port' is a placeholder address)
    client = Client('scheduler:port')

    # `load` and `summarize` stand in for user-defined functions;
    # `filenames` is an iterable of input paths
    futures = []
    for fn in filenames:
        future = client.submit(load, fn)
        futures.append(future)

    # dask resolves the futures in the list before calling summarize
    summary = client.submit(summarize, futures)
    summary.result()
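Because dask's client mirrors the standard library's concurrent.futures API, the submit/result pattern above can be tried locally with no cluster at all. A stdlib-only sketch, with toy stand-ins for `load` and `summarize`:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the `load` and `summarize` functions in the
# dask example; real code would read files and aggregate them.
def load(fn):
    return len(fn)  # pretend "loading" a file yields its name length

def summarize(results):
    return sum(results)

filenames = ["2015-01-01.csv", "2015-01-02.csv"]

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(load, fn) for fn in filenames]
    # Unlike dask, stdlib futures passed to another task are not
    # resolved automatically, so gather the results explicitly.
    summary = executor.submit(summarize, [f.result() for f in futures])
    total = summary.result()

print(total)  # → 28 (two 14-character names)
```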
    
