fastparquet

Moving data from a database to Azure blob storage

Posted by 懵懂的女人 on 2020-04-18 05:41:12
Question: I'm able to use dask.dataframe.read_sql_table to read the data, e.g.

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)

What would be the next (best) step for saving it as a Parquet file in Azure Blob Storage? From my small research there are a couple of options: save locally and use AzCopy (https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json), which is not great for big data; or adlfs, which I believe is to read from blob; use dask
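
One possibility (just a sketch, not from the excerpt above: the connection URI, container name, and account credentials are placeholders, and it assumes adlfs is installed so that dask/fsspec can resolve the abfs:// protocol) is to let dask write the Parquet output straight to blob storage, with no local copy:

    import dask.dataframe as dd

    # Placeholder database connection; reuse whatever URI already works
    # with read_sql_table in your environment.
    uri = "mssql+pyodbc://user:password@server/database"
    df = dd.read_sql_table(table="TABLE", uri=uri, index_col="field", npartitions=8)

    # With adlfs installed, dask can write each partition directly to
    # Azure Blob Storage through the abfs:// protocol.
    df.to_parquet(
        "abfs://mycontainer/exports/table.parquet",
        engine="fastparquet",
        storage_options={
            "account_name": "myaccount",  # placeholder
            "account_key": "...",         # placeholder
        },
    )

I believe adlfs also accepts other credential forms (for example a connection string or SAS token) in storage_options if an account key is not available.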

Decompression 'SNAPPY' not available with fastparquet

Posted by 时间秒杀一切 on 2020-01-29 06:29:19
Question: I am trying to use fastparquet to open a file, but I get the error:

    RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']

I have the following installed and have restarted my interpreter:

    python 3.6.5 hc3d631a_2
    python-snappy 0.5.2 py36_0 conda-forge
    snappy 1.1.7 hbae5bb6_3
    fastparquet 0.1.5 py36_0 conda-forge

Everything downloaded smoothly. I didn't know whether I needed snappy or python-snappy, so I installed one, which didn't fix it, and then the other, still with no success. All
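
A quick diagnostic that may help here (a sketch; fastparquet.compression.compressions is my understanding of where fastparquet keeps the codecs it detected, and the conda command is only a suggestion): snappy support is picked up when fastparquet is imported, and a frequent cause of this error is that python-snappy landed in a different environment than the one actually running the code.

    import sys

    # Check which interpreter/environment is actually running.
    print("Interpreter in use:", sys.executable)

    try:
        import snappy  # the module installed by the python-snappy package
        print("python-snappy found at:", snappy.__file__)
    except ImportError:
        print("python-snappy is not importable here; "
              "try: conda install -c conda-forge python-snappy")

    import fastparquet.compression
    # Codecs fastparquet detected when it was imported; restart the
    # interpreter after installing python-snappy so this list refreshes.
    print("Available codecs:", list(fastparquet.compression.compressions))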

Generating parquet files - differences between R and Python

Posted by 给你一囗甜甜゛ on 2019-12-10 21:48:44
Question: We have generated a Parquet file with Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few differences: the Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the Parquet file from R/Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations? Answer 1: (Only answering 1; please post separate questions to make them easier to answer.) _metadata
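
As a side note, fastparquet can read a Drill-written directory even without the _metadata files. The sketch below assumes a hypothetical "drill_output" directory, and fastparquet.writer.merge is, to my knowledge, the way to generate a combined _metadata file after the fact:

    import glob
    import fastparquet

    # Hypothetical directory written by Drill/R: only part files, no _metadata.
    parts = sorted(glob.glob("drill_output/*.parquet"))

    # fastparquet accepts a list of part files (or a directory path).
    pf = fastparquet.ParquetFile(parts)
    df = pf.to_pandas()

    # Optionally write a combined _metadata file so readers such as dask
    # can plan the read from the summary instead of opening every part.
    fastparquet.writer.merge(parts)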

filtering with dask read_parquet method gives unwanted results

Posted by 独自空忆成欢 on 2019-12-08 01:55:02
Question: I am trying to read Parquet files using the dask read_parquet method and the filters kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column:

    import pandas as pd
    import numpy as np
    import dask.dataframe as dd

    nums = range(1, 6)
    dates = pd.date_range('2018-07-01', periods=5, freq='1d')
    df = pd.DataFrame({'dates': dates, 'nums': nums})
    ddf = dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')

When I read and filter on the dates column from the 'test_par' folder, it doesn't seem to work
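
If I understand the filters keyword correctly, it only prunes whole partitions/row groups whose min/max statistics rule them out; rows inside a partition that is kept are not filtered, which would explain the unexpected results. A minimal sketch of combining it with an explicit mask (the cutoff date is just an example):

    import pandas as pd
    import dask.dataframe as dd

    cutoff = pd.Timestamp("2018-07-03")

    # filters= can skip entire row groups, but it does not drop individual
    # rows inside the row groups that are still read.
    ddf = dd.read_parquet(
        "test_par",
        engine="fastparquet",
        filters=[("dates", ">", cutoff)],
    )

    # Apply an ordinary mask for an exact, row-level filter.
    result = ddf[ddf.dates > cutoff].compute()
    print(result)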

Streaming parquet file python and only downsampling

Posted by 回眸只為那壹抹淺笑 on 2019-12-07 16:07:25
Question: I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated! Answer 1: Spark is certainly a viable choice for this task. We're planning to add streaming
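
One non-Spark approach (a sketch; the file name and sampling fraction are placeholders, and it relies on fastparquet's ParquetFile.iter_row_groups, which as far as I know yields one pandas DataFrame per row group) is to down-sample the file chunk by chunk, so only one row group is in memory at a time:

    import pandas as pd
    from fastparquet import ParquetFile

    pf = ParquetFile("big_file.parquet")  # placeholder path

    samples = []
    for chunk in pf.iter_row_groups():            # one row group at a time
        samples.append(chunk.sample(frac=0.01))   # keep roughly 1% of each chunk

    df = pd.concat(samples, ignore_index=True)
    print(len(df))

Note that this only keeps memory bounded if the file was written with more than one row group.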
