fastparquet

Moving data from a database to Azure blob storage

Posted by 懵懂的女人 on 2020-04-18 05:41:12
Question: I'm able to use dask.dataframe.read_sql_table to read the data, e.g.

    df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)

What would be the next (best) step for saving it as a Parquet file in Azure Blob Storage? From my small research there are a couple of options: save locally and use AzCopy (https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json), which is not great for big data; or adlfs, which I believe is to read from blob; use dask
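
One possibility (just a sketch, not from the excerpt above: the connection URI, container name, and account credentials are placeholders, and it assumes adlfs is installed so that dask/fsspec can resolve the abfs:// protocol) is to let dask write the Parquet output straight to blob storage, with no local copy:

    import dask.dataframe as dd

    # Placeholder database connection; reuse whatever URI already works
    # with read_sql_table in your environment.
    uri = "mssql+pyodbc://user:password@server/database"
    df = dd.read_sql_table(table="TABLE", uri=uri, index_col="field", npartitions=8)

    # With adlfs installed, dask can write each partition directly to
    # Azure Blob Storage through the abfs:// protocol.
    df.to_parquet(
        "abfs://mycontainer/exports/table.parquet",
        engine="fastparquet",
        storage_options={
            "account_name": "myaccount",  # placeholder
            "account_key": "...",         # placeholder
        },
    )

I believe adlfs also accepts other credential forms (for example a connection string or SAS token) in storage_options if an account key is not available.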

Decompression 'SNAPPY' not available with fastparquet

Posted by 时间秒杀一切 on 2020-01-29 06:29:19
Question: I am trying to use fastparquet to open a file, but I get the error:

    RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']

I have the following installed and have restarted my interpreter:

    python 3.6.5 hc3d631a_2
    python-snappy 0.5.2 py36_0 conda-forge
    snappy 1.1.7 hbae5bb6_3
    fastparquet 0.1.5 py36_0 conda-forge

Everything downloaded smoothly. I didn't know whether I needed snappy or python-snappy, so I installed one, which didn't fix it, and then the other, still with no success. All
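
A quick diagnostic that may help here (a sketch; fastparquet.compression.compressions is my understanding of where fastparquet keeps the codecs it detected, and the conda command is only a suggestion): snappy support is picked up when fastparquet is imported, and a frequent cause of this error is that python-snappy landed in a different environment than the one actually running the code.

    import sys

    # Check which interpreter/environment is actually running.
    print("Interpreter in use:", sys.executable)

    try:
        import snappy  # the module installed by the python-snappy package
        print("python-snappy found at:", snappy.__file__)
    except ImportError:
        print("python-snappy is not importable here; "
              "try: conda install -c conda-forge python-snappy")

    import fastparquet.compression
    # Codecs fastparquet detected when it was imported; restart the
    # interpreter after installing python-snappy so this list refreshes.
    print("Available codecs:", list(fastparquet.compression.compressions))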

Generating parquet files - differences between R and Python

Posted by 给你一囗甜甜゛ on 2019-12-10 21:48:44
Question: We have generated a Parquet file with Dask (Python) and with Drill (R, using the sergeant package). We have noticed a few differences: the Dask (i.e. fastparquet) output has _metadata and _common_metadata files, while the Parquet file from R/Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these Parquet implementations? Answer 1: (Only answering 1; please post separate questions to make them easier to answer.) _metadata
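
As a side note, fastparquet can read a Drill-written directory even without the _metadata files. The sketch below assumes a hypothetical "drill_output" directory, and fastparquet.writer.merge is, to my knowledge, the way to generate a combined _metadata file after the fact:

    import glob
    import fastparquet

    # Hypothetical directory written by Drill/R: only part files, no _metadata.
    parts = sorted(glob.glob("drill_output/*.parquet"))

    # fastparquet accepts a list of part files (or a directory path).
    pf = fastparquet.ParquetFile(parts)
    df = pf.to_pandas()

    # Optionally write a combined _metadata file so readers such as dask
    # can plan the read from the summary instead of opening every part.
    fastparquet.writer.merge(parts)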

filtering with dask read_parquet method gives unwanted results

Posted by 独自空忆成欢 on 2019-12-08 01:55:02
Question: I am trying to read Parquet files using the dask read_parquet method and the filters kwarg; however, it sometimes doesn't filter according to the given condition. Example: creating and saving a data frame with a dates column:

    import pandas as pd
    import numpy as np
    import dask.dataframe as dd

    nums = range(1, 6)
    dates = pd.date_range('2018-07-01', periods=5, freq='1d')
    df = pd.DataFrame({'dates': dates, 'nums': nums})
    ddf = dd.from_pandas(df, npartitions=3).to_parquet('test_par', engine='fastparquet')

When I read and filter on the dates column from the 'test_par' folder, it doesn't seem to work
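
If I understand the filters keyword correctly, it only prunes whole partitions/row groups whose min/max statistics rule them out; rows inside a partition that is kept are not filtered, which would explain the unexpected results. A minimal sketch of combining it with an explicit mask (the cutoff date is just an example):

    import pandas as pd
    import dask.dataframe as dd

    cutoff = pd.Timestamp("2018-07-03")

    # filters= can skip entire row groups, but it does not drop individual
    # rows inside the row groups that are still read.
    ddf = dd.read_parquet(
        "test_par",
        engine="fastparquet",
        filters=[("dates", ">", cutoff)],
    )

    # Apply an ordinary mask for an exact, row-level filter.
    result = ddf[ddf.dates > cutoff].compute()
    print(result)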

Streaming parquet file python and only downsampling

Posted by 回眸只為那壹抹淺笑 on 2019-12-07 16:07:25
Question: I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated! Answer 1: Spark is certainly a viable choice for this task. We're planning to add streaming
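
One non-Spark approach (a sketch; the file name and sampling fraction are placeholders, and it relies on fastparquet's ParquetFile.iter_row_groups, which as far as I know yields one pandas DataFrame per row group) is to down-sample the file chunk by chunk, so only one row group is in memory at a time:

    import pandas as pd
    from fastparquet import ParquetFile

    pf = ParquetFile("big_file.parquet")  # placeholder path

    samples = []
    for chunk in pf.iter_row_groups():            # one row group at a time
        samples.append(chunk.sample(frac=0.01))   # keep roughly 1% of each chunk

    df = pd.concat(samples, ignore_index=True)
    print(len(df))

Note that this only keeps memory bounded if the file was written with more than one row group.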
