parquet

pyarrow data types for columns that have lists of dictionaries?

冷眼眸甩不掉的悲伤 submitted on 2021-01-07 01:36:23
Question: Is there a special pyarrow data type I should use for columns that contain lists of dictionaries when I save to a Parquet file? If I save lists or lists of dictionaries as strings, I normally have to .apply(eval) the field when I read it back into memory so that pandas recognizes the data as a list (so I can normalize it with pd.json_normalize):

    column_a: [ {"id": "something", "value": "else"}, {"id": "something2", "value": "else2"}, ]
    column_b: ["test", "test2", "test3"]

Just wondering

Can't install python-snappy wheel in Pycharm

大兔子大兔子 submitted on 2021-01-07 01:23:07
Question: I have a question here, and I followed this answer https://stackoverflow.com/a/43756412/12375559 to download the file and install it from my Windows prompt. It seems that python-snappy has been installed:

    C:\Users\xxxx\IdeaProjects\xxxx\venv>pip install python_snappy-0.5.4-cp38-cp38-win32.whl
    Processing c:\users\xxxxxx\ideaprojects\xxxxxx\venv\python_snappy-0.5.4-cp38-cp38-win32.whl
    Installing collected packages: python-snappy
    Successfully installed python-snappy-0.5.4
    WARNING: You
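The excerpt is cut off, so the actual cause is not shown. One thing that may be worth ruling out is a mismatch between the wheel's tags (cp38, win32) and the interpreter PyCharm is actually using. The sketch below is only a rough check: wheel_matches is a hypothetical helper, it assumes CPython, and it only understands the Windows platform tags win32 and win_amd64.

```python
import struct
import sys


def interpreter_tag():
    """Return (python_tag, bits) for the running interpreter, e.g. ('cp38', 32)."""
    python_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"  # assumes CPython
    bits = struct.calcsize("P") * 8  # pointer size in bits: 32 or 64
    return python_tag, bits


def wheel_matches(wheel_name, python_tag, bits):
    """Rough check of a wheel filename against an interpreter (hypothetical helper)."""
    # Wheel filenames end in {python tag}-{abi tag}-{platform tag}.whl
    wheel_python, _abi, platform_tag = wheel_name[:-len(".whl")].split("-")[-3:]
    wheel_bits = {"win32": 32, "win_amd64": 64}.get(platform_tag)
    return wheel_python == python_tag and wheel_bits == bits


wheel = "python_snappy-0.5.4-cp38-cp38-win32.whl"
print(sys.executable)  # which interpreter is running this script
print(interpreter_tag(), wheel_matches(wheel, *interpreter_tag()))
```

Running this from PyCharm's Python console and from the command prompt shows whether both are using the same interpreter.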

[Spark Series 7] Analysis of the ACID transaction implementation for Spark Delta writes

只愿长相守 submitted on 2021-01-05 16:11:30
Background

This article is based on Delta 0.7.0 and Spark 3.0.1. Our earlier article, "Prequel to ACID transactions in Spark Delta writes: an analysis of the file-writing base classes FileFormat/FileCommitProtocol", analyzed how Delta writes data, but it did not cover how the Delta log is written, which is the core of the ACID implementation.

## Analysis

Go straight to WriteIntoDelta.run:

    override def run(sparkSession: SparkSession): Seq[Row] = {
      deltaLog.withNewTransaction { txn =>
        val actions = write(txn, sparkSession)
        val operation = DeltaOperations.Write(mode, Option(partitionColumns), options.replaceWhere, options.userMetadata)
        txn.commit(actions, operation)
      }
      Seq.empty
    }

Now let's look at the deltaLog.withNewTransaction method:

    def withNewTransaction[T](thunk: OptimisticTransaction => T): T = {
      try {
        update()
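To make the optimistic pattern behind withNewTransaction and txn.commit more concrete, here is a toy sketch in Python. It is not Delta's implementation: the commit function and the file layout are simplified assumptions, and it leaves out the conflict checks a real transaction runs before retrying. It only shows the core idea that each commit tries to create the next numbered log file, and a writer that loses the race retries on top of the newer version.

```python
import json
import os
import tempfile


def commit(log_dir, actions, max_attempts=10):
    """Write actions as the next numbered log entry and return the version used."""
    os.makedirs(log_dir, exist_ok=True)
    for _ in range(max_attempts):
        # Re-read the log to find the latest committed version (loosely like update()).
        versions = [int(name.split(".")[0])
                    for name in os.listdir(log_dir) if name.endswith(".json")]
        next_version = max(versions, default=-1) + 1
        path = os.path.join(log_dir, f"{next_version:020d}.json")
        try:
            # Mode "x" fails if the file already exists, so only one writer wins a version.
            with open(path, "x") as f:
                for action in actions:
                    f.write(json.dumps(action) + "\n")
            return next_version
        except FileExistsError:
            continue  # another writer took this version; retry on top of it
    raise RuntimeError("could not commit after retries")


# Hypothetical usage with a temporary directory standing in for a table's log.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
first = commit(log_dir, [{"add": {"path": "part-0000.parquet"}}])
second = commit(log_dir, [{"add": {"path": "part-0001.parquet"}}])
```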

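Core OLAP Concepts Every Big Data Engineer Should Know

陌路散爱 submitted on 2021-01-05 12:00:36
OLAP systems are widely used for BI, reporting, ad-hoc querying, and ETL/data-warehouse analysis. This article takes a systematic look at the core technical points of OLAP systems, distilling what existing OLAP systems in the industry have in common. It has four chapters: storage, compute, the optimizer, and trends.

01 Storage

How column stores organize data

Row storage can be seen as the NSM (N-ary Storage Model) layout. It has always accompanied relational databases and suits OLTP workloads. For example, in InnoDB's [1] B+ tree clustered index, each page holds a number of sorted rows, which supports tuple-at-a-time point lookups and updates well. Column-oriented storage went through the early DSM (Decomposition Storage Model) [2] and the later PAX (Partition Attributes Across), which tries to mix NSM and DSM, and it became widely known after the C-Store paper [3] as the layout used for OLAP. Unlike transactional workloads, analytical workloads are often bottlenecked on storage I/O. A column store can read only the columns it needs and skip irrelevant data, which avoids I/O amplification. Homogeneous data is also stored more compactly and is friendly to encoding and compression. These advantages reduce I/O and thereby improve performance.

How column stores organize data

For basic types such as numbers and strings, a column store can use suitable encodings to reduce data size. In C-Store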
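As a small illustration of why homogeneous column data encodes well, the sketch below applies two common lightweight encodings, dictionary encoding followed by run-length encoding, to one column. It is a toy example under simple assumptions (a single in-memory string column, hypothetical function names), not a description of what C-Store or any particular engine does.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return [tuple(run) for run in runs]


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


city = ["Paris", "Paris", "Paris", "Tokyo", "Tokyo", "Paris"]
dictionary, codes = dictionary_encode(city)
runs = run_length_encode(codes)
print(dictionary, runs)  # ['Paris', 'Tokyo'] [(0, 3), (1, 2), (0, 1)]
```

Sorting or clustering the column first lengthens the runs, which is one reason sort order matters for column stores.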

Updating values in apache parquet file

守給你的承諾、 submitted on 2020-12-29 02:56:28
Question: I have a fairly hefty Parquet file in which I need to change the values of one of the columns. One way to do this would be to update those values in the source text files and recreate the Parquet file, but I'm wondering whether there is a less expensive and overall easier solution.

Answer 1: Let's start with the basics: Parquet is a file format that needs to be saved in a file system. Key questions: Does Parquet support append operations? Does the file system (namely, HDFS) allow appends to files? Can the job framework

Error when installing python-snappy in PyCharm

柔情痞子 submitted on 2020-12-13 21:04:45
Question: I have a '.snappy.parquet' file and I want to view its contents. I know I can use pandas or PySpark, but this is beyond my knowledge and I'm not sure what to do. Can someone help me, please? I've been struggling for almost a day now. Many thanks. (And if I can't fix this issue, do I have other options for converting this file to a readable format?)

Answer 1: This issue was solved by using the approach here: Can't install python-snappy wheel in Pycharm

Answer 2: You need snappy library
