parquet

Read specific column from Parquet without using Spark

Anonymous (unverified), submitted 2019-12-03 01:36:02
Question: I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the Parquet file with Spark. Below is my code:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson{
  def main (args : Array[String]):Unit= {
    //case class Customer(key: Int
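For reference, a minimal sketch of how column projection can be done with parquet-avro and no Spark: set a requested projection on the Hadoop Configuration before building the reader, so only the listed columns are materialized. The file path, record name, and "key" field below are assumptions based on the truncated code above, not the asker's actual schema.

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader

object parquetColumnReader {
  def main(args: Array[String]): Unit = {
    // Hypothetical input file; adjust to the real location.
    val file = new Path("/tmp/customers.parquet")

    // Avro schema listing only the columns we want back.
    val projection = new Schema.Parser().parse(
      """{"type":"record","name":"Customer","fields":[{"name":"key","type":"int"}]}""")

    // Register the projection before the reader is built.
    val conf = new Configuration()
    AvroReadSupport.setRequestedProjection(conf, projection)

    val reader: ParquetReader[GenericRecord] =
      AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()

    // read() returns null once the file is exhausted.
    Iterator.continually(reader.read()).takeWhile(_ != null)
      .foreach(record => println(record.get("key")))

    reader.close()
  }
}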

Methods for writing Parquet files using Python?

蓝咒, submitted 2019-12-03 01:23:10
I'm having trouble finding a library that allows Parquet files to be written using Python. Bonus points if I can use Snappy or a similar compression mechanism in conjunction with it. Thus far the only method I have found is using Spark with the pyspark.sql.DataFrame Parquet support. I have some scripts that need to write Parquet files that are not Spark jobs. Is there any approach to writing Parquet files in Python that doesn't involve pyspark.sql? Update (March 2017): There are currently two libraries capable of writing Parquet files: fastparquet and pyarrow. Both of them are still under heavy

How to read parquet data from S3 to spark dataframe Python?

Anonymous (unverified), submitted 2019-12-03 01:20:02
Question: I am new to Spark and I am not able to find this... I have a lot of Parquet files uploaded into S3 at the location s3://a-dps/d-l/sco/alpha/20160930/parquet/ and the total size of this folder is 20+ GB. How do I chunk and read this into a dataframe? How do I load all these files into a dataframe? The memory allocated to the Spark cluster is 6 GB.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
import pandas
# SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop
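A minimal sketch of the Spark side of this, written in Scala to match the other snippets on this page (the PySpark calls carry the same names). The s3a:// scheme and the credential properties are assumptions about the cluster's Hadoop/S3 setup; Spark reads the files lazily and in partitions, so the 20+ GB folder does not need to fit into the 6 GB of cluster memory unless it is cached.

import org.apache.spark.sql.SparkSession

object ReadParquetFromS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-s3-parquet")
      .getOrCreate()

    // Credentials are normally supplied via instance roles or environment
    // variables; setting them on the Hadoop configuration is one option.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // All part files under the prefix are read as one DataFrame.
    val df = spark.read.parquet("s3a://a-dps/d-l/sco/alpha/20160930/parquet/")
    println(df.count())

    spark.stop()
  }
}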

solved: Scala Spark - overwrite parquet file failed to delete file or dir

Anonymous (unverified), submitted 2019-12-03 01:19:01
Question: I'm trying to create Parquet files for several days locally. The first time I run the code, everything works fine. The second time it fails to delete a file. The third time it fails to delete another file. It's totally random which file cannot be deleted. The reason I need this to work is that I want to create Parquet files every day for the last seven days, so the Parquet files that are already there should be overwritten with the updated data. I use Project SDK 1.8 and Scala version 2.11.8. In addition to that, I use Spark version 2.0.2
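As a point of reference, the usual pattern for replacing an existing output directory is SaveMode.Overwrite. The sketch below assumes a local path and made-up daily data; the fix the thread eventually settled on may differ.

import org.apache.spark.sql.{SaveMode, SparkSession}

object DailyParquetWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-parquet")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for one day's records.
    val df = Seq((1, "2019-12-01"), (2, "2019-12-02")).toDF("id", "day")

    // Overwrite mode deletes and rewrites the target directory on every run.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("output/daily.parquet")

    spark.stop()
  }
}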

Zeppelin + Spark: Reading Parquet from S3 throws NoSuchMethodError: com.fasterxml.jackson

Anonymous (unverified), submitted 2019-12-03 01:06:02
Question: Using the Zeppelin 0.7.2 binaries from the main download, and Spark 2.1.0 with Hadoop 2.6, the following paragraph:

val df = spark.read.parquet(DATA_URL).filter(FILTER_STRING).na.fill("")

produces the following:

java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
  at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
  at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>

how to merge multiple parquet files to single parquet file using linux or hdfs command?

心已入冬, submitted 2019-12-03 00:58:22
I have multiple small Parquet files generated as the output of a Hive QL job, and I would like to merge the output files into a single Parquet file. What is the best way to do it using some HDFS or Linux commands? We used to merge text files using the cat command, but will this work for Parquet as well? Can we do it using HiveQL itself when writing output files, like how we do it using the repartition or coalesce method in Spark? giaosudau: According to https://issues.apache.org/jira/browse/PARQUET-460 you can now download the source code and compile parquet-tools, which has a built-in merge command. java -jar
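Besides parquet-tools, the Spark route the question mentions (repartition/coalesce) looks roughly like the sketch below; the HDFS paths are hypothetical. coalesce(1) funnels all rows through a single task, so it only suits outputs small enough for one executor, whereas parquet-tools merge reportedly concatenates row groups without rewriting them.

import org.apache.spark.sql.SparkSession

object MergeParquetFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-parquet")
      .getOrCreate()

    // Read every small part file under the input prefix,
    // then write the data back out as a single file.
    spark.read.parquet("hdfs:///path/to/small-files/")
      .coalesce(1)
      .write
      .parquet("hdfs:///path/to/merged/")

    spark.stop()
  }
}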

How to write dataframe (obtained from hive table) into hadoop SequenceFile and RCFile?

Anonymous (unverified), submitted 2019-12-03 00:50:01
Question: I am able to write it into ORC and PARQUET directly, and into TEXTFILE and AVRO using additional dependencies from Databricks.

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.5.0</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>2.0.1</version>
</dependency>

Sample code:

SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(hiveTableName);
df.
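Since the excerpt is cut off, here is a rough sketch of one way to produce both formats, using the SparkSession API in Scala rather than the question's Java HiveContext; the table names and the key/value encoding chosen for the SequenceFile are assumptions.

import org.apache.spark.sql.SparkSession

object HiveTableToSeqAndRc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-seq-rc")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("my_hive_table")  // hypothetical source table

    // SequenceFile: Spark writes SequenceFiles from pair RDDs,
    // so turn each Row into a (key, value) pair first.
    df.rdd
      .map(row => (row.hashCode.toLong, row.mkString("\u0001")))
      .saveAsSequenceFile("/tmp/out_sequencefile")

    // RCFile: let Hive do the writing via a CTAS into an RCFILE table.
    df.createOrReplaceTempView("df_tmp")
    spark.sql("CREATE TABLE out_rcfile STORED AS RCFILE AS SELECT * FROM df_tmp")

    spark.stop()
  }
}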

Hive Basics (Part 5)

Anonymous (unverified), submitted 2019-12-03 00:13:02
Comparison of the mainstream Hive file storage formats. 1. Compression-ratio test of stored files. 1.1 Test data: https://github.com/liufengji/Compression_Format_Data (log.txt, 18.1 MB in size). 1.2 TextFile. Create a table whose storage format is TextFile:

create table log_text (
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;

Load data into the table:

load data local inpath '/home/hadoop/log.txt' into table log_text;

Check the size of the table's data:

dfs -du -h /user/hive/warehouse/log_text;
| DFS Output |

Parquet vs ORC vs ORC with Snappy

跟風遠走, submitted 2019-12-03 00:04:26
Question: I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet is better than ORC in time/space complexity, but my tests show the opposite of the documents I went through. Some details of my data follow:

Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

4. Spark SQL Data Sources

Anonymous (unverified), submitted 2019-12-02 23:52:01
Spark SQL's DataFrame interface supports operating on a variety of data sources. A DataFrame can be manipulated with RDD-style operations and can also be registered as a temporary table; once registered, SQL queries can be run against it. Spark SQL's default data source is Parquet, and when the data source is a Parquet file Spark SQL can conveniently perform all of its operations on it. The default data source format can be changed by modifying the configuration option spark.sql.sources.default.

val df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name","favorite_color").write.save("namesAndFavColors.parquet")

When the data source is not in Parquet format, the format must be specified manually. The data source format needs to be given by its full name (for example org.apache.spark.sql.parquet); for built-in formats it is enough to give the short name: json, parquet, jdbc, orc, libsvm, csv, text. Data can be loaded generically with the read.load method provided by SparkSession, and saved with write and save.

val peopleDF =
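As a small illustration of the manual format specification described above, a sketch using Spark's bundled example files (the people.json path and the output directories are assumptions):

import org.apache.spark.sql.SparkSession

object DataSourceFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datasource-formats")
      .getOrCreate()

    // Parquet is the default source, so no format() call is needed.
    val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
    usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

    // For any other format, name the source explicitly,
    // either by short name or by full class name.
    val peopleDF = spark.read.format("json")
      .load("examples/src/main/resources/people.json")
    peopleDF.write.format("parquet").save("people_as_parquet")

    spark.stop()
  }
}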