parquet

HIVE STORED & Row format (Part 4)

Submitted by 本小妞迷上赌 on 2019-11-28 10:01:06
HIVE STORED & Row format (Part 4). Reposted from https://blog.csdn.net/mhtian2015/article/details/78873815. Hive table data is stored on a file system, so a file storage format is needed to standardize how the data is laid out so that Hive can write and read it. Hive ships with several built-in storage formats and also supports user-defined ones. Storage is described by two closely related parts, file_format and row_format, which appear in the CREATE TABLE statement as follows:

[ROW FORMAT row_format]
[STORED AS file_format
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
]

row_format:
  DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
    [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char]
    [LINES TERMINATED BY char]
    [NULL DEFINED
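To make the grammar concrete, here is a minimal sketch issued through PySpark's spark.sql; the table name, columns, and delimiter characters are illustrative only and are not taken from the original post:

from pyspark.sql import SparkSession

# A Hive-enabled Spark session; everything named below is a placeholder.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_delimited (
        id INT,
        name STRING,
        tags ARRAY<STRING>
    )
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '|'
    STORED AS TEXTFILE
""")

Here ROW FORMAT describes how a single row is split into fields, while STORED AS picks the file_format (TEXTFILE in this sketch, PARQUET or others equally possible).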

Create Hive table to read parquet files from parquet/avro schema

Submitted by 大城市里の小女人 on 2019-11-28 09:30:21
We are looking for a solution to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema. In other words, how do we generate a Hive table from a Parquet/Avro schema? Thanks :)

Ram Manohar: Try the following using an Avro schema:

CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';

The same question is asked in Dynamically create Hive external table with

How to delete a particular month from a parquet file partitioned by month

Submitted by 落爺英雄遲暮 on 2019-11-28 06:53:14
Question: I have monthly revenue data for the last 5 years, and I store the DataFrames for the respective months in Parquet format in append mode, partitioned by the month column. Here is the pseudo-code:

def Revenue(filename):
    df = spark.read.load(filename)
    ...
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505
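The excerpt is cut off before any answer; purely as a hedged sketch, one common way to drop a single month is to remove that partition's directory, since partitionBy('month') places each month in its own month=... folder under /path/Revenue. The month value below is hypothetical:

import subprocess

# Each partition is a separate directory such as /path/Revenue/month=201504.
# Removing that directory removes only that month's data.
subprocess.run(
    ["hdfs", "dfs", "-rm", "-r", "/path/Revenue/month=201504"],
    check=True,
)

Subsequent reads of /path/Revenue simply no longer see the deleted month; the remaining partitions do not need to be rewritten.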

Schema evolution in parquet format

Submitted by 倖福魔咒の on 2019-11-28 06:49:57
Currently we use the Avro data format in production. Among Avro's several good points, we know that it handles schema evolution well. Now we are evaluating the Parquet format because of its efficiency when reading a subset of columns. Before moving forward, our concern is still schema evolution. Does anyone know whether schema evolution is possible in Parquet? If yes, how is it possible; if no, why not? Some resources claim that it is possible but that columns can only be added at the end. What does this mean? Schema evolution can be (very) expensive. In order to figure out the schema, you basically have to
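As one concrete illustration of that trade-off, Spark keeps Parquet schema merging opt-in; a minimal sketch (the output path is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two writes whose schemas differ only by an added column.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("append").parquet("/tmp/events")
spark.createDataFrame([(2, "b", 3.5)], ["id", "name", "score"]) \
    .write.mode("append").parquet("/tmp/events")

# mergeSchema asks Spark to reconcile the footers of all files instead of
# trusting a single one -- the potentially expensive step described above.
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/events")
merged.printSchema()   # id, name, score; older rows read back with score = null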

Why can't Impala read parquet files after Spark SQL's write?

Submitted by ╄→гoц情女王★ on 2019-11-28 06:35:58
Question: I am having some issues with the way Spark interprets columns for Parquet. I have an Oracle source with a confirmed schema (df.schema() method):

root
 |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
 |-- LM_BIRTHDATE: timestamp (nullable = true)
 |-- LM_COMM_METHOD: string (nullable = true)
 |-- LM_SOURCE_IND: string (nullable = true)
 |-- DATASET_ID: decimal(38,0) (nullable = true)
 |-- RECORD_ID: decimal(38,0) (nullable = true)

This is then saved as Parquet - df.write().parquet() method -
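The excerpt ends before the resolution; one frequently cited workaround when older readers such as Impala cannot consume Spark-written decimal columns is Spark's legacy Parquet encoding. This is only an assumption about the cause here, sketched in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark encodes small decimals as INT32/INT64 by default, while some older
# readers expect the Hive-style fixed-length byte-array encoding; the legacy
# flag restores the latter. "df" stands for the DataFrame loaded from Oracle above.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
# df.write.parquet("/path/out")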

Reading DataFrame from partitioned parquet file

Submitted by 你离开我真会死。 on 2019-11-28 05:27:05
How do I read partitioned Parquet with a condition as a DataFrame? This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put * it gives me all 30 days of data and it is too big.

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and
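Continuing the answer's point in PySpark terms (assuming a SparkSession named spark; the paths mirror the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base = "file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10"

# Option 1: list the partition directories explicitly; basePath keeps the
# "day" partition column visible in the resulting DataFrame.
df_a = spark.read.option("basePath", base).parquet(base + "/day=5", base + "/day=6")

# Option 2: read the base path and filter on the partition column;
# partition pruning stops Spark from scanning the other days' files.
df_b = spark.read.parquet(base).filter("day IN (5, 6)")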

What are the differences between feather and parquet?

Submitted by 喜夏-厌秋 on 2019-11-28 04:29:59
Both are columnar (on-disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do the two formats differ? Should you always prefer Feather when working with pandas, whenever possible? What are the use cases where Feather is more suitable than Parquet, and the other way round? Appendix: I found some hints here https://github.com/wesm/feather/issues/188, but given the young age of this project, it is possibly a bit out of date. Not a serious speed test
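For context, both formats are one call away in pandas; a minimal sketch, assuming pandas with pyarrow installed and using throwaway file names:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Feather: essentially the Arrow in-memory layout written to disk (fast, minimal metadata).
df.to_feather("data.feather")

# Parquet: compressed and encoded for long-term, space-efficient storage.
df.to_parquet("data.parquet")

round_trip_feather = pd.read_feather("data.feather")
round_trip_parquet = pd.read_parquet("data.parquet")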

Avro vs. Parquet

Submitted by a 夏天 on 2019-11-28 03:18:04
I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all of the column data! Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain them to me in simple terms? If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g., job.setOutputFormatClass

Read parquet data from AWS s3 bucket

Submitted by *爱你&永不变心* on 2019-11-28 02:58:54
Question: I need to read Parquet data from AWS S3. If I use the AWS SDK for this, I can get an InputStream like this:

S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();

But the Apache Parquet reader only takes a local file, like this:

ParquetReader<Group> reader = ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
    .withConf(conf)
    .build();
reader.read()

So I don't know how to parse an input stream for Parquet
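The excerpt stops before an answer; as a hedged aside, the Hadoop Path passed to the reader is not restricted to local files, so an s3a:// URI should work once the S3A connector and credentials are configured. The same idea in PySpark looks roughly like this (bucket name, key path, and credential handling are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Requires hadoop-aws and the AWS SDK on the classpath; credentials can also
# come from the environment or an instance profile instead of being set here.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

df = spark.read.parquet("s3a://bucketName/path/to/parquet/")
df.show(5)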

Apache Drill has bad performance against SQL Server

Submitted by 别说谁变了你拦得住时间么 on 2019-11-28 02:42:15
Question: I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:

SELECT p.Product_Category, SUM(f.sales)
FROM facts f
JOIN Product p ON f.pkey = p.pkey
GROUP BY p.Product_Category

facts has about 422,000 rows and Product has 600 rows; the grouping comes back with 4 rows. First I tested this query on SQL Server and got a result back in about 150 ms. With Drill I first tried to connect directly to SQL Server and run the query, but that was