parquet

Cloudera 5.6: Parquet does not support date. See HIVE-6384

Anonymous (unverified), submitted 2019-12-03 08:48:34
Question: I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based off another table, but I am running into an error.

create table sfdc_opportunities_sandbox_parquet like sfdc_opportunities_sandbox STORED AS PARQUET

Error message: Parquet does not support date. See HIVE-6384

I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?

Answer 1: Apart from using another data type like TIMESTAMP or another storage format like ORC, there might …
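As a hedged illustration of the TIMESTAMP workaround (the answer above is cut off before any example), here is a minimal sketch that builds the Parquet table explicitly instead of using CREATE TABLE ... LIKE and casts a hypothetical DATE column to TIMESTAMP; it assumes a Spark session with Hive support is available, and the column names are placeholders rather than the real sfdc_opportunities_sandbox schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create the Parquet copy with the DATE column cast to TIMESTAMP
# (opportunity_id and close_date are illustrative column names).
spark.sql("""
    CREATE TABLE sfdc_opportunities_sandbox_parquet
    STORED AS PARQUET
    AS SELECT opportunity_id, CAST(close_date AS TIMESTAMP) AS close_date
    FROM sfdc_opportunities_sandbox
""")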

Spark Exception : Task failed while writing rows

Anonymous (unverified), submitted 2019-12-03 08:46:08
Question: I am reading text files and converting them to Parquet files using Spark code, but when I try to run the job I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 9, ukfhpdbivp12.uk.experian.local): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands …
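The traceback above is truncated, so the root cause is not visible; as a general, hedged sketch of the text-to-Parquet conversion being described (paths, delimiter, and column names are placeholders, not taken from the original job), declaring an explicit schema is one commonly suggested way to avoid type errors while rows are being written:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

# An explicit schema keeps malformed rows from being written with guessed types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

df = spark.read.csv("hdfs:///path/to/input", sep="|", schema=schema)
df.write.mode("overwrite").parquet("hdfs:///path/to/output")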

How can I write a parquet file using Spark (pyspark)?

Anonymous (unverified), submitted 2019-12-03 08:44:33
Question: I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet ")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.write.parquet(" …
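The AttributeError comes from the fact that sc.textFile returns an RDD, not a DataFrame, and only DataFrames have a write attribute. A minimal sketch of the usual fix, assuming Spark 2.x and that /temp/proto_temp.csv is a delimited text file (the reader options here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Protob Conversion to Parquet").getOrCreate()

# spark.read.csv returns a DataFrame, which does have .write
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)

df.write.parquet("/temp/proto_temp.parquet")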

Is gzipped Parquet file splittable in HDFS for Spark?

血红的双手。 submitted 2019-12-03 07:09:30
I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not, but maybe the internal file structure of Parquet makes it a totally different case for Parquet vs CSV?

Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used. This is mainly due to the design of Parquet files, which are divided into the following parts: Each Parquet file consists of …
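The answer above is cut off; as an illustrative, hedged sketch (the file path is a placeholder), the row groups that make GZIP-compressed Parquet splittable can be listed with pyarrow, since each row group can be handed to a separate task regardless of the compression codec inside its column chunks:

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")  # placeholder path

print("row groups:", pf.metadata.num_row_groups)
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    # Each row group reports its own row count and per-column compression codec.
    print(i, "rows:", rg.num_rows, "codec:", rg.column(0).compression)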

Inspect Parquet from command line

主宰稳场 submitted 2019-12-03 06:29:23
Question: How do I inspect the content of a Parquet file from the command line? The only option I see now is:

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to avoid creating the local file and to view the file content as JSON rather than the typeless text that parquet-tools prints. Is there an easy way?

Answer 1: I recommend just building and running the parquet-tools.jar for your Hadoop distribution. Check out the GitHub project: https://github.com/apache/parquet-mr …
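If building parquet-tools is not an option, one hedged alternative (assuming pyarrow and pandas are installed where the file can be read) is a small Python script that prints the first rows as JSON, similar in spirit to parquet-tools head:

import sys
import pyarrow.parquet as pq

# Usage: python dump_parquet.py <file.parquet>
table = pq.read_table(sys.argv[1])
# One JSON object per row for the first 10 rows.
print(table.to_pandas().head(10).to_json(orient="records", lines=True))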

How to convert a csv file to parquet

南笙酒味 submitted 2019-12-03 05:51:25
Question: I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?

Answer 1: Here is a sample piece of code which does it both ways.

Answer 2: You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill. In brief:

Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local

Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store …
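Both code samples above are truncated; as a hedged, minimal sketch of the same conversion done locally in Python (file paths are placeholders), pandas plus pyarrow covers the CSV-to-Parquet case:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the delimited file and write it back out as a Parquet file.
df = pd.read_csv("input.csv")          # placeholder path
pq.write_table(pa.Table.from_pandas(df), "output.parquet")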

Using pyarrow how do you append to parquet file?

孤人 submitted 2019-12-03 05:51:16
Question: How do you append/update to a Parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5],
                       'nine': ['foo', 'bar', 'baz'],
                       'ten': [True, False, True]})

pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append pqTest2 here?

There is nothing I found in the docs about appending Parquet files. And, can …
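Parquet files are not designed for in-place appends. A commonly used pattern, sketched here under the assumption that every batch shares one schema (which table3 above does not), is to keep a single ParquetWriter open and call write_table several times, so each call adds row groups to the same file; otherwise, write additional files into one dataset directory and read them together:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

t2 = pa.Table.from_pandas(pd.DataFrame({'one': [-1, np.nan, 2.5],
                                        'two': ['foo', 'bar', 'baz'],
                                        'three': [True, False, True]}))

writer = pq.ParquetWriter('./dataNew/pqTest2.parquet', t2.schema)
writer.write_table(t2)
writer.write_table(t2)  # "append" another batch with the same schema
writer.close()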

Why does Apache Spark read unnecessary Parquet columns within nested structures?

有些话、适合烂在心里 submitted 2019-12-03 05:37:12
My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Create a schema with nested complex structures
val schema = StructType(Seq(
  StructField("F1", IntegerType), …
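The POC above is cut off before the nested fields are defined; as a hedged way to see which Parquet columns Spark actually plans to read (written in pyspark for brevity, with a placeholder path and a hypothetical nested struct F2 containing a field F21), the ReadSchema entry in the physical plan is the usual thing to check:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes Parquet data with a nested struct column F2 was already written here.
df = spark.read.parquet("/tmp/nested_poc.parquet")

# ReadSchema in the printed plan shows the columns Spark will read; without
# nested-schema pruning it can list the entire struct F2 rather than just F2.F21.
df.select("F1", "F2.F21").explain(True)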

Unable to infer schema when loading Parquet file

守給你的承諾、 submitted 2019-12-03 04:39:10
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid, {response} as response
                     from outcomes
                     where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")  # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

in /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) …
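The traceback above is truncated. One hedged workaround, which simply follows the exception's own suggestion of specifying the schema manually (reusing the schema that was printed after the successful write), looks like this:

from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True),
])

# Supplying the schema skips inference, which fails on empty or
# metadata-only Parquet directories.
outcome2 = sqlc.read.schema(schema).parquet(response)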

Read few parquet files at the same time in Spark

℡╲_俬逩灬. submitted 2019-12-03 04:37:51
I can read a few JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for Parquet? The star doesn't work.

See this issue on the Spark JIRA. It is supported from 1.4 onwards. Without upgrading to 1.4, you could either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want and pass them to parquetFile (it accepts varargs). FYI, you can also read a subset of Parquet files using the wildcard symbol * …
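A hedged sketch of the approaches mentioned above (paths are placeholders), using the DataFrameReader API that replaced parquetFile in later Spark versions:

# Point at the top-level directory to load every Parquet file inside it:
df_all = sqlContext.read.parquet('/path/to/dir/')

# Or pass several explicit paths (it accepts varargs), for example after
# listing them with the HDFS API:
df_some = sqlContext.read.parquet('/path/to/dir/part-0.parquet',
                                  '/path/to/dir/part-1.parquet')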