parquet

Cloudera 5.6: Parquet does not support date. See HIVE-6384

Anonymous (unverified), submitted 2019-12-03 08:48:34
Question: I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based off another table, but I am running into an error.

create table sfdc_opportunities_sandbox_parquet like sfdc_opportunities_sandbox STORED AS PARQUET

Error message: Parquet does not support date. See HIVE-6384

I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?

Answer 1: Apart from using another data type like TIMESTAMP or another storage format like ORC, there might …
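As a hedged illustration of the TIMESTAMP workaround (the answer above is cut off before any example), here is a minimal sketch that builds the Parquet table explicitly instead of using CREATE TABLE ... LIKE and casts a hypothetical DATE column to TIMESTAMP; it assumes a Spark session with Hive support is available, and the column names are placeholders rather than the real sfdc_opportunities_sandbox schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create the Parquet copy with the DATE column cast to TIMESTAMP
# (opportunity_id and close_date are illustrative column names).
spark.sql("""
    CREATE TABLE sfdc_opportunities_sandbox_parquet
    STORED AS PARQUET
    AS SELECT opportunity_id, CAST(close_date AS TIMESTAMP) AS close_date
    FROM sfdc_opportunities_sandbox
""")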

Spark Exception : Task failed while writing rows

Anonymous (unverified), submitted 2019-12-03 08:46:08
Question: I am reading text files and converting them to Parquet files using Spark code, but when I try to run the job I get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 9, ukfhpdbivp12.uk.experian.local): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands …
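The traceback above is truncated, so the root cause is not visible; as a general, hedged sketch of the text-to-Parquet conversion being described (paths, delimiter, and column names are placeholders, not taken from the original job), declaring an explicit schema is one commonly suggested way to avoid type errors while rows are being written:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

# An explicit schema keeps malformed rows from being written with guessed types.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])

df = spark.read.csv("hdfs:///path/to/input", sep="|", schema=schema)
df.write.mode("overwrite").parquet("hdfs:///path/to/output")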

How can I write a parquet file using Spark (pyspark)?

Anonymous (unverified), submitted 2019-12-03 08:44:33
Question: I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext

sc = SparkContext("local", "Protob Conversion to Parquet ")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.write.parquet(" …
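The AttributeError comes from the fact that sc.textFile returns an RDD, not a DataFrame, and only DataFrames have a write attribute. A minimal sketch of the usual fix, assuming Spark 2.x and that /temp/proto_temp.csv is a delimited text file (the reader options here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Protob Conversion to Parquet").getOrCreate()

# spark.read.csv returns a DataFrame, which does have .write
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)

df.write.parquet("/temp/proto_temp.parquet")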

Is gzipped Parquet file splittable in HDFS for Spark?

血红的双手。 submitted 2019-12-03 07:09:30
I get confusing messages when searching and reading answers on the internet on this subject. Can anyone share their experience? I know for a fact that gzipped CSV is not, but maybe the internal file structure of Parquet makes it a totally different case for Parquet vs CSV?

Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used. This is mainly due to the design of Parquet files, which are divided into the following parts: Each Parquet file consists of …
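The answer above is cut off; as an illustrative, hedged sketch (the file path is a placeholder), the row groups that make GZIP-compressed Parquet splittable can be listed with pyarrow, since each row group can be handed to a separate task regardless of the compression codec inside its column chunks:

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")  # placeholder path

print("row groups:", pf.metadata.num_row_groups)
for i in range(pf.metadata.num_row_groups):
    rg = pf.metadata.row_group(i)
    # Each row group reports its own row count and per-column compression codec.
    print(i, "rows:", rg.num_rows, "codec:", rg.column(0).compression)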

Inspect Parquet from command line

主宰稳场 submitted 2019-12-03 06:29:23
Question: How do I inspect the content of a Parquet file from the command line? The only option I see now is:

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to avoid creating the local file and to view the file content as JSON rather than the typeless text that parquet-tools prints. Is there an easy way?

Answer 1: I recommend just building and running the parquet-tools.jar for your Hadoop distribution. Check out the GitHub project: https://github.com/apache/parquet-mr …
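If building parquet-tools is not an option, one hedged alternative (assuming pyarrow and pandas are installed where the file can be read) is a small Python script that prints the first rows as JSON, similar in spirit to parquet-tools head:

import sys
import pyarrow.parquet as pq

# Usage: python dump_parquet.py <file.parquet>
table = pq.read_table(sys.argv[1])
# One JSON object per row for the first 10 rows.
print(table.to_pandas().head(10).to_json(orient="records", lines=True))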

How to convert a csv file to parquet

南笙酒味 submitted 2019-12-03 05:51:25
Question: I'm new to Big Data. I need to convert a CSV/TXT file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?

Answer 1: Here is a sample piece of code which does it both ways.

Answer 2: You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill. In brief:

Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local

Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store …
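Both code samples above are truncated; as a hedged, minimal sketch of the same conversion done locally in Python (file paths are placeholders), pandas plus pyarrow covers the CSV-to-Parquet case:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the delimited file and write it back out as a Parquet file.
df = pd.read_csv("input.csv")          # placeholder path
pq.write_table(pa.Table.from_pandas(df), "output.parquet")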

Using pyarrow how do you append to parquet file?

孤人 submitted 2019-12-03 05:51:16
Question: How do you append/update to a Parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5],
                       'nine': ['foo', 'bar', 'baz'],
                       'ten': [True, False, True]})

pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append pqTest2 here?

There is nothing I found in the docs about appending Parquet files. And, can …
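Parquet files are not designed for in-place appends. A commonly used pattern, sketched here under the assumption that every batch shares one schema (which table3 above does not), is to keep a single ParquetWriter open and call write_table several times, so each call adds row groups to the same file; otherwise, write additional files into one dataset directory and read them together:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

t2 = pa.Table.from_pandas(pd.DataFrame({'one': [-1, np.nan, 2.5],
                                        'two': ['foo', 'bar', 'baz'],
                                        'three': [True, False, True]}))

writer = pq.ParquetWriter('./dataNew/pqTest2.parquet', t2.schema)
writer.write_table(t2)
writer.write_table(t2)  # "append" another batch with the same schema
writer.close()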

Why does Apache Spark read unnecessary Parquet columns within nested structures?

有些话、适合烂在心里 submitted 2019-12-03 05:37:12
My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Create a schema with nested complex structures
val schema = StructType(Seq(
  StructField("F1", IntegerType), …
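The POC above is cut off before the nested fields are defined; as a hedged way to see which Parquet columns Spark actually plans to read (written in pyspark for brevity, with a placeholder path and a hypothetical nested struct F2 containing a field F21), the ReadSchema entry in the physical plan is the usual thing to check:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes Parquet data with a nested struct column F2 was already written here.
df = spark.read.parquet("/tmp/nested_poc.parquet")

# ReadSchema in the printed plan shows the columns Spark will read; without
# nested-schema pruning it can list the entire struct F2 rather than just F2.F21.
df.select("F1", "F2.F21").explain(True)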

Unable to infer schema when loading Parquet file

守給你的承諾、 submitted 2019-12-03 04:39:10
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid, {response} as response
                     from outcomes
                     where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")  # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

in /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) …
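The traceback above is truncated. One hedged workaround, which simply follows the exception's own suggestion of specifying the schema manually (reusing the schema that was printed after the successful write), looks like this:

from pyspark.sql.types import StructType, StructField, IntegerType, ShortType

schema = StructType([
    StructField("eid", IntegerType(), True),
    StructField("response", ShortType(), True),
])

# Supplying the schema skips inference, which fails on empty or
# metadata-only Parquet directories.
outcome2 = sqlc.read.schema(schema).parquet(response)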

Read few parquet files at the same time in Spark

℡╲_俬逩灬. submitted 2019-12-03 04:37:51
I can read a few JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for Parquet? The star doesn't work.

See this issue on the Spark JIRA. It is supported from 1.4 onwards. Without upgrading to 1.4, you could either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want and pass them to parquetFile (it accepts varargs). FYI, you can also read a subset of Parquet files using the wildcard symbol * …
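A hedged sketch of the approaches mentioned above (paths are placeholders), using the DataFrameReader API that replaced parquetFile in later Spark versions:

# Point at the top-level directory to load every Parquet file inside it:
df_all = sqlContext.read.parquet('/path/to/dir/')

# Or pass several explicit paths (it accepts varargs), for example after
# listing them with the HDFS API:
df_some = sqlContext.read.parquet('/path/to/dir/part-0.parquet',
                                  '/path/to/dir/part-1.parquet')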