parquet

Is it possible to read and write Parquet using Java without a dependency on Hadoop and HDFS?

故事扮演 submitted on 2019-12-21 15:12:10
Question: I've been hunting around for a solution to this question. It appears to me that there is no way to embed reading and writing of the Parquet format in a Java program without pulling in dependencies on HDFS and Hadoop. Is this correct? I want to read and write on a client machine, outside of a Hadoop cluster. I started to get excited about Apache Drill, but it appears that it must run as a separate process. What I need is an in-process ability to read and write a file using the Parquet format.

Answer 1:
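The answer body is truncated in this capture, so the following is only an illustrative sketch, not the thread's accepted answer: in-process Parquet I/O on a plain local file via the parquet-avro library. It assumes parquet-avro (and, transitively, hadoop-common, which supplies Path and Configuration) is on the classpath; no Hadoop cluster or HDFS is required. The file path and Avro schema are made up.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroParquetWriter}

object LocalParquetSketch {
  def main(args: Array[String]): Unit = {
    // Avro schema describing the records we will write (illustrative).
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}
        |]}""".stripMargin)

    // A plain local file, addressed through Hadoop's Path type but with no HDFS involved.
    val path = new Path("file:///tmp/users.parquet")

    // Write one record.
    val writer = AvroParquetWriter.builder[GenericRecord](path).withSchema(schema).build()
    val record = new GenericData.Record(schema)
    record.put("name", "alice")
    record.put("age", 30)
    writer.write(record)
    writer.close()

    // Read it back.
    val reader = AvroParquetReader.builder[GenericRecord](path).build()
    Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
    reader.close()
  }
}
```

The point of the sketch is that only library jars are needed; the Hadoop classes still come from jars such as hadoop-common rather than from a running cluster.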

What does MSCK REPAIR TABLE do behind the scenes, and why is it so slow?

折月煮酒 submitted on 2019-12-21 11:12:07
Question: I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, it would seemingly be enough to run ls on the table's root folder (given that the table is partitioned by only one column) and collect all its partitions, clearly a sub-second operation. But in practice the operation can take a very long time to execute (or even time out when run on AWS Athena). So my question is: what does MSCK REPAIR TABLE actually do behind the scenes, and why? How does MSCK REPAIR TABLE
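The question body is cut off above. For illustration only (not taken from the original thread): the usual way around a slow MSCK REPAIR TABLE is to register the new partitions explicitly instead of asking the metastore to rediscover every partition by listing the table's storage. A hedged Spark sketch, with made-up database, table, bucket, and partition names:

```scala
import org.apache.spark.sql.SparkSession

object AddPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("add-partitions-sketch")
      .enableHiveSupport() // so DDL statements reach the Hive metastore
      .getOrCreate()

    // Spark equivalent of running MSCK REPAIR TABLE: rescans the table location.
    spark.sql("MSCK REPAIR TABLE mydb.events")

    // Often much faster when the new partitions are already known: add them explicitly,
    // avoiding a recursive listing of the whole table directory.
    spark.sql(
      """ALTER TABLE mydb.events ADD IF NOT EXISTS
        |PARTITION (dt='2019-12-21') LOCATION 's3://my-bucket/events/dt=2019-12-21'
        |""".stripMargin)

    spark.stop()
  }
}
```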

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

痴心易碎 submitted on 2019-12-21 05:34:12
Question: I read a Parquet file from HDFS:

path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)
class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
collect(AppDF)
..... error: arguments imply differing number of rows: 46021, 39175, 62744, 27137
head(AppDF)
..... error: arguments imply differing number of rows: 36,
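The thread's answer is not captured here. One commonly suggested direction, offered only as a hedged hint: the schema above shows every column as binary, and collecting raw byte arrays into an R data.frame may be what trips SparkR up, so casting the binary columns to strings (or setting spark.sql.parquet.binaryAsString before reading) can help. The other snippets on this page use Scala, so here is what the analogous cast looks like in the Scala API; the path and column names simply mirror the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CastBinaryColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cast-binary-sketch").getOrCreate()

    val appDF = spark.read.parquet("hdfs://part_2015")

    // Cast each binary column back to string so rows collected on the driver contain
    // plain strings instead of raw byte arrays.
    val casted = Seq("app", "category", "date", "user")
      .foldLeft(appDF)((df, c) => df.withColumn(c, col(c).cast("string")))

    casted.show(5)
  }
}
```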

Read Parquet file stored in S3 with AWS Lambda (Python 3)

人盡茶涼 submitted on 2019-12-21 05:07:13
Question: I am trying to load, process, and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3,

Learning Spark: Spark SQL

▼魔方 西西 submitted on 2019-12-21 04:31:43
1. Introduction

Spark SQL provides the following three major capabilities.
(1) Spark SQL can read data from a variety of structured data sources (e.g. JSON, Hive, Parquet).
(2) Spark SQL not only supports querying data with SQL statements inside a Spark program, it also lets external tools such as the business-intelligence software Tableau connect to Spark SQL through standard database connectors (JDBC/ODBC) and run queries.
(3) When Spark SQL is used inside a Spark program, it integrates SQL tightly with regular Python/Java/Scala code, including joining RDDs with SQL tables and exposing interfaces for user-defined SQL functions. This makes many tasks much easier to implement.

2. A basic Spark SQL example

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
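The snippet above is cut off mid-line. A minimal sketch of how such a basic example typically continues, assuming a local master and an illustrative Parquet path (these details are not from the original post), and keeping the post's Spark 1.x-style HiveContext API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkSQLBasicExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val hiveCtx = new HiveContext(sc) // requires the spark-hive dependency

    // (1) Read a structured data source (Parquet here) into a DataFrame.
    val events = hiveCtx.read.parquet("/tmp/events.parquet")
    events.registerTempTable("events")

    // (2)/(3) Query it with SQL and use the result from regular Scala code.
    val top = hiveCtx.sql(
      "SELECT app, COUNT(*) AS cnt FROM events GROUP BY app ORDER BY cnt DESC LIMIT 10")
    top.collect().foreach(println)

    sc.stop()
  }
}
```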

How to save a partitioned parquet file in Spark 2.1?

大城市里の小女人 submitted on 2019-12-21 04:01:00
Question: I am trying to test how to write data to HDFS 2.7 using Spark 2.1. My data is a simple sequence of dummy values and the output should be partitioned by the attributes id and key.

// Simple case class to cast the data
case class SimpleTest(id: String, value1: Int, value2: Float, key: Int)

// Actual data to be stored
val testData = Seq(
  SimpleTest("test", 12, 13.5.toFloat, 1),
  SimpleTest("test", 12, 13.5.toFloat, 2),
  SimpleTest("test", 12, 13.5.toFloat, 3),
  SimpleTest("simple", 12, 13.5.toFloat,
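The question body breaks off above. A minimal sketch of the partitioned write being attempted, assuming a spark-shell session (so a SparkSession named spark is available) and an illustrative output path:

```scala
import spark.implicits._ // `spark` is the SparkSession provided by spark-shell

case class SimpleTest(id: String, value1: Int, value2: Float, key: Int)

val testData = Seq(
  SimpleTest("test", 12, 13.5f, 1),
  SimpleTest("test", 12, 13.5f, 2),
  SimpleTest("simple", 12, 13.5f, 3)
)

testData.toDF()
  .write
  .mode("overwrite")
  .partitionBy("id", "key")            // creates id=.../key=... sub-directories
  .parquet("hdfs:///tmp/simple_test")  // output path is illustrative
```

partitionBy places the two partition columns in the directory layout rather than in the data files, which is what makes the output readable back as a partitioned table.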

Why does Apache Spark read unnecessary Parquet columns within nested structures?

巧了我就是萌 submitted on 2019-12-20 17:19:26
Question: My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes". But we're seeing unexpected columns being read for nested schema structures. To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Create a
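The POC snippet above is truncated. A hedged reconstruction of that kind of experiment for the Spark shell (field names and the output path are illustrative, not the original question's):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.struct

// Build a DataFrame with a nested struct column and persist it as Parquet.
val df = Seq((1, "a", 10.0), (2, "b", 20.0)).toDF("id", "name", "score")
  .select($"id", struct($"name", $"score").as("nested"))
df.write.mode("overwrite").parquet("/tmp/nested_poc")

// Select a single nested field and inspect the physical plan.
val nestedOnly = spark.read.parquet("/tmp/nested_poc").select("nested.name")
nestedOnly.explain()
// In the plan's ReadSchema, check whether Spark requests just nested.name or the whole
// `nested` struct; the latter indicates nested columns are not pruned down to the leaf.
```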