spark-dataframe

How to write a Dataset to an Excel file using the hadoop office library in Apache Spark (Java)

Submitted by 微笑、不失礼 on 2019-12-10 16:49:05
Question: Currently I am using com.crealytics.spark.excel to read Excel files, but with this library I can't write a Dataset to an Excel file. This link says that the hadoop office library (org.zuinnote.spark.office.excel) can both read and write Excel files. Please help me write a Dataset object to an Excel file in Spark (Java). Answer 1: You can use org.zuinnote.spark.office.excel for both reading and writing Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark
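
For illustration, a minimal Scala sketch of writing through that data source follows. It assumes the spark-hadoopoffice-ds package is on the classpath so the format name org.zuinnote.spark.office.excel resolves; the option key and output path are placeholders, so check the project documentation for the exact configuration names.

    import org.apache.spark.sql.SparkSession

    object ExcelWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("excel-write-sketch").getOrCreate()
        import spark.implicits._

        // Small example Dataset to write out.
        val df = Seq(("Name1", 10), ("Name2", 20)).toDF("item", "qty")

        df.write
          .format("org.zuinnote.spark.office.excel")   // HadoopOffice data source
          .option("write.locale.bcp47", "en")          // hypothetical option key
          .save("/tmp/excel-output")                   // placeholder output path

        spark.stop()
      }
    }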

Can I use log4j2.xml in my Apache Spark application

Submitted by 牧云@^-^@ on 2019-12-10 16:44:11
Question: We are trying to use log4j2.xml instead of log4j.properties in an Apache Spark application. We integrated log4j2.xml, but the application fails to write the worker logs, while the driver log is written without problems. Can anyone suggest how to integrate log4j2.xml into an Apache Spark application so that both worker and driver logs are written successfully? Thanks in advance. Source: https://stackoverflow.com/questions/37966044/can-i-use-log4j2-xml-in-my-apache-spark-application
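
For reference, a hedged spark-submit sketch that ships a log4j2.xml to both the driver and the executors and points Log4j 2 at it via its standard system property; the file path, class name, and jar name are placeholders and may need adjusting for your cluster manager.

    # Ship the config file and point Log4j 2 at it on both driver and executors.
    spark-submit \
      --files /path/to/log4j2.xml \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
      --class com.example.MyApp \
      my-app.jar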

Viewing internal Spark Dataframe contents

Submitted by 孤街醉人 on 2019-12-10 15:52:32
Question: When debugging a Spark program, I can pause the stack and inspect the frame to see all the metadata of a DataFrame: partition metadata such as input splits, logical plan metadata, underlying RDD metadata, and so on. But I cannot see the contents of the DataFrame itself; the DataFrame lives in another JVM, either on another node or on the same node (on a local training cluster). So my question: does anyone have a way, for troubleshooting purposes, to look at the contents of the DataFrame partitions?
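
One way to do this, as a sketch rather than a definitive answer: sample a few rows from each partition and bring them back to the driver, assuming df is the DataFrame under inspection and the data (or a limit() of it) is small enough to collect.

    // Peek at the first few rows of every partition of the DataFrame's RDD.
    val sample = df.rdd
      .mapPartitionsWithIndex { (idx, rows) =>
        rows.take(5).map(row => (idx, row.toString))   // 5 rows per partition
      }
      .collect()

    sample.foreach { case (idx, row) => println(s"partition $idx: $row") }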

Creating a Spark DataFrame from a single string

Submitted by 半腔热情 on 2019-12-10 14:58:08
Question: I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType), such that:

    String fizz = "buzz"

would result in a DataFrame whose .show() method looks like:

    +-----+
    | fizz|
    +-----+
    | buzz|
    +-----+

My best attempt thus far has been:

    val rawData = List("fizz")
    val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
    df.show()

But I get the following error:

    java.lang.ClassCastException: org.apache.spark.sql.types
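
A minimal sketch of one way to avoid the error, assuming a SparkSession (or an equivalent SQLContext with its implicits imported): wrap the string in a Seq and call toDF directly, instead of parallelizing a Seq that contains a List.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("single-string-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val fizz = "buzz"
    val df = Seq(fizz).toDF("fizz")   // one row, one StringType column named "fizz"
    df.show()

The attempt in the question parallelizes Seq(rawData), i.e. a collection whose single element is a List, so each row ends up holding an array rather than a plain string.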

SparkSQL DataFrame order by across partitions

Submitted by 假如想象 on 2019-12-10 14:24:51
Question: I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small but still partitioned. I would like to coalesce the resulting DataFrame and order the rows by a column. I tried

    DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
    result.toJSON().saveAsTextFile("output")

I also tried

    DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1")
    result.toJSON().saveAsTextFile("output")

The output file is ordered in chunks (i.e. the rows are sorted within each chunk but not across the whole file).
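
One approach, sketched under the assumption that the result fits comfortably in a single partition and that a SparkSession named spark is available: collapse to one partition first and then sort inside it, so the single output file is globally ordered.

    // Query and column name taken from the question above.
    val result = spark.sql("my sql")
      .repartition(1)                   // one partition...
      .sortWithinPartitions("col1")     // ...sorted as a whole

    result.toJSON.rdd.saveAsTextFile("output")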

How to write dataframe (obtained from hive table) into hadoop SequenceFile and RCFile?

Submitted by 此生再无相见时 on 2019-12-10 13:28:16
Question: I am able to write it into ORC and PARQUET directly, and into TEXTFILE and AVRO using additional dependencies from Databricks:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>

Sample code:

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    DataFrame df = hc.table(hiveTableName)
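
Sketched in Scala here rather than the question's Java, and only as one possible route: get a SequenceFile by dropping to the RDD API, and an RCFile by letting Hive perform the write. The table name, separator, and output paths are placeholders; hc and hiveTableName are taken from the question.

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    val df = hc.table(hiveTableName)

    // SequenceFile: encode each row as a Text value with a null key.
    df.rdd
      .map(row => (NullWritable.get(), new Text(row.mkString("\u0001"))))
      .saveAsHadoopFile("/tmp/seq-out",
        classOf[NullWritable], classOf[Text],
        classOf[SequenceFileOutputFormat[NullWritable, Text]])

    // RCFile: route the write through Hive DDL.
    df.registerTempTable("tmp_src")
    hc.sql("CREATE TABLE rc_copy STORED AS RCFILE AS SELECT * FROM tmp_src")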

DataFrame partitionBy on nested columns

Submitted by ぐ巨炮叔叔 on 2019-12-10 13:23:35
Question: I am trying to call partitionBy on a nested field, like below:

    val rawJson = sqlContext.read.json(filename)
    rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)

I get the error below when I run it. I do see 'name' listed as a field in the schema below. Is there a different format for specifying a column name that is nested?

    java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField
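
A commonly used workaround, sketched with the names from the question: partitionBy only accepts top-level columns, so promote the nested field to a top-level column before writing.

    import org.apache.spark.sql.functions.col

    val rawJson = sqlContext.read.json(filename)

    rawJson
      .withColumn("name", col("data.dataDetails.name"))   // copy the nested field up
      .write
      .partitionBy("name")                                 // now a top-level column
      .parquet(filenameParquet)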

How can I group pairRDD by keys and turn the values into RDD

Submitted by 核能气质少年 on 2019-12-10 11:48:15
Question: What I have is an RDD[(String, Int)] and I need to convert it into a Map[String, RDD[Int]]. For example, my input looks like this:

    RDD[("a", 1), ("a", 2), ("b", 1), ("c", 3)]

And the output I'm trying to get is:

    Map["a" -> RDD[1, 2], "b" -> RDD[1], "c" -> RDD[3]]

Thanks in advance! Source: https://stackoverflow.com/questions/48402479/how-can-i-group-pairrdd-by-keys-and-turn-the-values-into-rdd
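
There is no built-in transformation that yields nested RDDs, so one sketch (reasonable only for a small number of keys, and assuming a SparkContext named sc) is to collect the distinct keys and filter the pair RDD once per key; often a plain groupByKey into Map[String, Iterable[Int]] is the better fit.

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] =
      sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("c", 3)))

    // One filter pass per distinct key -> Map[String, RDD[Int]].
    val keys = pairs.keys.distinct().collect()
    val byKey: Map[String, RDD[Int]] =
      keys.map(k => k -> pairs.filter { case (key, _) => key == k }.values).toMap

    // Usually preferable: materialize the groups instead of keeping sub-RDDs.
    val grouped: Map[String, Iterable[Int]] = pairs.groupByKey().collectAsMap().toMap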

How to do custom partition in spark dataframe with saveAsTextFile

Submitted by 时间秒杀一切 on 2019-12-10 10:59:31
Question: I have created data in Spark and then performed a join operation; finally I have to save the output to partitioned files. I am converting the data frame into an RDD and then saving it as a text file, which allows me to use a multi-character delimiter. My question is how to use dataframe columns as a custom partition in this case. I cannot use the option below for custom partitioning because it does not support a multi-character delimiter:

    dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
      .format("csv")
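
One way to keep partitionBy while still getting a multi-character delimiter, sketched with placeholder names for the non-partition columns: concatenate the data columns into a single string column with concat_ws and write it with the text data source.

    import org.apache.spark.sql.functions.{col, concat_ws}

    // Partition columns kept as-is; everything else collapsed into one string.
    val out = dfMainOutput.select(
      col("DataPartiotion"),
      col("StatementTypeCode"),
      concat_ws("|^|", col("colA"), col("colB"), col("colC")).as("value")
    )

    out.write
      .partitionBy("DataPartiotion", "StatementTypeCode")
      .text("/tmp/output")   // the text sink expects a single remaining string column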

Avoid losing data type for the partitioned data when writing from Spark

Submitted by 丶灬走出姿态 on 2019-12-10 10:59:27
Question: I have a dataframe like below:

    itemName, itemCategory
    Name1, C0
    Name2, C1
    Name3, C0

I would like to save this dataframe as a partitioned parquet file:

    df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have the String data type. However, at times I have a dataframe from other tenants like below:

    itemName, itemCategory
    Name1, 0
    Name2, 1
    Name3, 0

In this case, after being written as partitioned data and read back, the itemCategory values come back with an inferred numeric type instead of String.
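
One possible remedy, as a sketch assuming a SparkSession named spark: disable partition-column type inference before reading, so the partition values always come back as strings.

    // Keep partition values such as "0"/"1" and "C0"/"C1" as StringType on read.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val readBack = spark.read.parquet(path)
    readBack.printSchema()   // itemCategory stays a string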