spark-dataframe

How to write a Dataset to an Excel file using the hadoop office library in Apache Spark (Java)

Submitted by 微笑、不失礼 on 2019-12-10 16:49:05
Question: Currently I am using com.crealytics.spark.excel to read Excel files, but with this library I can't write a Dataset to an Excel file. This link says that the hadoop office library (org.zuinnote.spark.office.excel) can both read and write Excel files. Please help me write a Dataset object to an Excel file in Spark (Java). Answer 1: You can use org.zuinnote.spark.office.excel for both reading and writing Excel files with a Dataset. Examples are given at https://github.com/ZuInnoTe/spark
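
For illustration, a minimal Scala sketch of writing through that data source follows. It assumes the spark-hadoopoffice-ds package is on the classpath so the format name org.zuinnote.spark.office.excel resolves; the option key and output path are placeholders, so check the project documentation for the exact configuration names.

    import org.apache.spark.sql.SparkSession

    object ExcelWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("excel-write-sketch").getOrCreate()
        import spark.implicits._

        // Small example Dataset to write out.
        val df = Seq(("Name1", 10), ("Name2", 20)).toDF("item", "qty")

        df.write
          .format("org.zuinnote.spark.office.excel")   // HadoopOffice data source
          .option("write.locale.bcp47", "en")          // hypothetical option key
          .save("/tmp/excel-output")                   // placeholder output path

        spark.stop()
      }
    }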

Can I use log4j2.xml in my Apache Spark application

Submitted by 牧云@^-^@ on 2019-12-10 16:44:11
Question: We are trying to use log4j2.xml instead of log4j.properties in an Apache Spark application. We integrated log4j2.xml, but the application fails to write the worker logs, while the driver log is written without problems. Can anyone suggest how to integrate log4j2.xml into an Apache Spark application so that both worker and driver logs are written successfully? Thanks in advance. Source: https://stackoverflow.com/questions/37966044/can-i-use-log4j2-xml-in-my-apache-spark-application
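
For reference, a hedged spark-submit sketch that ships a log4j2.xml to both the driver and the executors and points Log4j 2 at it via its standard system property; the file path, class name, and jar name are placeholders and may need adjusting for your cluster manager.

    # Ship the config file and point Log4j 2 at it on both driver and executors.
    spark-submit \
      --files /path/to/log4j2.xml \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.xml" \
      --class com.example.MyApp \
      my-app.jar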

Viewing internal Spark Dataframe contents

Submitted by 孤街醉人 on 2019-12-10 15:52:32
Question: When debugging a Spark program, I can pause the stack and inspect the frame to see all the metadata of a DataFrame: partition metadata such as input splits, logical plan metadata, underlying RDD metadata, and so on. But I cannot see the contents of the DataFrame itself; the DataFrame lives in another JVM, either on another node or on the same node (on a local training cluster). So my question: does anyone have a way, for troubleshooting purposes, to look at the contents of the DataFrame partitions?
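
One way to do this, as a sketch rather than a definitive answer: sample a few rows from each partition and bring them back to the driver, assuming df is the DataFrame under inspection and the data (or a limit() of it) is small enough to collect.

    // Peek at the first few rows of every partition of the DataFrame's RDD.
    val sample = df.rdd
      .mapPartitionsWithIndex { (idx, rows) =>
        rows.take(5).map(row => (idx, row.toString))   // 5 rows per partition
      }
      .collect()

    sample.foreach { case (idx, row) => println(s"partition $idx: $row") }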

Creating a Spark DataFrame from a single string

Submitted by 半腔热情 on 2019-12-10 14:58:08
Question: I'm trying to take a hardcoded String and turn it into a 1-row Spark DataFrame (with a single column of type StringType), such that:

    String fizz = "buzz"

would result in a DataFrame whose .show() method looks like:

    +-----+
    | fizz|
    +-----+
    | buzz|
    +-----+

My best attempt thus far has been:

    val rawData = List("fizz")
    val df = sqlContext.sparkContext.parallelize(Seq(rawData)).toDF()
    df.show()

But I get the following error:

    java.lang.ClassCastException: org.apache.spark.sql.types
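
A minimal sketch of one way to avoid the error, assuming a SparkSession (or an equivalent SQLContext with its implicits imported): wrap the string in a Seq and call toDF directly, instead of parallelizing a Seq that contains a List.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("single-string-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val fizz = "buzz"
    val df = Seq(fizz).toDF("fizz")   // one row, one StringType column named "fizz"
    df.show()

The attempt in the question parallelizes Seq(rawData), i.e. a collection whose single element is a List, so each row ends up holding an array rather than a plain string.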

SparkSQL DataFrame order by across partitions

Submitted by 假如想象 on 2019-12-10 14:24:51
Question: I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small but still partitioned. I would like to coalesce the resulting DataFrame and order the rows by a column. I tried

    DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1")
    result.toJSON().saveAsTextFile("output")

I also tried

    DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1")
    result.toJSON().saveAsTextFile("output")

The output file is ordered in chunks (i.e. the rows are sorted within each chunk but not across the whole file).
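
One approach, sketched under the assumption that the result fits comfortably in a single partition and that a SparkSession named spark is available: collapse to one partition first and then sort inside it, so the single output file is globally ordered.

    // Query and column name taken from the question above.
    val result = spark.sql("my sql")
      .repartition(1)                   // one partition...
      .sortWithinPartitions("col1")     // ...sorted as a whole

    result.toJSON.rdd.saveAsTextFile("output")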

How to write dataframe (obtained from hive table) into hadoop SequenceFile and RCFile?

Submitted by 此生再无相见时 on 2019-12-10 13:28:16
Question: I am able to write it into ORC and PARQUET directly, and into TEXTFILE and AVRO using additional dependencies from Databricks:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.5.0</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.10</artifactId>
        <version>2.0.1</version>
    </dependency>

Sample code:

    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    DataFrame df = hc.table(hiveTableName)
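
Sketched in Scala here rather than the question's Java, and only as one possible route: get a SequenceFile by dropping to the RDD API, and an RCFile by letting Hive perform the write. The table name, separator, and output paths are placeholders; hc and hiveTableName are taken from the question.

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    val df = hc.table(hiveTableName)

    // SequenceFile: encode each row as a Text value with a null key.
    df.rdd
      .map(row => (NullWritable.get(), new Text(row.mkString("\u0001"))))
      .saveAsHadoopFile("/tmp/seq-out",
        classOf[NullWritable], classOf[Text],
        classOf[SequenceFileOutputFormat[NullWritable, Text]])

    // RCFile: route the write through Hive DDL.
    df.registerTempTable("tmp_src")
    hc.sql("CREATE TABLE rc_copy STORED AS RCFILE AS SELECT * FROM tmp_src")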

DataFrame partitionBy on nested columns

Submitted by ぐ巨炮叔叔 on 2019-12-10 13:23:35
Question: I am trying to call partitionBy on a nested field, like below:

    val rawJson = sqlContext.read.json(filename)
    rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)

I get the error below when I run it. I do see 'name' listed as a field in the schema below. Is there a different format for specifying a column name that is nested?

    java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField
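
A commonly used workaround, sketched with the names from the question: partitionBy only accepts top-level columns, so promote the nested field to a top-level column before writing.

    import org.apache.spark.sql.functions.col

    val rawJson = sqlContext.read.json(filename)

    rawJson
      .withColumn("name", col("data.dataDetails.name"))   // copy the nested field up
      .write
      .partitionBy("name")                                 // now a top-level column
      .parquet(filenameParquet)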

How can I group pairRDD by keys and turn the values into RDD

Submitted by 核能气质少年 on 2019-12-10 11:48:15
Question: What I have is an RDD[(String, Int)] and I need to convert it into a Map[String, RDD[Int]]. For example, my input looks like this:

    RDD[("a", 1), ("a", 2), ("b", 1), ("c", 3)]

And the output I'm trying to get is:

    Map["a" -> RDD[1, 2], "b" -> RDD[1], "c" -> RDD[3]]

Thanks in advance! Source: https://stackoverflow.com/questions/48402479/how-can-i-group-pairrdd-by-keys-and-turn-the-values-into-rdd
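
There is no built-in transformation that yields nested RDDs, so one sketch (reasonable only for a small number of keys, and assuming a SparkContext named sc) is to collect the distinct keys and filter the pair RDD once per key; often a plain groupByKey into Map[String, Iterable[Int]] is the better fit.

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] =
      sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("c", 3)))

    // One filter pass per distinct key -> Map[String, RDD[Int]].
    val keys = pairs.keys.distinct().collect()
    val byKey: Map[String, RDD[Int]] =
      keys.map(k => k -> pairs.filter { case (key, _) => key == k }.values).toMap

    // Usually preferable: materialize the groups instead of keeping sub-RDDs.
    val grouped: Map[String, Iterable[Int]] = pairs.groupByKey().collectAsMap().toMap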

How to do custom partition in spark dataframe with saveAsTextFile

Submitted by 时间秒杀一切 on 2019-12-10 10:59:31
Question: I have created data in Spark and then performed a join operation; finally I have to save the output to partitioned files. I am converting the data frame into an RDD and then saving it as a text file, which allows me to use a multi-character delimiter. My question is how to use dataframe columns as a custom partition in this case. I cannot use the option below for custom partitioning because it does not support a multi-character delimiter:

    dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
      .format("csv")
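
One way to keep partitionBy while still getting a multi-character delimiter, sketched with placeholder names for the non-partition columns: concatenate the data columns into a single string column with concat_ws and write it with the text data source.

    import org.apache.spark.sql.functions.{col, concat_ws}

    // Partition columns kept as-is; everything else collapsed into one string.
    val out = dfMainOutput.select(
      col("DataPartiotion"),
      col("StatementTypeCode"),
      concat_ws("|^|", col("colA"), col("colB"), col("colC")).as("value")
    )

    out.write
      .partitionBy("DataPartiotion", "StatementTypeCode")
      .text("/tmp/output")   // the text sink expects a single remaining string column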

Avoid losing data type for the partitioned data when writing from Spark

Submitted by 丶灬走出姿态 on 2019-12-10 10:59:27
Question: I have a dataframe like below:

    itemName, itemCategory
    Name1, C0
    Name2, C1
    Name3, C0

I would like to save this dataframe as a partitioned parquet file:

    df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have the String data type. However, at times I have a dataframe from other tenants like below:

    itemName, itemCategory
    Name1, 0
    Name2, 1
    Name3, 0

In this case, after being written as partitioned data and read back, the itemCategory values come back with an inferred numeric type instead of String.
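
One possible remedy, as a sketch assuming a SparkSession named spark: disable partition-column type inference before reading, so the partition values always come back as strings.

    // Keep partition values such as "0"/"1" and "C0"/"C1" as StringType on read.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

    val readBack = spark.read.parquet(path)
    readBack.printSchema()   // itemCategory stays a string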