RDD

Spark RDD to DataFrame python

Anonymous (unverified), submitted on 2019-12-03 01:25:01
Question: I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and examples where the schema is passed to the sqlContext.createDataFrame(rdd, schema) function. But I have 38 columns or fields, and this will increase further. If I manually give the schema specifying each field's information, it is going to be a very tedious job. Is there any other way to specify the schema without knowing the column information beforehand? Answer 1: See, there are two ways to convert an RDD to a DF in Spark: toDF() and createDataFrame(rdd, schema). I will…
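A minimal sketch of the toDF() route, written in Scala for consistency with the other snippets here (the PySpark call is analogous), assuming a Spark 2.x SparkSession and a made-up three-column RDD standing in for the 38-column case:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()
import spark.implicits._

// With an RDD of tuples (or case classes) the schema is inferred;
// only the column names need to be listed.
val rdd = spark.sparkContext.parallelize(Seq((1, "a", 2.0), (2, "b", 3.0)))
val df  = rdd.toDF("id", "name", "score")
df.printSchema()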

NullPointerException in Spark RDD map when submitted as a spark job

Anonymous (unverified), submitted on 2019-12-03 01:25:01
Question: We're trying to submit a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we're receiving a rather cryptic NPE in EMR. Everything runs just fine as a Scala program, so we're not really sure what's causing the issue. Here's the stack trace: 18:02:55,271 ERROR Utils:91 - Aborting task java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)…
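The question is truncated before any resolution; one common diagnostic step (an assumption on my part, not the accepted fix) is to disable whole-stage codegen, since the NPE originates inside generated GeneratedIterator code, and rerun the job to get a more readable stack trace:

import org.apache.spark.sql.SparkSession

// Diagnostic only: fall back to the interpreted execution path so the failing
// expression surfaces instead of the opaque generated-code frame.
val spark = SparkSession.builder()
  .appName("npe-diagnosis")
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()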

Saving RDD as sequence file in pyspark

Anonymous (unverified), submitted on 2019-12-03 01:23:02
Question: I am able to run this script to save the file in text format, but when I try to run saveAsSequenceFile it errors out. If anyone has an idea how to save the RDD as a sequence file, please let me know the process. I tried looking for a solution in "Learning Spark" as well as the official Spark documentation. This runs successfully: dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments") dataRDD.saveAsTextFile("/user/cloudera/pyspark/departments") This fails: dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments") dataRDD…
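A short Scala sketch of the usual fix (the question uses PySpark, but the constraint is the same): saveAsSequenceFile expects an RDD of key/value pairs, so map each text line into a tuple first. The key choice and the output path suffix are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("seqfile").setMaster("local[*]"))

val dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
dataRDD
  .map(line => (line.split(",")(0), line))   // key = first field, value = whole line
  .saveAsSequenceFile("/user/cloudera/pyspark/departments_seq")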

Write RDD as textfile using Apache Spark

Anonymous (unverified), submitted on 2019-12-03 01:23:02
Question: I am exploring Spark for batch processing. I am running Spark on my local machine in standalone mode. I am trying to write the Spark RDD to a single file [final output] using the saveAsTextFile() method, but it's not working. For example, if I have more than one partition, how can we get one single file as the final output? Update: I tried the approaches below, but I am getting a NullPointerException. person.coalesce(1).toJavaRDD().saveAsTextFile("C://Java_All//output"); person.repartition(1).toJavaRDD().saveAsTextFile("C://Java_All//output"); The…
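A Scala sketch of the single-file write (the question's code is Java; the API is the same): coalesce to one partition before saveAsTextFile so the output directory holds a single part file. The NullPointerException on a Windows path is often a missing winutils.exe / HADOOP_HOME setup rather than this code, but that is an assumption here:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("single-file").setMaster("local[*]"))
val person = sc.parallelize(Seq("alice", "bob", "carol"))   // stand-in for the question's RDD

person
  .coalesce(1)                           // collapse to a single partition
  .saveAsTextFile("C:/Java_All/output")  // directory will contain one part-00000 file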

Spark: unioning a lot of RDDs throws a stack overflow error

Anonymous (unverified), submitted on 2019-12-03 01:23:02
Question: When I use "++" to combine a lot of RDDs, I get a stack overflow error. Spark version 1.3.1. Environment: yarn-client, --driver-memory 8G. The number of RDDs is more than 4000, and each RDD is read from a text file about 1 GB in size. It is generated in this way: val collection = (for ( path … It works fine when the files are small. Here is the error; it repeats itself. I guess a recursive function is being called too many times? Exception at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.UnionRDD$…
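A sketch of the usual workaround (an assumption, since the question is truncated before any answer): chaining ++ nests UnionRDDs thousands of levels deep and the recursive dependency traversal overflows the stack, whereas SparkContext.union builds one flat UnionRDD over the whole sequence. The file paths below are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("flat-union").setMaster("local[*]"))

val paths: Seq[String]     = (1 to 4000).map(i => s"/data/part-$i")
val rdds: Seq[RDD[String]] = paths.map(sc.textFile(_))

// One flat union instead of rdds.reduce(_ ++ _):
val collection: RDD[String] = sc.union(rdds)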

Get CSV to Spark dataframe

Anonymous (unverified), submitted on 2019-12-03 01:23:02
Question: I'm using Python on Spark and would like to get a CSV into a dataframe. The documentation for Spark SQL strangely does not provide explanations for CSV as a source. I have found Spark-CSV; however, I have issues with two parts of the documentation: "This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3" Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very…
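For comparison, on Spark 2.x and later the CSV reader is built in, so no external spark-csv package (and no repeated --packages flag) is needed. A minimal Scala sketch with a hypothetical file path; the question targets PySpark, where spark.read.csv works the same way:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv").master("local[*]").getOrCreate()

val df = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // sample the file to guess column types
  .csv("/data/input.csv")

df.printSchema()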

What is the difference between cache and persist?

Anonymous (unverified), submitted on 2019-12-03 01:18:02
Question: In terms of RDD persistence, what are the differences between cache() and persist() in Spark? Answer 1: With cache(), you use only the default storage level MEMORY_ONLY. With persist(), you can specify which storage level you want (see rdd-persistence). From the official docs: You can mark an RDD to be persisted using the persist() or cache() methods on it. Each persisted RDD can be stored using a different storage level. The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store…
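A minimal illustration of the answer: for an RDD, cache() is just persist() with the default StorageLevel.MEMORY_ONLY, while persist(level) lets you pick any other storage level:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc  = new SparkContext(new SparkConf().setAppName("cache-vs-persist").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000)

rdd.cache()                                    // same as rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // explicit, non-default storage level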

Apache Spark: map vs mapPartitions?

Anonymous (unverified), submitted on 2019-12-03 01:14:02
Question: What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks. (edit) i.e. what is the difference (either semantically or in terms of execution) between def map[A, B](rdd: RDD[A], fn: (A => B))(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = { rdd.mapPartitions({ iter: Iterator[A] => for (i … And: def map[A, B](rdd: RDD[A], fn: (A => B))(implicit a: Manifest[A], b: Manifest[B]): RDD[B] = { rdd.map(fn…
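A short sketch of the semantic difference: map applies the function once per element, while mapPartitions applies it once per partition and receives the partition's iterator, which is where per-partition setup (such as opening a connection) would go. The counts in the comments assume the two-partition RDD below:

import org.apache.spark.{SparkConf, SparkContext}

val sc  = new SparkContext(new SparkConf().setAppName("map-vs-mapPartitions").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10, 2)           // 10 elements, 2 partitions

val perElement   = rdd.map(x => x * 2)         // the function runs 10 times, once per element
val perPartition = rdd.mapPartitions { iter => // the function runs 2 times, once per partition
  // expensive per-partition setup would go here
  iter.map(_ * 2)
}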

Spark RDD: correct date format in Scala?

Anonymous (unverified), submitted on 2019-12-03 01:06:02
Question: This is the date value I want to use when I convert an RDD to a DataFrame: Sun Jul 31 10:21:53 PDT 2016. The schema "DataTypes.DateType" throws an error: java.util.Date is not a valid external type for schema of date. So I want to prepare the RDD in advance in such a way that the above schema can work. How can I correct the date format so the conversion to a DataFrame works? //Schema for data frame val schema = StructType( StructField("lotStartDate", DateType, false) :: StructField("pm", StringType, false) :: StructField("wc", LongType, false) :: StructField(…
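One way to make the schema work, sketched under the assumption that converting to java.sql.Date inside the RDD is acceptable (DateType maps to java.sql.Date, not java.util.Date): parse the string before calling createDataFrame. The schema is trimmed to two of the question's fields and the sample row is made up:

import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("date-fix").master("local[*]").getOrCreate()

val schema = StructType(
  StructField("lotStartDate", DateType, false) ::
  StructField("pm", StringType, false) :: Nil)

val rows = spark.sparkContext
  .parallelize(Seq(("Sun Jul 31 10:21:53 PDT 2016", "pm-1")))
  .map { case (d, pm) =>
    val fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US)
    Row(new java.sql.Date(fmt.parse(d).getTime), pm)   // java.sql.Date, not java.util.Date
  }

val df = spark.createDataFrame(rows, schema)
df.show()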

Spark: How to split an RDD[T] into Seq[RDD[T]] and preserve the ordering

Anonymous (unverified), submitted on 2019-12-03 00:56:02
Question: How can I effectively split up an RDD[T] into a Seq[RDD[T]] / Iterable[RDD[T]] with n elements and preserve the original ordering? I would like to be able to write something like this: RDD(1, 2, 3, 4, 5, 6, 7, 8, 9).split(3), which should result in something like Seq(RDD(1, 2, 3), RDD(4, 5, 6), RDD(7, 8, 9)). Does Spark provide such a function? If not, what is a performant way to achieve this? val parts = rdd.length / n val rdds = rdd.zipWithIndex().map{ case (t, i) => (i - (i % parts), t)}.groupByKey().values.map(iter => sc.parallelize(iter…
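Spark has no built-in split(n). A sketch of one order-preserving alternative (an assumption, not taken from an answer to this question): index every element once with zipWithIndex, then build each piece by filtering a contiguous index range:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

val sc = new SparkContext(new SparkConf().setAppName("split-rdd").setMaster("local[*]"))

def split[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  val indexed = rdd.zipWithIndex().cache()            // (element, position), computed once
  val total   = indexed.count()
  val size    = math.ceil(total.toDouble / n).toLong  // elements per piece
  (0 until n).map { i =>
    indexed
      .filter { case (_, idx) => idx >= i * size && idx < (i + 1) * size }
      .map(_._1)                                      // drop the index; order is preserved
  }
}

val parts = split(sc.parallelize(1 to 9), 3)
parts.foreach(p => println(p.collect().mkString(",")))  // 1,2,3 / 4,5,6 / 7,8,9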