rdd

Insert a Spark DataFrame into HBase

折月煮酒 · submitted on 2019-12-13 00:18:31
Question: I have a DataFrame and I want to insert it into HBase. I am following this documentation. This is what my DataFrame looks like:

    ------------------------
    | id | name  | address |
    ------------------------
    | 23 | marry | france  |
    | 87 | zied  | italie  |
    ------------------------

I create an HBase table using this code:

    val tableName = "two"
    val conf = HBaseConfiguration.create()
    if (!admin.isTableAvailable(tableName)) {
      print("----------------------------------------------------------------------------
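Not part of the original question, but for orientation: when no dedicated connector is available, a DataFrame is often written to HBase by mapping each row to a `Put` and saving through `TableOutputFormat`. The sketch below is only an illustration under assumptions: the HBase table is named `two` with a single column family `cf`, the `id` column is used as the row key, and `df` is the DataFrame shown above.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Assumed layout: HBase table "two" with column family "cf"; "id" becomes the row key.
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "two")
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val puts = df.rdd.map { row =>
  val put = new Put(Bytes.toBytes(row.get(0).toString))                           // row key = id
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(row.get(1).toString))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("address"), Bytes.toBytes(row.get(2).toString))
  (new ImmutableBytesWritable, put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
```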

In Apache Spark, how to make an RDD/DataFrame operation lazy?

感情迁移 · submitted on 2019-12-12 18:36:10
Question: Assuming that I would like to write a function foo that transforms a DataFrame:

    object Foo {
      def foo(source: DataFrame): DataFrame = {
        ...complex iterative algorithm with a stopping condition...
      }
    }

Since the implementation of foo contains many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution. This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation
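One way to get the deferred behaviour (a sketch, not taken from the original thread) is to stop calling the expensive function eagerly and instead return a thunk, or wrap the call site in a lazy val, so the actions inside foo only run when the result is actually needed. The names fooDeferred and inputDf below are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object Foo {
  def foo(source: DataFrame): DataFrame = ???   // the original iterative algorithm with its actions

  // Hypothetical wrapper: nothing runs until the returned function is invoked.
  def fooDeferred(source: DataFrame): () => DataFrame =
    () => foo(source)
}

// Caller side: building the thunk is cheap...
val deferred = Foo.fooDeferred(inputDf)
// ...and the expensive work only happens when it is forced.
lazy val result: DataFrame = deferred()
```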

Is there a better way to do a reduce operation on RDD[Array[Double]]?

走远了吗. · submitted on 2019-12-12 18:22:03
Question: I want to reduce an RDD[Array[Double]] so that each element of an array is added to the corresponding element of the next array. For the moment I use this code:

    var rdd1 = RDD[Array[Double]]
    var coord = rdd1.reduce( (x, y) => { (x, y).zipped.map(_ + _) })

Is there a better way to make this more efficient, because it is costly.

Answer 1: Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles. If you use spire, you can just do this > import spire
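For comparison, a plain-Scala alternative in the same spirit (a sketch, not the quoted answer): summing into a primitive array with an imperative loop avoids the tuples and boxed Doubles; it assumes all arrays in the RDD have the same length.

```scala
val coord: Array[Double] = rdd1.reduce { (x, y) =>
  // Element-wise sum into a fresh primitive array: no tuples, no boxing.
  val out = new Array[Double](x.length)
  var i = 0
  while (i < x.length) {
    out(i) = x(i) + y(i)
    i += 1
  }
  out
}
```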

Filter from Cassandra table by RDD values

↘锁芯ラ · submitted on 2019-12-12 16:26:46
Question: I'd like to query some data from Cassandra based on values I have in an RDD. My approach is the following:

    val userIds = sc.textFile("/tmp/user_ids").keyBy( e => e )
    val t = sc.cassandraTable("keyspace", "users").select("userid", "user_name")
    val userNames = userIds.flatMap { userId =>
      t.where("userid = ?", userId).take(1)
    }
    userNames.take(1)

While the Cassandra query works in the Spark shell, it throws an exception when I use it inside flatMap:

    org.apache.spark.SparkException: Job aborted due
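The usual cause of this failure is that one RDD (the Cassandra table scan) is referenced inside another RDD's closure, which Spark cannot serialize or run on the workers. A common alternative, sketched below under the assumption that the DataStax spark-cassandra-connector is on the classpath and that userid is the table's partition key, is joinWithCassandraTable, which performs the per-key lookups inside the connector:

```scala
import com.datastax.spark.connector._

// Wrap each id in a Tuple1 so it can be matched against the table's partition key.
val userIds = sc.textFile("/tmp/user_ids").map(Tuple1(_))

val userNames = userIds
  .joinWithCassandraTable("keyspace", "users")
  .select("userid", "user_name")

userNames.take(1)
```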

How to iterate over large Cassandra table in small chunks in Spark

妖精的绣舞 · submitted on 2019-12-12 15:52:12
Question: In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each roughly 20-50 KB.

    CREATE TABLE foo (
      uid timeuuid,
      events blob,
      PRIMARY KEY ((uid))
    )

Here is the Scala code that is executed on the Spark cluster:

    val rdd = sc.cassandraTable("test", "foo")
    // This pulls records in memory, taking ~6.3GB
    var count = rdd.select("events").count()
    // Fails nearly immediately with
    // NoHostAvailableException: All host(s)
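One commonly suggested direction (a sketch, not the accepted answer; the property names below belong to the DataStax spark-cassandra-connector and should be verified against the connector version in use) is to make each Spark partition fetch a smaller chunk by shrinking the input split size and the per-request fetch size, then process rows partition by partition instead of materializing the table:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "cassandra-host")     // hypothetical host name
  .set("spark.cassandra.input.split.size_in_mb", "16")          // smaller splits -> more, lighter partitions
  .set("spark.cassandra.input.fetch.size_in_rows", "100")       // rows fetched per round trip

val sc = new SparkContext(conf)

sc.cassandraTable("test", "foo")
  .select("events")
  .foreachPartition { rows =>
    rows.foreach { row =>
      // process one row at a time; nothing is collected back to the driver
    }
  }
```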

How to filter an RDD according to a function based on another RDD in Spark?

╄→尐↘猪︶ㄣ · submitted on 2019-12-12 15:30:36
Question: I am a beginner with Apache Spark. I want to filter out all groups whose sum of weights is larger than a constant value in an RDD. The "weight" map is also an RDD. Here is a small demo; the groups to be filtered are stored in "groups", and the constant value is 12:

    val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
    val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
    val wm = weights.toArray.toMap
    def isheavy(inp: String):
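A common way to finish this (a sketch, not the original answer): collect the small weights RDD on the driver and broadcast it, so the filter closure only captures a plain Map instead of another RDD:

```scala
// Broadcast the small lookup table rather than referring to the weights RDD inside a closure.
val wm = sc.broadcast(weights.collectAsMap())

def isHeavy(inp: String): Boolean =
  inp.split(",").map(wm.value.getOrElse(_, 0)).sum > 12

// Keep only the groups whose summed weight does not exceed 12.
val light = groups.filter(g => !isHeavy(g))
light.collect().foreach(println)
```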

Reading a table into an RDD with Spark and Phoenix: partition number is 1

久未见 · submitted on 2019-12-12 12:57:13
Question: When I run my Spark code:

    val sqlContext = spark.sqlContext
    val noact_table = primaryDataProcessor.getTableData(sqlContext, zookeeper, tableName)
    println("noact_table.rdd:" + noact_table.rdd.partitions.size)
    val tmp = noact_table.rdd
    println(tmp.partitions.size)
    val out = tmp.map(x => x(0) + "," + x(1))
    HdfsOperator.writeHdfsFile(out, "/tmp/test/push")

getTableData:

    def getTableData(sqlContext: SQLContext, zkUrl: String, tableName: String): DataFrame = {
      val tableData = sqlContext.read.format(
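If the source only hands back a single input partition, one pragmatic workaround (a sketch continuing the code above, not necessarily the accepted fix; the partition count 32 is arbitrary) is to repartition right after the read so the downstream map and write are spread across the cluster:

```scala
// Continuing from the code above: spread the single input partition across (say) 32 partitions.
// repartition always shuffles; coalesce can only reduce the partition count, so it does not help here.
val tmp = noact_table.rdd.repartition(32)
println(tmp.partitions.size)   // now prints 32

val out = tmp.map(x => x(0) + "," + x(1))
HdfsOperator.writeHdfsFile(out, "/tmp/test/push")
```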

Apache Spark Method returning an RDD (with Tail Recursion)

…衆ロ難τιáo~ · submitted on 2019-12-12 11:32:05
Question: An RDD has a lineage and therefore does not exist until an action is performed on it; so, if I have a method that performs numerous transformations on an RDD and returns the transformed RDD, what am I actually returning? Am I returning nothing until that RDD is required for an action? If I cache an RDD in the method, does it persist in the cache? I think I know the answer: the method will only be run when an action is called on the RDD that is returned? But I could be
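To make this concrete, here is a small sketch (not from the original post) of a tail-recursive method that only chains transformations: the method itself returns immediately with an RDD plus its lineage, nothing executes until the caller runs an action, and a cache() placed on the result is populated by that first action.

```scala
import org.apache.spark.rdd.RDD
import scala.annotation.tailrec

// Chains n map() transformations; no Spark job is launched inside this method.
@tailrec
def addN(rdd: RDD[Int], n: Int): RDD[Int] =
  if (n <= 0) rdd else addN(rdd.map(_ + 1), n - 1)

val chained = addN(sc.parallelize(1 to 10), 3).cache()  // still just a plan: an RDD plus its lineage

val total = chained.count()   // first action: executes the whole lineage and fills the cache
val added = chained.sum()     // second action: served from the cache, the maps are not recomputed
```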

Difference between RDDs and Batches in Spark?

大城市里の小女人 · submitted on 2019-12-12 11:32:00
Question: An RDD is a collection of elements partitioned across the nodes of the cluster; it is Spark's core component and abstraction. Batches: the Spark Streaming API simply divides the data into batches, and those batches are likewise collections of streaming objects/elements. Depending on the requirement, a set of batches is defined in the form of a time-based batch window or an intensive-online-activity-based batch window. What is the difference between RDDs and batches exactly?

Answer 1: RDDs and batches are essentially different but related
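A short sketch (not from the quoted answer) that shows the relationship directly: in Spark Streaming a DStream chops the input into batches by time interval, and each batch is handed to user code as an ordinary RDD.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// Every batch interval produces one RDD; a DStream is essentially a sequence of such RDDs.
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()
```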

Fetching esJsonRDD from elasticsearch with complex filtering in Spark

大城市里の小女人 · submitted on 2019-12-12 10:23:19
Question: I am currently fetching the Elasticsearch RDD in our Spark job, filtering based on a one-line Elasticsearch query like this (example):

    val elasticRdds = sparkContext.esJsonRDD(esIndex, s"?default_operator=AND&q=director.name:DAVID + \n movie.name:SEVEN")

Now, what if our search query becomes complex, like:

    {
      "query": {
        "filtered": {
          "query": {
            "query_string": {
              "default_operator": "AND",
              "query": "director.name:DAVID + \n movie.name:SEVEN"
            }
          },
          "filter": {
            "nested": {
              "path": "movieStatus.boxoffice.status",
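For reference (a sketch, not the accepted answer, and worth checking against the elasticsearch-hadoop version in use): esJsonRDD also accepts a full query-DSL JSON body as the query argument instead of the ?q=... URI form, so a complex query can be passed as a multi-line string. The body below is deliberately simplified and omits the nested filter from the question.

```scala
import org.elasticsearch.spark._

// Simplified query-DSL body; in practice this would be the full filtered/nested query.
val queryDsl =
  """{
    |  "query": {
    |    "query_string": {
    |      "default_operator": "AND",
    |      "query": "director.name:DAVID AND movie.name:SEVEN"
    |    }
    |  }
    |}""".stripMargin

val elasticRdds = sparkContext.esJsonRDD(esIndex, queryDsl)
```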