rdd

Get a range of columns of Spark RDD

橙三吉。 Submitted on 2019-12-01 16:23:19
Question: I now have 300+ columns in my RDD, and I need to dynamically select a range of columns and put them into the LabeledPoint data type. As a newbie to Spark, I am wondering if there is any index-based way to select a range of columns in an RDD, something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)... ? Any thought is welcomed and appreciated. Answer 1: If it is a DataFrame, then something like this should work: val df
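A minimal sketch (not the truncated answer above) of one way to do this in Scala, assuming Spark 2.x, a hypothetical input file data.csv whose selected columns are read as doubles, and that the label sits in column 0 while the features are the columns at positions 101..211:

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val spark = SparkSession.builder().appName("column-range").getOrCreate()
// data.csv is a hypothetical file; inferSchema is assumed to give double columns
val data = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")

// Pick columns by position: 101..211 (adjust for 0- vs 1-based indexing as needed)
val featureCols = data.columns.slice(101, 212)
val selected = data.select((data.columns(0) +: featureCols).map(data.col): _*)

// Build LabeledPoint(label, features) from each row
val labeled = selected.rdd.map { row =>
  val features = featureCols.indices.map(i => row.getDouble(i + 1)).toArray
  LabeledPoint(row.getDouble(0), Vectors.dense(features))
}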

Notes --------- Spark architecture principles & main components and processes

≡放荡痞女 Submitted on 2019-12-01 15:58:42
Main Spark components and processes:
driver (process): the Spark program we write runs on the driver and is executed by the driver process.
master (process): mainly responsible for resource scheduling and allocation, as well as cluster monitoring.
worker (process): mainly responsible for (1) using its own memory to store one or more partitions of an RDD, and (2) launching other processes and threads to process and compute the RDD's partitions in parallel.
executor (process): responsible for the parallel computation over RDD partitions, i.e. executing the operators we defined on the RDD, such as map/flatMap/reduce.
task (thread): executes the specified operator on the data of an RDD partition.
Rough steps of the Spark architecture:
After the driver process starts, it performs initialization; during this it sends a request to the Master to register the Spark application, essentially letting the Master know that a new Spark application is about to run.
After receiving the application's registration request, the Master sends requests to the Workers for resource scheduling and allocation; allocating resources essentially means allocating executors.
After receiving the Master's request, a Worker launches executors for the Spark application.
After the executors start, they register back with the driver, so the driver knows which executors are serving it.
After the driver has registered some executors
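A minimal driver-program sketch illustrating the flow above (the Master URL and resource settings are hypothetical): creating the SparkContext is what registers the application with the Master, the Master then has Workers launch Executors, and each partition of the RDD below becomes a task executed as a thread inside an Executor.

import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("architecture-demo")
      .setMaster("spark://master-host:7077")   // hypothetical standalone Master URL
      .set("spark.executor.memory", "1g")      // resources the Master allocates per Executor
    val sc = new SparkContext(conf)            // registers the application with the Master

    // 4 partitions -> 4 tasks, run as threads inside the Executors
    val result = sc.parallelize(1 to 1000, 4).map(_ * 2).count()
    println(result)
    sc.stop()
  }
}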

Trying to get spark streaming to read data stream from website, what is the socket?

China☆狼群 Submitted on 2019-12-01 14:26:17
I am trying to get this data http://stream.meetup.com/2/rsvps into a Spark stream. They are JSON objects; I know the lines will be strings, and I just want it to work before I try JSON. I am not sure what to put as the port, and I assume that is the problem.

SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Spark Streaming");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("http://stream.meetup.com/2/rsvps", 80);
lines.print();
jssc.start();
jssc.awaitTermination();

Here is my error: java.net
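For reference, socketTextStream(host, port) opens a raw TCP socket to host:port, so it cannot take an HTTP URL as above. One possible alternative (a hedged sketch, not the accepted answer) is a custom Receiver that opens the HTTP stream itself and pushes each line into the DStream:

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class HttpLineReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = {
    // Read the HTTP stream on a separate thread and push each line into Spark
    new Thread("http-line-receiver") {
      override def run(): Unit = {
        val reader = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"))
        var line = reader.readLine()
        while (!isStopped() && line != null) {
          store(line)
          line = reader.readLine()
        }
        reader.close()
      }
    }.start()
  }
  def onStop(): Unit = {}
}

// Usage with the JavaStreamingContext from the question:
// JavaReceiverInputDStream<String> lines =
//     jssc.receiverStream(new HttpLineReceiver("http://stream.meetup.com/2/rsvps"));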

Scan a Hadoop Database table in Spark using indices from an RDD

你离开我真会死。 Submitted on 2019-12-01 13:44:45
So if there is a table in the database shown as below:

Key2 DateTimeAge
AAA1 XXX XXX XXX
AAA2 XXX XXX XXX
AAA3 XXX XXX XXX
AAA4 XXX XXX XXX
AAA5 XXX XXX XXX
AAA6 XXX XXX XXX
AAA7 XXX XXX XXX
AAA8 XXX XXX XXX
BBB1 XXX XXX XXX
BBB2 XXX XXX XXX
BBB3 XXX XXX XXX
BBB4 XXX XXX XXX
BBB5 XXX XXX XXX
CCC1 XXX XXX XXX
CCC2 XXX XXX XXX
CCC3 XXX XXX XXX
CCC4 XXX XXX XXX
CCC5 XXX XXX XXX
CCC6 XXX XXX XXX
CCC7 XXX XXX XXX
DDD1 XXX XXX XXX
DDD2 XXX XXX XXX
DDD3 XXX XXX XXX
DDD4 XXX XXX XXX
DDD5 XXX XXX XXX
DDD6 XXX XXX XXX
DDD7 XXX XXX XXX

I have a 2nd table, given as:

1 AAA
2 DDD
3 CCC

Since AAA, DDD and CCC
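The question is cut off above, but one hedged sketch of the apparent goal (keep only the rows of the big table whose key starts with a prefix listed in the second table) is to broadcast the small prefix set and filter; the paths and record layout here are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("prefix-scan").setMaster("local[*]"))

// Assumed layout: "<key> <rest of row>" per line in a hypothetical HDFS file
val bigTable = sc.textFile("hdfs:///big_table")
  .map { line => val parts = line.split("\\s+", 2); (parts(0), parts(1)) }

// The second table gives the wanted prefixes; it is small, so broadcast it
val prefixes = sc.broadcast(Set("AAA", "DDD", "CCC"))
val wanted = bigTable.filter { case (key, _) => prefixes.value.exists(p => key.startsWith(p)) }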

Understanding Big Data Computing Frameworks and Platforms in One Article (repost)

淺唱寂寞╮ Submitted on 2019-12-01 13:22:23
1. Preface
The basic job of a computer is to process data: data in disk files, data streams or packets transmitted over the network, structured data in databases, and so on. As the internet, the Internet of Things, and related technologies are applied ever more widely, data volumes keep growing and TB or PB scale has become the norm; such data can no longer be processed by a single computer and must be handled by many machines together. Processing big data in a distributed environment involves not only interacting with the storage system but also dividing up computation tasks, distributing the computational load, and moving data between machines, and it must take data safety into account when a machine or the network fails, which makes the situation far more complex.
A simple example: suppose we want to compute the sales total of each product from a set of sales records. On a single machine we only need to scan the records once and accumulate the amount for each product. If the records are in a relational database it is even easier: one SQL statement will do. Now suppose there are far too many records and we need a scheme in which multiple computers compute the totals together. To make the computation correct, reliable, efficient, and convenient, the scheme has to consider questions such as:
How should tasks be assigned to each machine: first group the sales records by product category, with different machines handling different categories, or randomly send a portion of the records to each machine for tallying and finally merge each machine's results by product category?
Both approaches involve sorting the data; which sorting algorithm should be chosen, and on which machine should the sort run?
How do we define where each machine's input data comes from and where its results go? Is the data pushed by the sender, or sent only when the receiver requests it? If it is pushed, what happens when the receiver cannot keep up
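As a concrete illustration of the sales example, here is a minimal Spark sketch (the input path and record format are assumptions): reduceByKey spreads the per-product accumulation over the cluster and merges the partial sums, which is exactly the "group by product, then combine" choice discussed above.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sales-totals").setMaster("local[*]"))

// Hypothetical records of the form "product,amount"
val totals = sc.textFile("hdfs:///sales_records")
  .map { line => val f = line.split(","); (f(0), f(1).toDouble) }  // (product, amount)
  .reduceByKey(_ + _)                                              // partial sums per partition, merged per product
totals.collect().foreach(println)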

Spark-Java Operators

99封情书 Submitted on 2019-12-01 12:11:33
package scala.spark.Day3;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

import java.util.Arrays;
import java.util.List;

/**
 * Created by Administrator on 2019/10/16.
 */
public class JavaRDDTest {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.6.0-cdh5.15.0\\hadoop-2.6.0-cdh5.15.0");
        // JavaRDD       standard RDD
        // JavaPairRDD   pair RDD
        // JavaDoubleRDD double RDD
        /

When is an RDD lineage created? How to find the lineage graph?

两盒软妹~` Submitted on 2019-12-01 11:37:41
Question: I am learning Apache Spark and trying to get the lineage graph of the RDDs, but I could not find out when a particular lineage is created. Also, where do I find the lineage of an RDD? Answer 1: RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD. Note the part "logical", not "physical", which happens after you've executed an action. Quoting the Mastering Apache Spark 2 gitbook: RDD Lineage (aka RDD operator
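A small sketch of where to look in practice: every transformation extends the lineage, and rdd.toDebugString prints the lineage (the RDD operator graph) of the resulting RDD.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))
val base = sc.parallelize(1 to 100, 4)
val mapped = base.map(_ * 2)              // lineage so far: parallelize -> map
val filtered = mapped.filter(_ % 3 == 0)  // lineage so far: parallelize -> map -> filter
println(filtered.toDebugString)           // prints the lineage graph as text; no action has run yet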

How does Spark recover the data from a failed node?

南笙酒味 Submitted on 2019-12-01 11:27:34
Suppose we have an RDD that is used multiple times. To avoid recomputing it again and again, we persisted this RDD using the rdd.persist() method, so the nodes computing the RDD will store their partitions. Now suppose the node holding a persisted partition of the RDD fails: what will happen? How will Spark recover the lost data? Is there any replication mechanism, or some other mechanism? When you do rdd.persist, the rdd doesn't materialize the content. It does when you perform an action on the rdd. It follows the same lazy evaluation
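A short sketch of the two recovery options touched on above, assuming an existing SparkContext sc and a hypothetical input path: by default a lost cached partition is recomputed from the RDD's lineage, while the _2 storage levels keep a replica on a second node so recomputation is usually unnecessary.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///input").map(_.length)
data.persist(StorageLevel.MEMORY_ONLY)      // lost cached partitions are recomputed from lineage
// data.persist(StorageLevel.MEMORY_ONLY_2) // alternative: replicate each cached partition to two nodes
data.count()                                // the action that actually materializes and caches the partitions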

Return RDD of largest N values from another RDD in SPARK

一曲冷凌霜 Submitted on 2019-12-01 11:09:36
I'm trying to filter an RDD of tuples to return the largest N tuples based on key values. I need the return format to be an RDD. So the RDD: [(4, 'a'), (12, 'e'), (2, 'u'), (49, 'y'), (6, 'p')] filtered for the largest 3 keys should return the RDD: [(6,'p'), (12,'e'), (49,'y')] Doing a sortByKey() and then take(N) returns the values and doesn't result in an RDD, so that won't work. I could return all of the keys, sort them, find the Nth largest value, and then filter the RDD for key values greater than that, but that seems very inefficient. What would be the best way to do this? zero323 With
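The answer is cut off above; one common way to keep the result as an RDD (shown here in Scala, though the question looks like PySpark, and assuming an existing SparkContext sc) is to take the N largest tuples by key with top() on the driver and parallelize them back. Only N elements ever reach the driver, so this stays cheap even for large inputs.

val rdd = sc.parallelize(Seq((4, "a"), (12, "e"), (2, "u"), (49, "y"), (6, "p")))
val n = 3
// top() uses the given Ordering (here: the key) and returns an Array on the driver
val largest = sc.parallelize(rdd.top(n)(Ordering.by[(Int, String), Int](_._1)))
largest.collect().foreach(println)   // (49,y), (12,e), (6,p)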