spark-graphx

How to find membership of vertices using GraphFrames or igraph or networkx in pyspark

放肆的年华 submitted on 2019-12-25 01:49:01
Question: My input dataframe df is:

       valx      valy
   1:  600060  09283744
   2:  600131  96733110
   3:  600194  01700001

I want to create a graph treating the two columns above as an edge list, and then my output should list all vertices of the graph with their membership. I have tried GraphFrames in pyspark and the networkx library too, but I am not getting the desired results. My output should look like the following (basically all valx and valy values as vertices under V1, and their membership info under V2):

   V1         V2
   600060     1
   96733110   1
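A minimal sketch of one way to get per-vertex membership with the GraphFrames Scala API (the graphframes package is an assumed dependency, and the checkpoint directory path is illustrative). Note that connectedComponents assigns arbitrary long component ids rather than 1, 2, 3:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("vertex-membership").getOrCreate()
import spark.implicits._

val df = Seq(("600060", "09283744"),
             ("600131", "96733110"),
             ("600194", "01700001")).toDF("valx", "valy")

// GraphFrames expects vertices in an "id" column and edges in "src"/"dst" columns.
val vertices = df.select(col("valx").as("id"))
  .union(df.select(col("valy").as("id")))
  .distinct()
val edges = df.select(col("valx").as("src"), col("valy").as("dst"))

spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints") // required by connectedComponents
val membership = GraphFrame(vertices, edges).connectedComponents.run() // columns: id, component
membership.show()
```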

Spark GraphX - How can I read from a JSON file in Spark and create a graph from the data?

家住魔仙堡 submitted on 2019-12-24 14:17:16
Question: I'm new to Spark and Scala, and I am trying to read a bunch of Twitter data from a JSON file and turn it into a graph where a vertex represents a tweet and an edge connects a tweet to the original posted item it is a re-tweet of. So far I have managed to read from the JSON file and figure out the schema of my RDD. Now I believe I need to somehow take the data from the SchemaRDD object and create an RDD for the vertices and an RDD for the edges. Is this the way to approach this or is
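A hedged sketch of one way to go from the JSON file to vertex and edge RDDs, using the current DataFrame API rather than the old SchemaRDD; the field names id, text and retweeted_status_id, and the file path, are assumptions about the tweet schema, not taken from the question:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("tweet-graph").getOrCreate()
val tweets = spark.read.json("tweets.json") // hypothetical path

// Vertices: one per tweet, keyed by the (assumed numeric) tweet id, keeping the text as the attribute.
val vertices: RDD[(VertexId, String)] = tweets
  .select("id", "text")
  .rdd
  .map(row => (row.getAs[Long]("id"), row.getAs[String]("text")))

// Edges: retweet -> original tweet, only for rows that actually carry a retweeted id.
val edges: RDD[Edge[String]] = tweets
  .select("id", "retweeted_status_id")
  .na.drop()
  .rdd
  .map(row => Edge(row.getAs[Long]("id"), row.getAs[Long]("retweeted_status_id"), "retweet"))

val graph = Graph(vertices, edges)
```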

Generate `VertexId` from pairs of `String`

南楼画角 submitted on 2019-12-24 12:03:09
Question: I'm using GraphX to process some graph data on Spark. The input data is given as RDD[(String, String)]. I used the following snippet to map String to VertexId and build the graph.

```scala
val input: RDD[(String, String)] = ...

val vertexIds = input.map(_._1)
  .union(input.map(_._2))
  .distinct()
  .zipWithUniqueId()
  .cache()

val edges = input.join(vertexIds)
  .map { case (u, (v, uid)) => (v, uid) }
  .join(vertexIds)
  .map { case (v, (uid, vid)) => Edge(uid, vid, 1) }

val graph = Graph(vertexIds.map { case
```
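The snippet is cut off above; a hedged completion of the same zipWithUniqueId approach is sketched below, with the final map that turns the (name, id) pairs back into vertices reconstructed rather than taken from the original:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

def buildGraph(input: RDD[(String, String)]): Graph[String, Int] = {
  // Assign a unique Long id to every distinct string vertex name.
  val vertexIds: RDD[(String, VertexId)] = input.map(_._1)
    .union(input.map(_._2))
    .distinct()
    .zipWithUniqueId()
    .cache()

  // Translate each (srcName, dstName) pair into an Edge over the generated ids.
  val edges: RDD[Edge[Int]] = input.join(vertexIds)
    .map { case (_, (dstName, srcId)) => (dstName, srcId) }
    .join(vertexIds)
    .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }

  // Keep the original string as the vertex attribute.
  val vertices: RDD[(VertexId, String)] = vertexIds.map { case (name, id) => (id, name) }

  Graph(vertices, edges)
}
```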

Timeout Exception in Apache-Spark during program Execution

孤者浪人 submitted on 2019-12-22 04:12:12
Question: I am running a Bash script on a Mac. This script calls a Spark method written in Scala a large number of times. I am currently trying to call this Spark method 100,000 times using a for loop. The code exits with the following exception after running a small number of iterations, around 3,000:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
  at org.apache.spark.rpc.RpcTimeout
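One common mitigation (not necessarily the root cause here, which may be driver memory or GC pressure from submitting so many jobs) is to raise the heartbeat and network timeouts when building the session. A minimal sketch with illustrative values:

```scala
import org.apache.spark.sql.SparkSession

// Values are illustrative only; spark.network.timeout must stay larger than
// spark.executor.heartbeatInterval.
val spark = SparkSession.builder()
  .appName("long-running-loop")
  .config("spark.executor.heartbeatInterval", "60s")
  .config("spark.network.timeout", "600s")
  .getOrCreate()
```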

Spark Scala - Joining two arrays by VertexID

这一生的挚爱 submitted on 2019-12-13 08:09:12
Question: I have 2 arrays in the following format:

```scala
scala> cPV.take(5)
res18: Array[(org.apache.spark.graphx.VertexId, String)] = Array((-496366541,7804412), (183389035,11517829), (1300761459,36164965), (978932066,32135154), (370291237,40355685))

scala> fC.take(5)
res19: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((386253628,1), (-1141923433,1), (1871855296,7), (1938255756,1), (-749015657,5))
```

I need to join them to get the format Array[(org.apache.spark.graphx.VertexId, Int, String)]. I
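A minimal sketch of one way to do this: convert the two arrays back into pair RDDs, join on the VertexId key, then flatten the value pair into a triple. Parallelizing the arrays is an assumption; if the original RDDs are still around, the join can be done on those directly.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.VertexId

def joinById(sc: SparkContext,
             cPV: Array[(VertexId, String)],
             fC: Array[(VertexId, Int)]): Array[(VertexId, Int, String)] =
  sc.parallelize(fC)
    .join(sc.parallelize(cPV))                          // (id, (count, label))
    .map { case (id, (count, label)) => (id, count, label) }
    .collect()
```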

How to export Spark GraphX graph to Gephi using scala

怎甘沉沦 submitted on 2019-12-13 04:00:34
Question: I have a graph in Spark collected from different data sources. Is there a simple way to export a Spark GraphX graph to Gephi for visualization using Scala? Are there any common data formats?

Answer 1: As far as I can tell, the only way you can export the graph directly is to use some variation of CSV. All other formats supported by Gephi cannot easily be written in parallel. The problem with basic CSV is that it doesn't support attributes. Since the amount of data you can visualize using Gephi is rather limited, a
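A hedged sketch of the CSV route described above: write a node list and an edge list in the simple Id,Label / Source,Target layout that Gephi's spreadsheet importer understands. The output paths and column order are illustrative assumptions, and headers would need to be added or mapped at import time:

```scala
import org.apache.spark.graphx.Graph

def exportForGephi[VD, ED](graph: Graph[VD, ED], outDir: String): Unit = {
  // One line per vertex: Id,Label
  graph.vertices
    .map { case (id, attr) => s"$id,$attr" }
    .saveAsTextFile(s"$outDir/nodes")

  // One line per edge: Source,Target
  graph.edges
    .map(e => s"${e.srcId},${e.dstId}")
    .saveAsTextFile(s"$outDir/edges")
}
```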

how to sum edge weights with graphx

瘦欲@ submitted on 2019-12-13 02:20:41
Question: I have a Graph[Int, Int], where each edge has a weight value. What I want to do is, for each user, collect all in-edges and sum the weights associated with them. Say the data is like:

```scala
import org.apache.spark.graphx._

val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val
```
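The snippet is cut off above; a minimal sketch of one way to sum incoming edge weights with aggregateMessages, assuming the edge attribute is the weight:

```scala
import org.apache.spark.graphx.{Graph, VertexRDD}

def sumInEdgeWeights[VD](graph: Graph[VD, Int]): VertexRDD[Int] =
  graph.aggregateMessages[Int](
    triplet => triplet.sendToDst(triplet.attr), // send each edge's weight to its destination vertex
    _ + _                                       // sum the weights arriving at each vertex
  )
```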

how to compute vertex similarity to neighbors in graphx

不问归期 submitted on 2019-12-12 04:17:34
Question: Suppose we have a simple graph like:

```scala
val users = sc.parallelize(Array(
  (1L, Seq("M", 2014, 40376, null, "N", 1, "Rajastan")),
  (2L, Seq("M", 2009, 20231, null, "N", 1, "Rajastan")),
  (3L, Seq("F", 2016, 40376, null, "N", 1, "Rajastan"))
))
val edges = sc.parallelize(Array(
  Edge(1L, 2L, ""),
  Edge(1L, 3L, ""),
  Edge(2L, 3L, "")))
val graph = Graph(users, edges)
```

I'd like to compute how similar each vertex is to its neighbors on each attribute. The ideal output (an RDD or DataFrame) would hold
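A hedged sketch of one possible similarity score: for each edge, compare the two attribute sequences position by position, send the fraction of matching positions to both endpoints, and average per vertex. This particular definition of similarity is an assumption, not taken from the question:

```scala
import org.apache.spark.graphx.{Graph, VertexRDD}

def neighborSimilarity(graph: Graph[Seq[Any], String]): VertexRDD[Double] = {
  // For every vertex, accumulate (sum of per-edge similarities, number of neighbors).
  val sums = graph.aggregateMessages[(Double, Int)](
    triplet => {
      val a = triplet.srcAttr
      val b = triplet.dstAttr
      val matches = a.zip(b).count { case (x, y) => x == y }
      val sim = matches.toDouble / math.max(a.length, b.length)
      triplet.sendToSrc((sim, 1))
      triplet.sendToDst((sim, 1))
    },
    (m1, m2) => (m1._1 + m2._1, m1._2 + m2._2)
  )
  // Average similarity over all neighbors of each vertex.
  sums.mapValues { case (total, n) => total / n }
}
```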

What kind of variable to select for incrementing node labels in a community detection algorithm

血红的双手。 submitted on 2019-12-11 17:15:04
Question: I am working on a community detection algorithm that uses the concept of propagating labels to nodes. I am having trouble selecting the right type for the Label_counter variable. There is an algorithm called LPA (label propagation algorithm) which propagates labels to nodes through iterations. Think of labels as a node property. The initial label for each node is the node id, and in each iteration nodes update their label to the most frequent label among their neighbors. The algorithm I am
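For reference, GraphX already ships a label propagation implementation whose label type is simply VertexId (a Long), updated each superstep to the most frequent neighbor label. A minimal sketch of invoking it, with maxSteps chosen arbitrarily:

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.lib.LabelPropagation

// Returns a graph whose vertex attribute is the community label (a Long).
def communities[VD, ED: scala.reflect.ClassTag](graph: Graph[VD, ED]): Graph[Long, ED] =
  LabelPropagation.run(graph, maxSteps = 5)
```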

efficiently calculating connected components in pyspark

只谈情不闲聊 submitted on 2019-12-11 11:02:36
Question: I'm trying to find the connected components for friends in a city. My data is a list of edges with a city attribute:

City    | SRC     | DEST
Houston | Kyle    -> Benny
Houston | Benny   -> Charles
Houston | Charles -> Denny
Omaha   | Carol   -> Brian

etc.

I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components, and I'd like to avoid that. How would I do so?

Edit: I thought I could do something like

select connected_components
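A hedged sketch with the GraphFrames Scala API (the graphframes package and checkpoint path are assumptions): if friendships never cross cities, the components computed over the full edge list are already per-city, so the city column can simply ride along as an edge attribute or be joined back afterwards:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("city-components").getOrCreate()
import spark.implicits._

// Edge list with the city carried as an extra column; "src"/"dst" are the names GraphFrames expects.
val edges = Seq(
  ("Houston", "Kyle", "Benny"),
  ("Houston", "Benny", "Charles"),
  ("Houston", "Charles", "Denny"),
  ("Omaha", "Carol", "Brian")
).toDF("city", "src", "dst")

val vertices = edges.select($"src".as("id"))
  .union(edges.select($"dst".as("id")))
  .distinct()

spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints") // required by connectedComponents
val components = GraphFrame(vertices, edges).connectedComponents.run() // columns: id, component
components.show()
```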