How do I run graphx with Python / pyspark?

Submitted by 久未见 on 2019-12-03 09:11:37

Question


I am trying to run Spark GraphX from Python using pyspark. My installation appears to be correct, since I can run both the pyspark tutorials and the (Java) GraphX tutorials without problems. Since GraphX is part of Spark, pyspark should presumably be able to interface with it, correct?

Here are the tutorials for pyspark:
http://spark.apache.org/docs/0.9.0/quick-start.html
http://spark.apache.org/docs/0.9.0/python-programming-guide.html

Here are the ones for GraphX:
http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

Can anyone convert the GraphX tutorial to Python?


Answer 1:


It looks like the Python bindings for GraphX have been postponed repeatedly: first to at least Spark 1.4, then to 1.5, and now indefinitely. They are waiting behind the Java API.

You can track the status in the ASF JIRA issue SPARK-3789, "[GRAPHX] Python bindings for GraphX".




Answer 2:


You should look at GraphFrames (https://github.com/graphframes/graphframes), which wraps GraphX algorithms under the DataFrames API and provides a Python interface.

Here is a quick example from https://graphframes.github.io/graphframes/docs/_site/quick-start.html, lightly modified so that it runs.

First, start pyspark with the graphframes package loaded:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Then, in the pyspark shell (which provides sqlContext), run the following Python code:

from graphframes import *

# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
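
To make it clearer what the first two queries compute, here is a minimal plain-Python sketch over the same three edges. It does not use Spark or GraphFrames at all; the helper names in_degrees and count_relationship are made up for illustration and are not part of any library.

```python
from collections import Counter

# Same edges as in the GraphFrames example: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]


def in_degrees(edge_list):
    """Count incoming edges per destination vertex (hypothetical helper)."""
    return dict(Counter(dst for _src, dst, _rel in edge_list))


def count_relationship(edge_list, relationship):
    """Count edges whose relationship matches (hypothetical helper)."""
    return sum(1 for _src, _dst, rel in edge_list if rel == relationship)


print(in_degrees(edges))                   # in-degree per vertex
print(count_relationship(edges, "follow"))  # number of "follow" edges
```

Vertex "a" has no incoming edges, so it does not appear in the result; as far as I know, g.inDegrees likewise omits vertices with zero in-edges.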



Answer 3:


GraphX in Spark 0.9.0 doesn't have a Python API yet. One is expected in upcoming releases.



Source: https://stackoverflow.com/questions/23302270/how-do-i-run-graphx-with-python-pyspark
