Running pyspark.mllib on Ubuntu

Question

I'm trying to use Spark from Python. The code below is test.py, which I put under ~/spark/python:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

When I run python test.py, I get this error message:

Exception in thread "main" java.lang.IllegalStateException: Library directory '/home/user/spark/lib_managed/jars' does not exist.
        at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:208)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:119)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:195)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
        at org.apache.spark.launcher.Main.main(Main.java:86)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    conf = SparkConf().setAppName(appName).setMaster(master)
  File "/home/user/spark/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/home/user/spark/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/home/user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number

When I move test.py to ~/spark and run it there, I get:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark

I cloned the Spark project from the official website. OS: Ubuntu. Java version: 1.7.0_79. Python version: 2.7.11.

Can anyone give me some tips to solve this problem?

Answer 1:

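Spark programs are normally submitted through spark-submit, which sets up the classpath and the Python path for you. For more information, see the Spark documentation on submitting applications.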

You should try running: $SPARK_HOME/bin/spark-submit test.py instead of python test.py.
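If you want to launch the script from Python rather than typing the command by hand, a minimal sketch could look like the following. The helper names spark_submit_command and submit are hypothetical (they are not part of Spark), and the sketch assumes SPARK_HOME points at a working Spark installation.

```python
import os
import subprocess


def spark_submit_command(script, spark_home=None):
    """Build the argument list for $SPARK_HOME/bin/spark-submit <script>."""
    # Fall back to the SPARK_HOME environment variable if no path is given.
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    return [os.path.join(spark_home, "bin", "spark-submit"), script]


def submit(script, spark_home=None):
    """Run the script through spark-submit and return its exit code."""
    return subprocess.call(spark_submit_command(script, spark_home))
```

Calling submit("test.py") is then equivalent to running $SPARK_HOME/bin/spark-submit test.py in a shell.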

Answer 2:

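Check whether you have set SPARK_HOME and added its Python libraries to PYTHONPATH. The "No module named pyspark" error means Python cannot find the pyspark package on its path.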
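As a rough sketch of what that setup amounts to, the snippet below adds Spark's python directory and the bundled Py4J source zip to sys.path at runtime. The function names pyspark_paths and add_pyspark_to_path are made up for illustration, and the layout ($SPARK_HOME/python plus a py4j-*-src.zip under python/lib) is an assumption based on how Spark distributions are usually organized; check your own installation.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the path entries needed for `import pyspark` to work."""
    python_dir = os.path.join(spark_home, "python")
    # Py4J ships as a source zip; its version number varies by Spark release.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_pyspark_to_path(spark_home=None):
    """Set SPARK_HOME and put the PySpark libraries on sys.path."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    os.environ["SPARK_HOME"] = spark_home
    for entry in pyspark_paths(spark_home):
        if entry not in sys.path:
            sys.path.insert(0, entry)
```

Calling add_pyspark_to_path() before the pyspark imports in test.py should have the same effect as exporting PYTHONPATH in the shell, provided the Spark installation itself is complete.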

Also, you wrote:

"I clone Spark project from the official website"

This is not recommended, as it can cause a lot of dependency issues. You could try downloading a pre-built version with Hadoop, then test it in local mode using the instructions linked in the original answer.
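For a local-mode test, note that the script in the question never defines appName or master. A minimal sketch with both filled in might look like this; the values "fpgrowth-test" and "local[2]" and the function names are placeholders chosen for illustration, and the data path is the relative one from the question, so the script has to be run from the Spark root directory.

```python
def parse_transaction(line):
    """Split one space-separated line of the input file into a list of items."""
    return line.strip().split(" ")


def run_fpgrowth(path="data/mllib/sample_fpgrowth.txt",
                 master="local[2]", app_name="fpgrowth-test"):
    """Train FPGrowth in local mode and return the frequent itemsets."""
    # Imported inside the function so this file can be loaded without Spark.
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.fpm import FPGrowth

    conf = SparkConf().setAppName(app_name).setMaster(master)
    sc = SparkContext(conf=conf)
    try:
        transactions = sc.textFile(path).map(parse_transaction)
        model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
        return model.freqItemsets().collect()
    finally:
        # Always release the context, even if training fails.
        sc.stop()
```

To print the results as in the original script, loop over run_fpgrowth() and print each item.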

Source: https://stackoverflow.com/questions/38323267/running-pyspark-mllib-on-ubuntu

Tags: python-2.7, ubuntu, apache-spark, pyspark, apache-spark-mllib