I\'m trying to run this example from apache.spark.org (code is below & entire tutorial is here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html) using
First of all, I am a newcomer to Spark, so others may have quicker or better solutions. I ran into the same difficulties to run this sample code. I manage to make it work, mainly by:
spark-env.sh:
export SPARK_MASTER_IP=192.168.1.53
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_DAEMON_MEMORY=1G
# Worker : 1 by server
# Number of worker instances to run on each machine (default: 1).
# You can make this more than 1 if you have have very large machines and would like multiple Spark worker processes.
# If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker,
# or else each worker will try to use all the cores.
export SPARK_WORKER_INSTANCES=2
# Total number of cores to allow Spark applications to use on the machine (default: all available cores).
export SPARK_WORKER_CORES=7
#Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g
# (default: total memory minus 1 GB);
# note that each application's individual memory is configured using its spark.executor.memory property.
export SPARK_WORKER_MEMORY=8G
export SPARK_WORKER_DIR=/tmp
# Executor : 1 by application run on the server
# export SPARK_EXECUTOR_INSTANCES=4
# export SPARK_EXECUTOR_MEMORY=4G
export SPARK_SCALA_VERSION="2.10"
Scala file to run the example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
object SparkDemo {
def log[A](key:String)(job : =>A) = {
val start = System.currentTimeMillis
val output = job
println("===> %s in %s seconds"
.format(key, (System.currentTimeMillis - start) / 1000.0))
output
}
def main(args: Array[String]):Unit ={
val modelName ="w2vModel"
val sc = new SparkContext(
new SparkConf()
.setAppName("SparkDemo")
.set("spark.executor.memory", "8G")
.set("spark.driver.maxResultSize", "16G")
.setMaster("spark://192.168.1.53:7077") // ip of the spark master.
// .setMaster("local[2]") // does not work... workers loose contact with the master after 120s
)
// take a look into target folder if you are unsure how the jar is named
// onliner to compile / run : sbt package && sbt run
sc.addJar("./target/scala-2.10/sparkling_2.10-0.1.jar")
val input = sc.textFile("./text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = log("compute model") { word2vec.fit(input) }
log ("save model") { model.save(sc, modelName) }
val synonyms = model.findSynonyms("china", 40)
for((synonym, cosineSimilarity) <- synonyms) {
println(s"$synonym $cosineSimilarity")
}
val model2 = log("reload model") { Word2VecModel.load(sc, modelName) }
}
}