apache-spark

MongoDB Spark Connector - aggregation is slow

荒凉一梦 submitted on 2021-02-06 12:50:32
Question: I am running the same aggregation pipeline in a Spark application and on the mongos console. On the console the data is fetched in the blink of an eye, and only a second use of "it" is needed to retrieve all the expected data. The Spark application, however, takes almost two minutes according to the Spark Web UI. As you can see, 242 tasks are launched to fetch the result. I am not sure why such a high number of tasks is launched when there are only 40 documents being returned by the …
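
One detail worth illustrating here is that the aggregation can be pushed down to MongoDB instead of being evaluated after Spark has pulled the collection across many partitions. A minimal sketch, assuming the MongoDB Spark Connector 2.x option names; the URI, namespace, and filter fields are hypothetical placeholders, not taken from the question:

```python
# Minimal sketch, not the asker's code: push the $match down to MongoDB via the
# connector's "pipeline" read option so each Spark task only receives matching
# documents. Option names follow the MongoDB Spark Connector 2.x; the URI,
# namespace, and filter fields are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-aggregation")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongos-host:27017/mydb.mycollection")  # placeholder URI
         .getOrCreate())

# The same $match you would run on the mongos console, serialized as a string.
pipeline = "{'$match': {'status': 'ACTIVE'}}"

df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("pipeline", pipeline)   # executed server-side, before data reaches Spark
      .load())

print(df.count())
```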

convert dataframe to libsvm format

心不动则不痛 submitted on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query: df1 = sqlContext.sql("select * from table_test") . I need to convert this dataframe to libsvm format so that it can be provided as input to pyspark.ml.classification.LogisticRegression . I tried the following, but it resulted in the error below because I'm using Spark 1.5.2: df1.write.format("libsvm").save("data/foo") gives Failed to load class for data source: libsvm . I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and …
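
One workaround for Spark 1.5.x, where the libsvm data source does not exist, is to map the dataframe to an RDD of LabeledPoint and write it out with MLUtils.saveAsLibSVMFile. A minimal sketch, assuming the first column of table_test is the label and the remaining columns are numeric features (an assumption about the schema, not something stated above):

```python
# Minimal sketch for Spark 1.5.x, where the "libsvm" data source is unavailable:
# map the dataframe to LabeledPoints and save with MLUtils.saveAsLibSVMFile.
# Assumes the first column of table_test is the label and the remaining columns
# are numeric features; the real schema may differ. sqlContext comes from the
# pyspark shell, as in the question.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

df1 = sqlContext.sql("select * from table_test")

def to_labeled_point(row):
    values = [float(x) for x in row]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

labeled = df1.rdd.map(to_labeled_point)
MLUtils.saveAsLibSVMFile(labeled, "data/foo")

# The saved data can be read back later with:
# MLUtils.loadLibSVMFile(sc, "data/foo")
```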

Where does spark look for text files?

三世轮回 submitted on 2021-02-06 10:18:28
Question: I thought that loading text files is done only from the workers / within the cluster (you just need to make sure all workers have access to the same path, either by having that text file available on all nodes, or by using some shared folder mapped to the same path). E.g. spark-submit / spark-shell can be launched from anywhere, connect to a Spark master, and the machine where you launched spark-submit / spark-shell (which is also where the driver runs, unless you are in "cluster" deploy mode) …
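
For illustration, a minimal sketch of how the path scheme determines where the file must exist; the paths are placeholders, not taken from the question:

```python
# Minimal sketch of how the path scheme decides where the file must exist.
# The paths are placeholders, not taken from the question.
from pyspark import SparkContext

sc = SparkContext(appName="textfile-paths")

# file:// paths are resolved on each executor's local filesystem, so the file
# must sit at the same path on every worker node (and the driver needs it too,
# since the driver lists the file to compute the input splits).
local_rdd = sc.textFile("file:///data/input.txt")

# A cluster filesystem path (HDFS, S3, a shared NFS mount) avoids the problem.
shared_rdd = sc.textFile("hdfs:///user/me/input.txt")

print(local_rdd.count(), shared_rdd.count())
```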

YARN not preempting resources based on fair shares when running a Spark job

只愿长相守 submitted on 2021-02-06 09:50:10
Question: I have a problem with re-balancing Apache Spark job resources on YARN Fair Scheduler queues. For the tests I have configured Hadoop 2.6 (2.7 was also tried) to run in pseudo-distributed mode with local HDFS on macOS. For job submission I used the "Pre-built Spark 1.4 for Hadoop 2.6 and later" distribution (1.5 was also tried) from Spark's website. When tested with a basic configuration on Hadoop MapReduce jobs, the Fair Scheduler works as expected: when the resources of the cluster exceed some maximum, fair shares …
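
For context, Fair Scheduler preemption has to be switched on explicitly and tuned per queue. A minimal sketch of the relevant knobs, with queue names, weights, and timeouts that are purely illustrative and not taken from the question:

```xml
<!-- Minimal sketch of a fair-scheduler.xml allocation file showing the
     preemption knobs; queue names, weights, and timeouts are illustrative,
     not taken from the question. Preemption must also be enabled globally
     in yarn-site.xml with yarn.scheduler.fair.preemption = true. -->
<allocations>
  <queue name="spark">
    <weight>1.0</weight>
    <!-- Preempt containers from other queues once this queue has been below
         half of its fair share for 30 seconds. -->
    <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
  </queue>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
  <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
</allocations>
```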

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

我的梦境 submitted on 2021-02-06 09:22:31
Question: I have a Spark dataframe ( prof_student_df ) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can only be matched with one student for a single time frame. For example, here are the pairings/scores for one time frame.
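
Since each time frame has only 4 professors and 4 students, one straightforward (non-recursive) way to get the maximum-score matching is brute force over the 4! = 24 permutations per timestamp. A minimal sketch, assuming prof_student_df has columns timestamp, professor, student, and score (names inferred from the description, not confirmed):

```python
# Minimal sketch, not the asker's solution: with only 4 professors and 4 students
# per time frame, brute force over the 4! = 24 permutations finds the optimal
# one-to-one matching. Column names (timestamp, professor, student, score) are
# assumed from the description; prof_student_df is the dataframe from the question.
from itertools import permutations

def best_pairing(rows):
    """Return (total_score, [(professor, student), ...]) for one time frame."""
    rows = list(rows)
    profs = sorted({r.professor for r in rows})
    studs = sorted({r.student for r in rows})
    score = {(r.professor, r.student): r.score for r in rows}
    best = None
    for perm in permutations(studs):
        pairs = list(zip(profs, perm))
        total = sum(score[p] for p in pairs)
        if best is None or total > best[0]:
            best = (total, pairs)
    return best

result = (prof_student_df.rdd
          .groupBy(lambda r: r.timestamp)
          .mapValues(best_pairing)
          .collect())
```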
