apache-spark

MongoDB Spark Connector - aggregation is slow

荒凉一梦 submitted on 2021-02-06 12:50:32
Question: I am running the same aggregation pipeline in a Spark application and on the mongos console. On the console the data is fetched in the blink of an eye, and only a second use of "it" is needed to retrieve all the expected data. The Spark application, however, takes almost two minutes according to the Spark Web UI. As you can see, 242 tasks are launched to fetch the result. I am not sure why such a high number of tasks is launched when there are only 40 documents being returned by the …
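
One detail worth illustrating here is that the aggregation can be pushed down to MongoDB instead of being evaluated after Spark has pulled the collection across many partitions. A minimal sketch, assuming the MongoDB Spark Connector 2.x option names; the URI, namespace, and filter fields are hypothetical placeholders, not taken from the question:

```python
# Minimal sketch, not the asker's code: push the $match down to MongoDB via the
# connector's "pipeline" read option so each Spark task only receives matching
# documents. Option names follow the MongoDB Spark Connector 2.x; the URI,
# namespace, and filter fields are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-aggregation")
         .config("spark.mongodb.input.uri",
                 "mongodb://mongos-host:27017/mydb.mycollection")  # placeholder URI
         .getOrCreate())

# The same $match you would run on the mongos console, serialized as a string.
pipeline = "{'$match': {'status': 'ACTIVE'}}"

df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("pipeline", pipeline)   # executed server-side, before data reaches Spark
      .load())

print(df.count())
```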

convert dataframe to libsvm format

心不动则不痛 submitted on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query: df1 = sqlContext.sql("select * from table_test") . I need to convert this dataframe to libsvm format so that it can be provided as input to pyspark.ml.classification.LogisticRegression . I tried the following, but it resulted in the error below because I'm using Spark 1.5.2: df1.write.format("libsvm").save("data/foo") gives Failed to load class for data source: libsvm . I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and …
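
One workaround for Spark 1.5.x, where the libsvm data source does not exist, is to map the dataframe to an RDD of LabeledPoint and write it out with MLUtils.saveAsLibSVMFile. A minimal sketch, assuming the first column of table_test is the label and the remaining columns are numeric features (an assumption about the schema, not something stated above):

```python
# Minimal sketch for Spark 1.5.x, where the "libsvm" data source is unavailable:
# map the dataframe to LabeledPoints and save with MLUtils.saveAsLibSVMFile.
# Assumes the first column of table_test is the label and the remaining columns
# are numeric features; the real schema may differ. sqlContext comes from the
# pyspark shell, as in the question.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

df1 = sqlContext.sql("select * from table_test")

def to_labeled_point(row):
    values = [float(x) for x in row]
    return LabeledPoint(values[0], Vectors.dense(values[1:]))

labeled = df1.rdd.map(to_labeled_point)
MLUtils.saveAsLibSVMFile(labeled, "data/foo")

# The saved data can be read back later with:
# MLUtils.loadLibSVMFile(sc, "data/foo")
```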

Where does spark look for text files?

三世轮回 submitted on 2021-02-06 10:18:28
Question: I thought that loading text files is done only from the workers / within the cluster (you just need to make sure all workers have access to the same path, either by having that text file available on all nodes, or by using some shared folder mapped to the same path). E.g. spark-submit / spark-shell can be launched from anywhere, connect to a Spark master, and the machine where you launched spark-submit / spark-shell (which is also where the driver runs, unless you are in "cluster" deploy mode) …
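
For illustration, a minimal sketch of how the path scheme determines where the file must exist; the paths are placeholders, not taken from the question:

```python
# Minimal sketch of how the path scheme decides where the file must exist.
# The paths are placeholders, not taken from the question.
from pyspark import SparkContext

sc = SparkContext(appName="textfile-paths")

# file:// paths are resolved on each executor's local filesystem, so the file
# must sit at the same path on every worker node (and the driver needs it too,
# since the driver lists the file to compute the input splits).
local_rdd = sc.textFile("file:///data/input.txt")

# A cluster filesystem path (HDFS, S3, a shared NFS mount) avoids the problem.
shared_rdd = sc.textFile("hdfs:///user/me/input.txt")

print(local_rdd.count(), shared_rdd.count())
```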

YARN not preempting resources based on fair shares when running a Spark job

只愿长相守 submitted on 2021-02-06 09:50:10
Question: I have a problem with re-balancing Apache Spark job resources on YARN Fair Scheduler queues. For the tests I have configured Hadoop 2.6 (2.7 was also tried) to run in pseudo-distributed mode with local HDFS on macOS. For job submission I used the "Pre-built Spark 1.4 for Hadoop 2.6 and later" distribution (1.5 was also tried) from Spark's website. When tested with a basic configuration on Hadoop MapReduce jobs, the Fair Scheduler works as expected: when the resources of the cluster exceed some maximum, fair shares …
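
For context, Fair Scheduler preemption has to be switched on explicitly and tuned per queue. A minimal sketch of the relevant knobs, with queue names, weights, and timeouts that are purely illustrative and not taken from the question:

```xml
<!-- Minimal sketch of a fair-scheduler.xml allocation file showing the
     preemption knobs; queue names, weights, and timeouts are illustrative,
     not taken from the question. Preemption must also be enabled globally
     in yarn-site.xml with yarn.scheduler.fair.preemption = true. -->
<allocations>
  <queue name="spark">
    <weight>1.0</weight>
    <!-- Preempt containers from other queues once this queue has been below
         half of its fair share for 30 seconds. -->
    <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
  </queue>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
  <defaultFairSharePreemptionTimeout>60</defaultFairSharePreemptionTimeout>
</allocations>
```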

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

我的梦境 submitted on 2021-02-06 09:22:31
Question: I have a Spark dataframe ( prof_student_df ) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can only be matched with one student for a single time frame. For example, here are the pairings/scores for one time frame.
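
Since each time frame has only 4 professors and 4 students, one straightforward (non-recursive) way to get the maximum-score matching is brute force over the 4! = 24 permutations per timestamp. A minimal sketch, assuming prof_student_df has columns timestamp, professor, student, and score (names inferred from the description, not confirmed):

```python
# Minimal sketch, not the asker's solution: with only 4 professors and 4 students
# per time frame, brute force over the 4! = 24 permutations finds the optimal
# one-to-one matching. Column names (timestamp, professor, student, score) are
# assumed from the description; prof_student_df is the dataframe from the question.
from itertools import permutations

def best_pairing(rows):
    """Return (total_score, [(professor, student), ...]) for one time frame."""
    rows = list(rows)
    profs = sorted({r.professor for r in rows})
    studs = sorted({r.student for r in rows})
    score = {(r.professor, r.student): r.score for r in rows}
    best = None
    for perm in permutations(studs):
        pairs = list(zip(profs, perm))
        total = sum(score[p] for p in pairs)
        if best is None or total > best[0]:
            best = (total, pairs)
    return best

result = (prof_student_df.rdd
          .groupBy(lambda r: r.timestamp)
          .mapValues(best_pairing)
          .collect())
```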
