apache-spark

Performance decrease for huge amount of columns. Pyspark

Submitted by 可紊 on 2021-02-06 20:10:08
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). Task: create the wide DF via groupBy and pivot, transform the columns into a vector, and feed it into KMeans from pyspark.ml. So I built the wide frame, created the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans about 2 minutes for 7 different cluster counts on my PC in standalone mode, for a frame of ~500x9000. On the other hand, this processing …
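
A minimal sketch of the pipeline the question describes (groupBy/pivot into a wide frame, VectorAssembler, then pyspark.ml KMeans). The toy input data, column names, and the range of k values are stand-ins, not taken from the question.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("wide-kmeans-sketch").getOrCreate()

    # Toy long-format input: one row per (id, feature, value); stand-in for the real data.
    long_df = spark.createDataFrame(
        [(1, "f1", 1.0), (1, "f2", 2.0), (2, "f1", 3.0), (2, "f2", 4.0)],
        ["id", "feature", "value"],
    )

    # Wide frame via groupBy + pivot, as in the question (~9000 columns in the real case).
    wide = long_df.groupBy("id").pivot("feature").agg(F.first("value")).fillna(0.0)

    # Assemble all pivoted columns into a single vector column and cache the result.
    feature_cols = [c for c in wide.columns if c != "id"]
    assembled = (
        VectorAssembler(inputCols=feature_cols, outputCol="features")
        .transform(wide)
        .select("id", "features")
        .cache()
    )
    assembled.count()  # materialize the cache so assembly cost is paid only once

    # Train KMeans for several cluster counts (the question mentions 7 different k).
    models = {k: KMeans(k=k, featuresCol="features", seed=1).fit(assembled) for k in range(2, 9)}

Materializing the cached, assembled frame before the k loop keeps the expensive assembly step from being re-executed for every cluster count.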

Spark dataframe add new column with random data

Submitted by 淺唱寂寞╮ on 2021-02-06 16:07:05
Question: I want to add a new column to the dataframe with values consisting of either 0 or 1. I used the 'randint' function: from random import randint df1 = df.withColumn('isVal',randint(0,1)) But I get the following error: /spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn assert isinstance(col, Column), "col should be Column" AssertionError: col should be Column How can I use a custom function or the randint function to generate a random value for the column? Answer 1: You are using the Python builtin …
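
A hedged sketch of the direction the answer is heading: the builtin randint runs once on the driver and returns an int, not a Column, hence the assertion error. Either use Spark's own rand() column function or wrap randint in a udf. The DataFrame df below is just a stand-in.

    from random import randint
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5)  # stand-in for the question's dataframe

    # Spark-native: rand() is a Column of uniform [0, 1) values, so thresholding
    # at 0.5 gives 0 or 1 per row without leaving the JVM.
    df1 = df.withColumn("isVal", (F.rand(seed=42) >= 0.5).cast("int"))

    # udf route: wraps the builtin randint so it is evaluated per row on the executors.
    rand_udf = F.udf(lambda: randint(0, 1), IntegerType())
    df2 = df.withColumn("isVal", rand_udf())

    df1.show()
    df2.show()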

Limiting maximum size of dataframe partition

Submitted by 梦想的初衷 on 2021-02-06 15:47:46
Question: When I write out a dataframe to, say, CSV, a .csv file is created for each partition. Suppose I want to limit the maximum size of each file to, say, 1 MB. I could do the write multiple times and increase the argument to repartition each time. Is there a way to calculate ahead of time what argument to use for repartition to ensure the maximum size of each file is less than some specified size? I imagine there might be pathological cases where all the data ends up in one partition. So make the …
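
A rough sketch of the "calculate ahead of time" idea: estimate bytes per row from a sample, derive a partition count for the target file size, and repartition before the write. The size estimate, the 1 MB target, and the output path are illustrative assumptions, not an exact API.

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")  # stand-in data

    target_bytes = 1 * 1024 * 1024  # aim for roughly 1 MB per output file

    # Heuristic: render a sample as CSV text to estimate bytes per row.
    sample = df.limit(1000).toPandas()
    bytes_per_row = len(sample.to_csv(index=False).encode("utf-8")) / max(len(sample), 1)

    num_partitions = max(1, math.ceil(df.count() * bytes_per_row / target_bytes))
    df.repartition(num_partitions).write.mode("overwrite").csv("/tmp/out")  # illustrative path

repartition(n) hash-distributes rows roughly evenly, which avoids the pathological single-partition case, but this estimate does not give a hard byte cap per file. If a cap on rows (rather than bytes) per file is acceptable, the maxRecordsPerFile write option in Spark 2.2+ sidesteps the estimation entirely.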

Exception while deleting Spark temp dir in Windows 7 64 bit

Submitted by 梦想的初衷 on 2021-02-06 15:14:31
Question: I am trying to run unit tests of a Spark job on Windows 7 64-bit. I have HADOOP_HOME=D:/winutils and the winutils path D:/winutils/bin/winutils.exe. I ran the commands below: winutils ls \tmp\hive winutils chmod -R 777 \tmp\hive But when I run my test I get the error below: Running com.dnb.trade.ui.ingest.spark.utils.ExperiencesUtilTest Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.132 sec 17/01/24 15:37:53 INFO Remoting: Remoting shut down 17/01/24 15:37:53 ERROR ShutdownHookManager: …
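
The ERROR at shutdown is typically the known Windows issue where Spark cannot delete its temp directory after the shutdown hooks have already released it; note the tests themselves pass. Below is a hedged PySpark-style sketch of the usual mitigations; the D:/tmp/spark path is an assumption, and in the question's JVM-side tests the equivalent settings would go into SparkConf and log4j configuration.

    from pyspark.sql import SparkSession

    # Point Spark's scratch space at a directory whose permissions winutils has fixed,
    # mirroring the "winutils chmod -R 777" step from the question.
    spark = (
        SparkSession.builder
        .appName("windows-tempdir-workaround")
        .config("spark.local.dir", "D:/tmp/spark")  # assumed path, pre-created and chmod'd via winutils
        .getOrCreate()
    )

    # Silence the (benign) shutdown-hook deletion error so it stops polluting test output.
    jvm_log4j = spark.sparkContext._jvm.org.apache.log4j
    jvm_log4j.LogManager.getLogger("org.apache.spark.util.ShutdownHookManager") \
        .setLevel(jvm_log4j.Level.OFF)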

Join two RDD in spark

Submitted by 偶尔善良 on 2021-02-06 14:00:02
Question: I have two RDDs: one RDD has just one column and the other has two columns. To join the two RDDs on keys, I have added a dummy value of 0. Is there any more efficient way of doing this using join? val lines = sc.textFile("ml-100k/u.data") val movienamesfile = sc.textFile("Cml-100k/u.item") val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0)) val test = moviesid.map(x => x._1) val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1))) val shit = movienames.join …
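
The question's code is Scala; a PySpark rendering of the same idea is sketched below with tiny inline stand-ins for the MovieLens files. join() does require (key, value) pairs on both sides, so the dummy 0 is a legitimate device; if the names side is small, a broadcast lookup avoids the join's shuffle altogether.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Tiny stand-ins for ml-100k/u.data (tab-separated) and u.item (pipe-separated).
    ratings = sc.parallelize(["196\t242\t3\t881250949", "186\t302\t3\t891717742"])
    items = sc.parallelize(["242|Kolya (1996)|24-Jan-1997", "302|L.A. Confidential (1997)|01-Jan-1997"])

    movie_ids = ratings.map(lambda line: (line.split("\t")[1], 0))  # dummy 0, as in the question
    movie_names = items.map(lambda line: (line.split("|")[0], line.split("|")[1]))

    # join needs pair RDDs on both sides; each joined value is (dummy, name).
    joined = movie_ids.join(movie_names).mapValues(lambda pair: pair[1])

    # Alternative when the names side is small: broadcast it and skip the shuffle.
    name_lookup = sc.broadcast(dict(movie_names.collect()))
    named = ratings.map(lambda line: name_lookup.value.get(line.split("\t")[1]))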

MongoDB Spark Connector - aggregation is slow

Submitted by 天涯浪子 on 2021-02-06 12:50:36
Question: I am running the same aggregation pipeline with a Spark application and on the mongos console. On the console, the data is fetched in the blink of an eye, and only a second use of "it" is needed to retrieve all the expected data. The Spark application, however, takes almost two minutes according to the Spark WebUI. As you can see, 242 tasks are launched to fetch the result. I am not sure why such a high number of tasks is launched when only 40 documents are returned by the …
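
A hedged sketch of the usual levers for this: let the connector push the filter down to MongoDB as a $match stage and make its partitioner produce fewer, larger partitions so far fewer than 242 tasks are launched. The option names follow the MongoDB Spark Connector 2.x/3.x configuration (spark.mongodb.input.uri, partitioner, partitionSizeMB) as I recall it; the URI, database, collection, and field names are assumptions, so adjust them to your connector version and schema.

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("mongo-aggregation-sketch")
        # Assumed connection string for the mongos router and target namespace.
        .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycoll")
        # Fewer, larger partitions means fewer Spark tasks for a small result set.
        .config("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
        .config("spark.mongodb.input.partitionerOptions.partitionSizeMB", "128")
        .getOrCreate()
    )

    df = spark.read.format("mongo").load()

    # DataFrame filters are pushed down to MongoDB as a $match stage, so only the
    # handful of matching documents (about 40 in the question) cross the wire.
    result = df.filter(F.col("status") == "ACTIVE").groupBy("category").count()
    result.show()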