apache-spark

Performance decrease for huge amount of columns. Pyspark

Submitted by 可紊 on 2021-02-06 20:10:08
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). Task: create the wide DF via groupBy and pivot, transform the columns into a vector, and feed it into KMeans from pyspark.ml. So I built the wide frame, created the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans about 2 minutes for 7 different cluster counts on my PC in standalone mode, for a frame of ~500x9000. On the other hand, this processing …
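
A minimal sketch of the pipeline the question describes (groupBy/pivot into a wide frame, VectorAssembler, then pyspark.ml KMeans). The toy input data, column names, and the range of k values are stand-ins, not taken from the question.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("wide-kmeans-sketch").getOrCreate()

    # Toy long-format input: one row per (id, feature, value); stand-in for the real data.
    long_df = spark.createDataFrame(
        [(1, "f1", 1.0), (1, "f2", 2.0), (2, "f1", 3.0), (2, "f2", 4.0)],
        ["id", "feature", "value"],
    )

    # Wide frame via groupBy + pivot, as in the question (~9000 columns in the real case).
    wide = long_df.groupBy("id").pivot("feature").agg(F.first("value")).fillna(0.0)

    # Assemble all pivoted columns into a single vector column and cache the result.
    feature_cols = [c for c in wide.columns if c != "id"]
    assembled = (
        VectorAssembler(inputCols=feature_cols, outputCol="features")
        .transform(wide)
        .select("id", "features")
        .cache()
    )
    assembled.count()  # materialize the cache so assembly cost is paid only once

    # Train KMeans for several cluster counts (the question mentions 7 different k).
    models = {k: KMeans(k=k, featuresCol="features", seed=1).fit(assembled) for k in range(2, 9)}

Materializing the cached, assembled frame before the k loop keeps the expensive assembly step from being re-executed for every cluster count.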

Spark dataframe add new column with random data

Submitted by 淺唱寂寞╮ on 2021-02-06 16:07:05
Question: I want to add a new column to the dataframe with values consisting of either 0 or 1. I used the 'randint' function: from random import randint df1 = df.withColumn('isVal',randint(0,1)) But I get the following error: /spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn assert isinstance(col, Column), "col should be Column" AssertionError: col should be Column How can I use a custom function or the randint function to generate a random value for the column? Answer 1: You are using the Python builtin …
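
A hedged sketch of the direction the answer is heading: the builtin randint runs once on the driver and returns an int, not a Column, hence the assertion error. Either use Spark's own rand() column function or wrap randint in a udf. The DataFrame df below is just a stand-in.

    from random import randint
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5)  # stand-in for the question's dataframe

    # Spark-native: rand() is a Column of uniform [0, 1) values, so thresholding
    # at 0.5 gives 0 or 1 per row without leaving the JVM.
    df1 = df.withColumn("isVal", (F.rand(seed=42) >= 0.5).cast("int"))

    # udf route: wraps the builtin randint so it is evaluated per row on the executors.
    rand_udf = F.udf(lambda: randint(0, 1), IntegerType())
    df2 = df.withColumn("isVal", rand_udf())

    df1.show()
    df2.show()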

Limiting maximum size of dataframe partition

Submitted by 梦想的初衷 on 2021-02-06 15:47:46
Question: When I write out a dataframe to, say, CSV, a .csv file is created for each partition. Suppose I want to limit the maximum size of each file to, say, 1 MB. I could do the write multiple times and increase the argument to repartition each time. Is there a way to calculate ahead of time what argument to use for repartition to ensure the maximum size of each file is less than some specified size? I imagine there might be pathological cases where all the data ends up in one partition. So make the …
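
A rough sketch of the "calculate ahead of time" idea: estimate bytes per row from a sample, derive a partition count for the target file size, and repartition before the write. The size estimate, the 1 MB target, and the output path are illustrative assumptions, not an exact API.

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")  # stand-in data

    target_bytes = 1 * 1024 * 1024  # aim for roughly 1 MB per output file

    # Heuristic: render a sample as CSV text to estimate bytes per row.
    sample = df.limit(1000).toPandas()
    bytes_per_row = len(sample.to_csv(index=False).encode("utf-8")) / max(len(sample), 1)

    num_partitions = max(1, math.ceil(df.count() * bytes_per_row / target_bytes))
    df.repartition(num_partitions).write.mode("overwrite").csv("/tmp/out")  # illustrative path

repartition(n) hash-distributes rows roughly evenly, which avoids the pathological single-partition case, but this estimate does not give a hard byte cap per file. If a cap on rows (rather than bytes) per file is acceptable, the maxRecordsPerFile write option in Spark 2.2+ sidesteps the estimation entirely.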

Exception while deleting Spark temp dir in Windows 7 64 bit

Submitted by 梦想的初衷 on 2021-02-06 15:14:31
Question: I am trying to run unit tests of a Spark job on Windows 7 64-bit. I have HADOOP_HOME=D:/winutils and the winutils path D:/winutils/bin/winutils.exe. I ran the commands below: winutils ls \tmp\hive winutils chmod -R 777 \tmp\hive But when I run my test I get the error below: Running com.dnb.trade.ui.ingest.spark.utils.ExperiencesUtilTest Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.132 sec 17/01/24 15:37:53 INFO Remoting: Remoting shut down 17/01/24 15:37:53 ERROR ShutdownHookManager: …
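
The ERROR at shutdown is typically the known Windows issue where Spark cannot delete its temp directory after the shutdown hooks have already released it; note the tests themselves pass. Below is a hedged PySpark-style sketch of the usual mitigations; the D:/tmp/spark path is an assumption, and in the question's JVM-side tests the equivalent settings would go into SparkConf and log4j configuration.

    from pyspark.sql import SparkSession

    # Point Spark's scratch space at a directory whose permissions winutils has fixed,
    # mirroring the "winutils chmod -R 777" step from the question.
    spark = (
        SparkSession.builder
        .appName("windows-tempdir-workaround")
        .config("spark.local.dir", "D:/tmp/spark")  # assumed path, pre-created and chmod'd via winutils
        .getOrCreate()
    )

    # Silence the (benign) shutdown-hook deletion error so it stops polluting test output.
    jvm_log4j = spark.sparkContext._jvm.org.apache.log4j
    jvm_log4j.LogManager.getLogger("org.apache.spark.util.ShutdownHookManager") \
        .setLevel(jvm_log4j.Level.OFF)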

Join two RDD in spark

Submitted by 偶尔善良 on 2021-02-06 14:00:02
Question: I have two RDDs: one RDD has just one column and the other has two columns. To join the two RDDs on keys, I have added a dummy value of 0. Is there any more efficient way of doing this using join? val lines = sc.textFile("ml-100k/u.data") val movienamesfile = sc.textFile("Cml-100k/u.item") val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0)) val test = moviesid.map(x => x._1) val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1))) val shit = movienames.join …
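
The question's code is Scala; a PySpark rendering of the same idea is sketched below with tiny inline stand-ins for the MovieLens files. join() does require (key, value) pairs on both sides, so the dummy 0 is a legitimate device; if the names side is small, a broadcast lookup avoids the join's shuffle altogether.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Tiny stand-ins for ml-100k/u.data (tab-separated) and u.item (pipe-separated).
    ratings = sc.parallelize(["196\t242\t3\t881250949", "186\t302\t3\t891717742"])
    items = sc.parallelize(["242|Kolya (1996)|24-Jan-1997", "302|L.A. Confidential (1997)|01-Jan-1997"])

    movie_ids = ratings.map(lambda line: (line.split("\t")[1], 0))  # dummy 0, as in the question
    movie_names = items.map(lambda line: (line.split("|")[0], line.split("|")[1]))

    # join needs pair RDDs on both sides; each joined value is (dummy, name).
    joined = movie_ids.join(movie_names).mapValues(lambda pair: pair[1])

    # Alternative when the names side is small: broadcast it and skip the shuffle.
    name_lookup = sc.broadcast(dict(movie_names.collect()))
    named = ratings.map(lambda line: name_lookup.value.get(line.split("\t")[1]))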

MongoDB Spark Connector - aggregation is slow

Submitted by 天涯浪子 on 2021-02-06 12:50:36
Question: I am running the same aggregation pipeline with a Spark application and on the mongos console. On the console, the data is fetched in the blink of an eye, and only a second use of "it" is needed to retrieve all the expected data. The Spark application, however, takes almost two minutes according to the Spark WebUI. As you can see, 242 tasks are launched to fetch the result. I am not sure why such a high number of tasks is launched when only 40 documents are returned by the …
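
A hedged sketch of the usual levers for this: let the connector push the filter down to MongoDB as a $match stage and make its partitioner produce fewer, larger partitions so far fewer than 242 tasks are launched. The option names follow the MongoDB Spark Connector 2.x/3.x configuration (spark.mongodb.input.uri, partitioner, partitionSizeMB) as I recall it; the URI, database, collection, and field names are assumptions, so adjust them to your connector version and schema.

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("mongo-aggregation-sketch")
        # Assumed connection string for the mongos router and target namespace.
        .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycoll")
        # Fewer, larger partitions means fewer Spark tasks for a small result set.
        .config("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
        .config("spark.mongodb.input.partitionerOptions.partitionSizeMB", "128")
        .getOrCreate()
    )

    df = spark.read.format("mongo").load()

    # DataFrame filters are pushed down to MongoDB as a $match stage, so only the
    # handful of matching documents (about 40 in the question) cross the wire.
    result = df.filter(F.col("status") == "ACTIVE").groupBy("category").count()
    result.show()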