pyspark

Junk Spark output file on S3 with dollar signs

谁说胖子不能爱 submitted on 2019-12-25 07:59:51
Question: I have a simple Spark job that reads a file from S3, takes the first five records, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$. What is it, and how can I prevent Spark from creating it? Here is some code to show what I am doing:

    x = spark.sparkContext.textFile("s3n://.../0000_part_00")
    five = x.take(5)
    five = spark.sparkContext.parallelize(five)
    five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions
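
The _$folder$ keys are directory placeholders written by the Hadoop s3n connector rather than by the job's own logic. One common workaround, sketched below under the assumption that a post-job cleanup step is acceptable, is to delete those marker keys after the write finishes; the bucket name comes from the question, and the prefix is an assumption based on the truncated output path.

    import boto3

    # Delete the directory-marker keys the s3n connector leaves next to the output prefix.
    s3 = boto3.client("s3")
    bucket = "prod.casumo.stu"                  # bucket taken from the question
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="dimensions")
    for obj in resp.get("Contents", []):
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])

Switching the URI scheme from s3n to the newer s3a connector is also frequently suggested, since it handles directory markers differently.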

RDD to multidimensional array

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 07:55:09
Question: I am using Spark's Python API and I am finding a few matrix operations challenging. My RDD is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sq_root(n) x sq_root(n)? For example,

    Vec = [1,2,3,4,5,6,7,8,9]

and the desired 3 x 3 output is

    [[1,2,3]
     [4,5,6]
     [7,8,9]]

Is there an equivalent to reshape in NumPy? Conditions: n (> 50 million) is huge, which rules out using .collect(), and can this process be made to run on multiple
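
A minimal sketch of one distributed way to do this, assuming the RDD preserves the vector's order and that n is a perfect square: give every element a global index, derive a row number from it, and group by row.

    import math
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    vec = sc.parallelize(range(1, 10))       # stand-in for the real 50M-element RDD
    n = vec.count()
    dim = int(math.sqrt(n))                  # sq_root(n)

    rows = (vec.zipWithIndex()                                # (value, global position)
               .map(lambda vi: (vi[1] // dim, (vi[1] % dim, vi[0])))
               .groupByKey()                                  # one group per row index
               .sortByKey()
               .map(lambda kv: [v for _, v in sorted(kv[1])]))

    print(rows.collect())    # only sensible for the toy vector; the real matrix stays distributed

Each row here has to fit in one executor's memory, which is fine for sqrt(50 million) ≈ 7,000 elements per row; for much wider rows a block-matrix representation such as pyspark.mllib.linalg.distributed.BlockMatrix would be a safer target.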

Spark Installation and Configuration on MacOS ImportError: No module named pyspark

 ̄綄美尐妖づ submitted on 2019-12-25 07:46:22
Question: I'm trying to configure apache-spark on MacOS. All the online guides say to either download the Spark tarball and set some environment variables, or to use brew install apache-spark and then set some environment variables. I installed apache-spark using brew install apache-spark. When I run pyspark in the terminal I get a Python prompt, which suggests that the installation was successful. But when I try to import pyspark in my Python file, I get an error saying ImportError: No module named
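
A plain python interpreter does not know where Homebrew placed Spark's Python bindings, so the import fails even though the pyspark launcher works. One commonly used workaround is sketched below; findspark is an extra pip package, and the Homebrew path is an assumption about a typical install location.

    # pip install findspark
    import findspark

    # Point findspark at SPARK_HOME; drop the argument if SPARK_HOME is already exported.
    findspark.init("/usr/local/opt/apache-spark/libexec")   # assumed Homebrew location

    import pyspark
    print(pyspark.__version__)

The equivalent manual fix is to export SPARK_HOME and put $SPARK_HOME/python plus the bundled py4j zip on PYTHONPATH.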

Is there a way to persist or save the pipeline model in pyspark 1.6?

…衆ロ難τιáo~ submitted on 2019-12-25 07:44:07
Question: I understand that this is a duplicate of the question asked here: saving pipeline model in pyspark 1.6, but there is still no definitive answer to it. Can anyone please suggest anything? joblib and cPickle don't work; they give the same error reported in the linked question. Is there a way to save the pipeline in PySpark 1.6, or isn't there? The questions I found about model persistence were mainly about persisting ML models; saving a pipeline is an altogether different issue.
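
For context rather than as a 1.6 fix: pipeline persistence was only exposed to Python in later releases, so the sketch below shows the save/load calls available from Spark 2.0 onward, with an illustrative pipeline and path; train_df is an assumed DataFrame.

    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(train_df)          # train_df: assumed DataFrame with text/label columns

    model.write().overwrite().save("/tmp/pipeline_model")    # illustrative path
    reloaded = PipelineModel.load("/tmp/pipeline_model")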

How to use spark-submit configuration (--jars, --packages) in cluster mode?

眉间皱痕 submitted on 2019-12-25 07:26:50
Question: When using spark-submit in cluster mode (yarn-cluster), the jars and packages configuration confuses me. For jars, I can put them in HDFS instead of a local directory, but packages are built with Maven, so pointing them at HDFS can't work. My invocation looks like this:

    spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2
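
A sketch of the same dependency expressed as a Maven coordinate rather than a jar path (the coordinate matches the driver version in the question but is still an assumption): --packages and the spark.jars.packages property are resolved from a Maven repository on each node, so nothing needs to be staged on HDFS. Setting the property programmatically only helps if it happens before the JVM starts, so passing --packages mysql:mysql-connector-java:5.1.39 to spark-submit is the more reliable route.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cluster-mode-packages-sketch")
             # Maven coordinate, resolved at submit time; no HDFS copy required.
             .config("spark.jars.packages", "mysql:mysql-connector-java:5.1.39")
             .config("spark.mongodb.input.uri",
                     "mongodb://192.168.27.234/test.myCollection2")
             .getOrCreate())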

How to train a SparkML gradient boosting classifier given an RDD

北慕城南 submitted on 2019-12-25 07:21:15
Question: Given the following rdd

    training_rdd = rdd.select(
        # Categorical features
        col('device_os'),   # 'ios', 'android'
        # Numeric features
        col('30day_click_count'),
        col('30day_impression_count'),
        np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
        # label
        col('did_click').alias('label')
    )

I am confused about the syntax to train a gradient boosting classifier. I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification
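
A minimal sketch, assuming training_rdd above is really a DataFrame with those columns: spark.ml estimators expect a single vector column, so the categorical column is indexed and everything is assembled into a features column before the GBTClassifier is fitted. The column names repeat the question; the pipeline wiring itself is only illustrative.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import GBTClassifier

    indexer = StringIndexer(inputCol="device_os", outputCol="device_os_idx")
    assembler = VectorAssembler(
        inputCols=["device_os_idx", "30day_click_count",
                   "30day_impression_count", "30day_click_through_rate"],
        outputCol="features")
    gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=20)

    model = Pipeline(stages=[indexer, assembler, gbt]).fit(training_rdd)
    predictions = model.transform(training_rdd)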

Data load time when using Spark with Oracle

拜拜、爱过 submitted on 2019-12-25 06:23:12
Question: I am trying to load data from Oracle into Spark in a Jupyter notebook, but each time I try to plot a graph the time taken is huge. How do I make it faster?

    query = "(select * from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate) )"
    %time df = sqlContext.read.format('jdbc').options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",dbtable=query,driver="oracle.jdbc.OracleDriver").load()

Now I group by node:

    %time fo_node = df.select('NODE').groupBy('NODE').count().sort(
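
A sketch of one common mitigation, with the partitioning column and bounds as pure assumptions about the schema: by default the JDBC source pulls the whole result set through a single connection into a single partition, so parallelising the read and caching the result usually dominates the plotting time.

    df = (sqlContext.read.format("jdbc")
          .options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",
                   dbtable=query,
                   driver="oracle.jdbc.OracleDriver",
                   partitionColumn="NODE_ID",     # assumed numeric column to split the read on
                   lowerBound="1",
                   upperBound="1000000",
                   numPartitions="8")
          .load()
          .cache())                               # re-used by every subsequent groupBy/plot

Pushing work into Oracle also helps, for example selecting only the needed columns or grouping inside the dbtable subquery, so less data crosses the JDBC connection.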

What exactly is the initializationSteps parameter in KMeans++ in Spark MLlib?

可紊 submitted on 2019-12-25 05:51:21
Question: I know what k-means is and I also understand what the k-means++ algorithm is. I believe the only change is the way the initial K centers are found: in the ++ version we initially choose one center and, using a probability distribution, choose the remaining k-1 centers. What is the initializationSteps parameter in the MLlib algorithm for k-means?

Answer 1: To be precise, k-means++ is an algorithm for choosing initial centers; it doesn't describe a whole training process. MLlib k-means is using k-means
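
A small sketch of where the parameter surfaces in the Python APIs (k, the step count, and the input RDD/DataFrame are illustrative): in the RDD-based MLlib API it is called initializationSteps, in the DataFrame-based spark.ml API it is initSteps, and in both cases it is the number of rounds the parallel k-means|| initialization runs before the regular Lloyd iterations start.

    from pyspark.mllib.clustering import KMeans as MLlibKMeans
    from pyspark.ml.clustering import KMeans as MLKMeans

    # RDD-based API: points_rdd is an assumed RDD of feature vectors.
    mllib_model = MLlibKMeans.train(points_rdd, k=10,
                                    initializationMode="k-means||",
                                    initializationSteps=5)

    # DataFrame-based API: features_df is an assumed DataFrame with a "features" column.
    ml_model = MLKMeans(k=10, initMode="k-means||", initSteps=5).fit(features_df)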

Error message when launching PySpark from Jupyter notebook on Windows

眉间皱痕 submitted on 2019-12-25 05:09:05
Question: This same approach to running Apache Spark from Jupyter used to work, but now it is throwing Exception: Java gateway process exited before sending the driver its port number. Here is the configuration in the Jupyter notebook that was working previously:

    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', None)
    print(spark_home)
    spark_home = spark_home + "/python"
    sys.path.insert(0, spark_home)
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
    filename = os.path
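
A sketch of one frequently reported fix, with the Java path purely as an assumption about a local install: the "Java gateway process exited" error usually means the JVM never started, most often because JAVA_HOME is missing or stale, or because PYSPARK_SUBMIT_ARGS does not end with pyspark-shell.

    import os

    # Both values are assumptions about a typical Windows setup, shown only to
    # illustrate which variables the gateway start-up depends on.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_144"
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

    import pyspark
    sc = pyspark.SparkContext(appName="gateway-check")
    print(sc.version)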