pyspark

Junk Spark output file on S3 with dollar signs

谁说胖子不能爱 submitted on 2019-12-25 07:59:51
Question: I have a simple Spark job that reads a file from S3, takes the first five records, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$. What is it, and how can I prevent Spark from creating it? Here is some code to show what I am doing:

    x = spark.sparkContext.textFile("s3n://.../0000_part_00")
    five = x.take(5)
    five = spark.sparkContext.parallelize(five)
    five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions
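
The _$folder$ keys are directory placeholders written by the Hadoop s3n connector rather than by the job's own logic. One common workaround, sketched below under the assumption that a post-job cleanup step is acceptable, is to delete those marker keys after the write finishes; the bucket name comes from the question, and the prefix is an assumption based on the truncated output path.

    import boto3

    # Delete the directory-marker keys the s3n connector leaves next to the output prefix.
    s3 = boto3.client("s3")
    bucket = "prod.casumo.stu"                  # bucket taken from the question
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="dimensions")
    for obj in resp.get("Contents", []):
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])

Switching the URI scheme from s3n to the newer s3a connector is also frequently suggested, since it handles directory markers differently.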

RDD to multidimensional array

倾然丶 夕夏残阳落幕 submitted on 2019-12-25 07:55:09
Question: I am using Spark's Python API and I am finding a few matrix operations challenging. My RDD is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sq_root(n) x sq_root(n)? For example,

    Vec = [1,2,3,4,5,6,7,8,9]

and the desired 3 x 3 output is

    [[1,2,3]
     [4,5,6]
     [7,8,9]]

Is there an equivalent to reshape in NumPy? Conditions: n (> 50 million) is huge, which rules out using .collect(), and can this process be made to run on multiple
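
A minimal sketch of one distributed way to do this, assuming the RDD preserves the vector's order and that n is a perfect square: give every element a global index, derive a row number from it, and group by row.

    import math
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    vec = sc.parallelize(range(1, 10))       # stand-in for the real 50M-element RDD
    n = vec.count()
    dim = int(math.sqrt(n))                  # sq_root(n)

    rows = (vec.zipWithIndex()                                # (value, global position)
               .map(lambda vi: (vi[1] // dim, (vi[1] % dim, vi[0])))
               .groupByKey()                                  # one group per row index
               .sortByKey()
               .map(lambda kv: [v for _, v in sorted(kv[1])]))

    print(rows.collect())    # only sensible for the toy vector; the real matrix stays distributed

Each row here has to fit in one executor's memory, which is fine for sqrt(50 million) ≈ 7,000 elements per row; for much wider rows a block-matrix representation such as pyspark.mllib.linalg.distributed.BlockMatrix would be a safer target.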

Spark Installation and Configuration on MacOS ImportError: No module named pyspark

 ̄綄美尐妖づ submitted on 2019-12-25 07:46:22
Question: I'm trying to configure apache-spark on MacOS. All the online guides say to either download the Spark tarball and set some environment variables, or to use brew install apache-spark and then set some environment variables. I installed apache-spark using brew install apache-spark. When I run pyspark in the terminal I get a Python prompt, which suggests that the installation was successful. But when I try to import pyspark in my Python file, I get an error saying ImportError: No module named
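
A plain python interpreter does not know where Homebrew placed Spark's Python bindings, so the import fails even though the pyspark launcher works. One commonly used workaround is sketched below; findspark is an extra pip package, and the Homebrew path is an assumption about a typical install location.

    # pip install findspark
    import findspark

    # Point findspark at SPARK_HOME; drop the argument if SPARK_HOME is already exported.
    findspark.init("/usr/local/opt/apache-spark/libexec")   # assumed Homebrew location

    import pyspark
    print(pyspark.__version__)

The equivalent manual fix is to export SPARK_HOME and put $SPARK_HOME/python plus the bundled py4j zip on PYTHONPATH.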

Is there a way to persist or save the pipeline model in pyspark 1.6?

…衆ロ難τιáo~ submitted on 2019-12-25 07:44:07
Question: I understand that this is a duplicate of the question asked here: saving pipeline model in pyspark 1.6, but there is still no definitive answer to it. Can anyone please suggest anything? joblib and cPickle don't work; they give the same error reported in the linked question. Is there a way to save the pipeline in PySpark 1.6, or isn't there? The questions I found about model persistence were mainly about persisting ML models; saving a pipeline is an altogether different issue.
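
For context rather than as a 1.6 fix: pipeline persistence was only exposed to Python in later releases, so the sketch below shows the save/load calls available from Spark 2.0 onward, with an illustrative pipeline and path; train_df is an assumed DataFrame.

    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])
    model = pipeline.fit(train_df)          # train_df: assumed DataFrame with text/label columns

    model.write().overwrite().save("/tmp/pipeline_model")    # illustrative path
    reloaded = PipelineModel.load("/tmp/pipeline_model")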

How to use spark-submit configuration (--jars, --packages) in cluster mode?

眉间皱痕 submitted on 2019-12-25 07:26:50
Question: When using spark-submit in cluster mode (yarn-cluster), the jars and packages configuration confuses me. For jars, I can put them in HDFS instead of a local directory, but packages are built with Maven, so pointing them at HDFS can't work. My invocation looks like this:

    spark-submit --jars hdfs:///mysql-connector-java-5.1.39-bin.jar --driver-class-path /home/liac/test/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar --conf "spark.mongodb.input.uri=mongodb://192.168.27.234/test.myCollection2
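
A sketch of the same dependency expressed as a Maven coordinate rather than a jar path (the coordinate matches the driver version in the question but is still an assumption): --packages and the spark.jars.packages property are resolved from a Maven repository on each node, so nothing needs to be staged on HDFS. Setting the property programmatically only helps if it happens before the JVM starts, so passing --packages mysql:mysql-connector-java:5.1.39 to spark-submit is the more reliable route.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cluster-mode-packages-sketch")
             # Maven coordinate, resolved at submit time; no HDFS copy required.
             .config("spark.jars.packages", "mysql:mysql-connector-java:5.1.39")
             .config("spark.mongodb.input.uri",
                     "mongodb://192.168.27.234/test.myCollection2")
             .getOrCreate())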

How to train a SparkML gradient boosting classifier given an RDD

北慕城南 submitted on 2019-12-25 07:21:15
Question: Given the following rdd

    training_rdd = rdd.select(
        # Categorical features
        col('device_os'),   # 'ios', 'android'
        # Numeric features
        col('30day_click_count'),
        col('30day_impression_count'),
        np.true_divide(col('30day_click_count'), col('30day_impression_count')).alias('30day_click_through_rate'),
        # label
        col('did_click').alias('label')
    )

I am confused about the syntax to train a gradient boosting classifier. I am following this tutorial: https://spark.apache.org/docs/latest/ml-classification
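
A minimal sketch, assuming training_rdd above is really a DataFrame with those columns: spark.ml estimators expect a single vector column, so the categorical column is indexed and everything is assembled into a features column before the GBTClassifier is fitted. The column names repeat the question; the pipeline wiring itself is only illustrative.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import GBTClassifier

    indexer = StringIndexer(inputCol="device_os", outputCol="device_os_idx")
    assembler = VectorAssembler(
        inputCols=["device_os_idx", "30day_click_count",
                   "30day_impression_count", "30day_click_through_rate"],
        outputCol="features")
    gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=20)

    model = Pipeline(stages=[indexer, assembler, gbt]).fit(training_rdd)
    predictions = model.transform(training_rdd)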

Data load time when using Spark with Oracle

拜拜、爱过 submitted on 2019-12-25 06:23:12
Question: I am trying to load data from Oracle into Spark in a Jupyter notebook, but each time I try to plot a graph the time taken is huge. How do I make it faster?

    query = "(select * from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate) )"
    %time df = sqlContext.read.format('jdbc').options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",dbtable=query,driver="oracle.jdbc.OracleDriver").load()

Now I group by node:

    %time fo_node = df.select('NODE').groupBy('NODE').count().sort(
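
A sketch of one common mitigation, with the partitioning column and bounds as pure assumptions about the schema: by default the JDBC source pulls the whole result set through a single connection into a single partition, so parallelising the read and caching the result usually dominates the plotting time.

    df = (sqlContext.read.format("jdbc")
          .options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",
                   dbtable=query,
                   driver="oracle.jdbc.OracleDriver",
                   partitionColumn="NODE_ID",     # assumed numeric column to split the read on
                   lowerBound="1",
                   upperBound="1000000",
                   numPartitions="8")
          .load()
          .cache())                               # re-used by every subsequent groupBy/plot

Pushing work into Oracle also helps, for example selecting only the needed columns or grouping inside the dbtable subquery, so less data crosses the JDBC connection.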

What exactly is the initializationSteps parameter in KMeans++ in Spark MLlib?

可紊 submitted on 2019-12-25 05:51:21
Question: I know what k-means is and I also understand what the k-means++ algorithm is. I believe the only change is the way the initial K centers are found: in the ++ version we initially choose one center and, using a probability distribution, choose the remaining k-1 centers. What is the initializationSteps parameter in the MLlib algorithm for k-means?

Answer 1: To be precise, k-means++ is an algorithm for choosing initial centers; it doesn't describe a whole training process. MLlib k-means is using k-means
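
A small sketch of where the parameter surfaces in the Python APIs (k, the step count, and the input RDD/DataFrame are illustrative): in the RDD-based MLlib API it is called initializationSteps, in the DataFrame-based spark.ml API it is initSteps, and in both cases it is the number of rounds the parallel k-means|| initialization runs before the regular Lloyd iterations start.

    from pyspark.mllib.clustering import KMeans as MLlibKMeans
    from pyspark.ml.clustering import KMeans as MLKMeans

    # RDD-based API: points_rdd is an assumed RDD of feature vectors.
    mllib_model = MLlibKMeans.train(points_rdd, k=10,
                                    initializationMode="k-means||",
                                    initializationSteps=5)

    # DataFrame-based API: features_df is an assumed DataFrame with a "features" column.
    ml_model = MLKMeans(k=10, initMode="k-means||", initSteps=5).fit(features_df)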

Error message when launching PySpark from Jupyter notebook on Windows

眉间皱痕 submitted on 2019-12-25 05:09:05
Question: This same approach to running Apache Spark from Jupyter used to work, but now it is throwing Exception: Java gateway process exited before sending the driver its port number. Here is the configuration in the Jupyter notebook that was working previously:

    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', None)
    print(spark_home)
    spark_home = spark_home + "/python"
    sys.path.insert(0, spark_home)
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
    filename = os.path
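
A sketch of one frequently reported fix, with the Java path purely as an assumption about a local install: the "Java gateway process exited" error usually means the JVM never started, most often because JAVA_HOME is missing or stale, or because PYSPARK_SUBMIT_ARGS does not end with pyspark-shell.

    import os

    # Both values are assumptions about a typical Windows setup, shown only to
    # illustrate which variables the gateway start-up depends on.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_144"
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

    import pyspark
    sc = pyspark.SparkContext(appName="gateway-check")
    print(sc.version)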