apache-spark

Setting “spark.memory.storageFraction” in Spark does not work

拥有回忆 submitted on 2021-02-19 02:08:41

Question: I am trying to tune Spark's memory parameters. I tried:

sparkSession.conf.set("spark.memory.storageFraction", "0.1") // sparkSession has already been created

After I submitted the job and checked the Spark UI, "Storage Memory" was still the same as before, so the setting did not take effect. What is the correct way to set "spark.memory.storageFraction"? I am using Spark 2.0.

Answer 1: I faced the same problem. After reading some of the Spark source code on GitHub, I think the "Storage Memory" shown on the Spark UI is misleading; it does not indicate the
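Whatever the UI shows, memory fractions are only read when the executors start, so one thing to verify is that the value is set before the SparkSession is created rather than through sparkSession.conf.set afterwards. A minimal sketch using standard Spark 2.x configuration keys; the app name and values here are made up for illustration:

import org.apache.spark.sql.SparkSession

// Set memory fractions at build time; changing them on a live session has no effect.
val spark = SparkSession.builder()
  .appName("storage-fraction-demo")                 // hypothetical app name
  .master("local[*]")                               // for local testing; drop when using spark-submit
  .config("spark.memory.fraction", "0.6")           // heap share for execution + storage
  .config("spark.memory.storageFraction", "0.1")    // share of that region reserved for storage
  .getOrCreate()

println(spark.conf.get("spark.memory.storageFraction"))

The same keys can also be passed with --conf on spark-submit, which has the same effect.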

Pyspark command not recognised

断了今生、忘了曾经 submitted on 2021-02-19 01:17:47

Question: I have Anaconda installed and I have also downloaded Spark 1.6.2. I am following the instructions from this answer to configure Spark for Jupyter. I downloaded and unzipped the Spark directory as ~/spark. Now when I cd into this directory and then into bin, I see the following:

SFOM00618927A:spark $ cd bin
SFOM00618927A:bin $ ls
beeline pyspark run-example.cmd spark-class2.cmd spark-sql sparkR
beeline.cmd pyspark.cmd run-example2.cmd spark-shell spark-submit

Java & Spark : add unique incremental id to dataset

浪尽此生 submitted on 2021-02-19 00:47:35

Question: With Spark and Java, I am trying to add an Integer identifier column to an existing Dataset[Row] that has n columns. I successfully added an id with zipWithUniqueId(), with zipWithIndex, and even with monotonically_increasing_id(), but none of them is satisfactory. Example: I have a dataset with 195 rows. When I use one of these three methods, I get ids like 1584156487 or 12036, and those ids are not contiguous. What I need/want is rather simple: an Integer id column, whose value
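For contiguous 1..n ids, one common approach is row_number() over an un-partitioned window. A minimal Scala sketch (the question uses Java, but the Dataset API is the same; the input DataFrame below is a made-up stand-in for the 195-row dataset):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().appName("contiguous-id-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the existing Dataset[Row] from the question.
val df = Seq("a", "b", "c").toDF("value")

// row_number() over a window with no partitioning yields contiguous 1, 2, 3, ... ids;
// the orderBy column decides which row gets which id, and all rows pass through a
// single partition while the numbering is computed.
val withId = df.withColumn("id", row_number().over(Window.orderBy("value")))

withId.show()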

Spark.ml regressions do not calculate same models as scikit-learn

允我心安 submitted on 2021-02-18 22:09:54

Question: I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, and I can't figure out why (the data is the same, the model type is the same, the regularization is the same...). No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit-learn or spark.ml so that it finds the same model as its counterpart? I give the sklearn code and the spark.ml code below. Both should be ready to cut
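The usual suspects are the regularization parameterization (scikit-learn's C maps roughly to 1 / (n_samples * regParam) for L2), intercept fitting, feature standardization (spark.ml standardizes features by default), and convergence tolerance. A sketch of the spark.ml knobs that typically have to be aligned, with illustrative values rather than values taken from the question:

import org.apache.spark.ml.classification.LogisticRegression

// Settings that commonly explain sklearn-vs-spark.ml divergence.
val lr = new LogisticRegression()
  .setRegParam(0.01)            // sklearn's C is roughly 1 / (n_samples * regParam) for L2
  .setElasticNetParam(0.0)      // 0.0 = pure L2, matching sklearn's default penalty
  .setStandardization(false)    // spark.ml standardizes features by default; sklearn does not
  .setFitIntercept(true)        // match sklearn's fit_intercept=True
  .setTol(1e-8)                 // tighten tolerance so both solvers converge fully
  .setMaxIter(1000)

// lr.fit(trainingDf) can then be compared coefficient-by-coefficient with the sklearn model
// (trainingDf is a hypothetical DataFrame with "features" and "label" columns).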

How to insert a custom function within For loop in pyspark?

﹥>﹥吖頭↗ submitted on 2021-02-18 19:41:53

Question: I am facing a challenge in Spark within Azure Databricks. I have a dataset such as:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to use a loop function to

Spark Streaming MQTT

心已入冬 submitted on 2021-02-18 19:12:20

Question: I've been using Spark to stream data from Kafka and it's pretty easy. I thought using the MQTT utils would also be easy, but for some reason it is not. I'm trying to execute the following piece of code:

val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])
MQTTUtils.createStream(ssc, "tcp://localhost
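For reference, MQTTUtils.createStream takes the streaming context, a broker URL, and a topic. A minimal receiver sketch, assuming the spark-streaming-mqtt (later Apache Bahir) artifact is on the classpath; the broker URL and topic name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

// Use at least two local threads: the MQTT receiver occupies one core by itself.
val sparkConf = new SparkConf(true).setAppName("mqttStream").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// createStream(ssc, brokerUrl, topic) returns a DStream[String] of message payloads.
val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "sensors/readings")
lines.print()

ssc.start()
ssc.awaitTermination()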

How to enable Postgis Query in Spark SQL

浪尽此生 submitted on 2021-02-18 19:09:58

Question: I have a PostgreSQL database with the PostGIS extension, so I can run queries like:

SELECT * FROM poi_table WHERE (ST_DistanceSphere(the_geom, ST_GeomFromText('POINT(121.37796 31.208297)', 4326)) < 6000)

And with Spark SQL, I can query the table from my Spark application (in Scala) like:

spark.sql("select the_geom from poi_table where the_geom is not null").show

The problem is that Spark SQL doesn't support the PostGIS extension. For example, when I query the table using the PostGIS function ST_DistanceSphere,
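Since Spark SQL cannot evaluate PostGIS functions itself, one standard workaround is to push the spatial predicate down to PostgreSQL through the JDBC data source, passing the PostGIS query as a dbtable subquery. A sketch, assuming an existing SparkSession named spark (as in the question); the JDBC URL, credentials, column alias, and subquery alias are placeholders:

// Postgres evaluates ST_DistanceSphere; Spark only receives the filtered rows.
// The geometry is returned as WKT text, since Spark has no geometry type.
val pushdownQuery =
  """(SELECT ST_AsText(the_geom) AS geom_wkt
     |   FROM poi_table
     |  WHERE ST_DistanceSphere(the_geom,
     |          ST_GeomFromText('POINT(121.37796 31.208297)', 4326)) < 6000) AS poi_nearby""".stripMargin

val nearby = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/gisdb")   // placeholder connection URL
  .option("dbtable", pushdownQuery)
  .option("user", "postgres")                                // placeholder credentials
  .option("password", "secret")
  .load()

nearby.show()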
