pyspark

pyspark 2.2.0 concept behind raw predictions field of logistic regression model

Submitted by 时间秒杀一切 on 2020-01-05 04:25:31

Question: I was trying to understand the concept of the output generated from a logistic regression model in Pyspark. Could anyone please explain how the rawPrediction field generated by a logistic regression model is calculated? Thanks.

Answer 1: In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation: The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
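
For context, a minimal sketch of that relationship, assuming a fitted binary pyspark.ml LogisticRegressionModel (the toy training data below is made up): for binary logistic regression the rawPrediction vector is [-m, m], where m is the margin coefficients · features + intercept, and the probability column is the sigmoid of that margin.

    import math
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame(
        [(1.0, Vectors.dense([0.0, 1.1])),
         (0.0, Vectors.dense([2.0, 1.0]))],
        ["label", "features"])
    lr_model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)

    row = lr_model.transform(train).select("features", "rawPrediction", "probability").first()
    # margin m = coefficients . features + intercept
    m = float(lr_model.coefficients.dot(row["features"])) + lr_model.intercept
    print(row["rawPrediction"])             # DenseVector([-m, m])
    print(1.0 / (1.0 + math.exp(-m)))       # matches probability[1]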

How to configure the log level of a specific logger using log4j in pyspark?

Submitted by 落花浮王杯 on 2020-01-05 04:19:05

Question: From this StackOverflow thread, I know how to obtain and use the log4j logger in pyspark like so:

    from pyspark import SparkContext
    sc = SparkContext()
    log4jLogger = sc._jvm.org.apache.log4j
    LOGGER = log4jLogger.LogManager.getLogger('MYLOGGER')
    LOGGER.info("pyspark script logger initialized")

This works fine with the spark-submit script. My question is how to modify the log4j.properties file to configure the log level for this particular logger, or how to configure it dynamically.

Answer 1: There
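
The answer is cut off above; as a sketch of the dynamic route (assuming the same py4j gateway objects as in the question), the logger's level can be set at runtime through the JVM gateway. The static alternative would be a log4j.logger.MYLOGGER=DEBUG line in the job's log4j.properties file, which is standard log4j 1.x syntax rather than something confirmed by the excerpt.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    log4j = sc._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger('MYLOGGER')
    # change the level for this logger only, without touching log4j.properties
    logger.setLevel(log4j.Level.DEBUG)      # DEBUG, INFO, WARN, ERROR, ...
    logger.debug("this message is now visible")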

Create a tuple out of two columns - PySpark

Submitted by 时间秒杀一切 on 2020-01-05 04:10:12

Question: My problem is based on the similar question here: PySpark: Add a new column with a tuple created from columns, with the difference that I have a list of values instead of one value per column. For example:

    from pyspark.sql import Row
    df = sqlContext.createDataFrame([
        Row(v1=[u'2.0', u'1.0', u'9.0'], v2=[u'9.0', u'7.0', u'2.0']),
        Row(v1=[u'4.0', u'8.0', u'9.0'], v2=[u'1.0', u'1.0', u'2.0'])])

    +---------------+---------------+
    |             v1|             v2|
    +---------------+---------------+
    |[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|
    |[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|
    +---------------+---------------+
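
The excerpt stops before any answer; one way to get element-wise pairs out of the two array columns is sketched below. The arrays_zip function needs Spark 2.4 or later, and the UDF variant (the zip_udf and pair names are my own) is an assumption about how it could be done on older versions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import arrays_zip, udf
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(['2.0', '1.0', '9.0'], ['9.0', '7.0', '2.0']),
         (['4.0', '8.0', '9.0'], ['1.0', '1.0', '2.0'])], ['v1', 'v2'])

    # Spark >= 2.4: built-in element-wise zip of the two array columns
    df.withColumn('vtuple', arrays_zip('v1', 'v2')).show(truncate=False)

    # Older Spark: the same result with a Python UDF returning an array of structs
    pair = StructType([StructField('v1', StringType()), StructField('v2', StringType())])
    zip_udf = udf(lambda a, b: list(zip(a, b)), ArrayType(pair))
    df.withColumn('vtuple', zip_udf('v1', 'v2')).show(truncate=False)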

XGBoost Spark One Model Per Worker Integration

Submitted by 走远了吗. on 2020-01-05 04:08:11

Question: Trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/6320440561800420/latest.html. Using Spark version 2.4.3 and xgboost 0.90. I keep getting the error ValueError: bad input shape () when trying to execute ...

    features = inputTrainingDF.select("features").collect()
    lables = inputTrainingDF.select("label").collect()
    X = np.asarray(map(lambda v: v[0].toArray(), features))
    Y = np
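
A guess at the likely culprit rather than something the excerpt confirms: in Python 3, map() returns a lazy iterator, so np.asarray(map(...)) produces a 0-d object array and scikit-style APIs then complain about a bad input shape. Materialising the rows first avoids that; the tiny DataFrame below is a stand-in for inputTrainingDF from the notebook.

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    # stand-in for inputTrainingDF from the notebook
    inputTrainingDF = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]), 0.0),
         (Vectors.dense([3.0, 4.0]), 1.0)],
        ["features", "label"])

    features = inputTrainingDF.select("features").collect()
    labels = inputTrainingDF.select("label").collect()

    X = np.asarray([row[0].toArray() for row in features])   # shape (n_rows, n_features)
    Y = np.asarray([row[0] for row in labels])                # shape (n_rows,)
    print(X.shape, Y.shape)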

spark repartition fall into single partition

Submitted by *爱你&永不变心* on 2020-01-05 04:01:52

Question: I am learning Spark, and when I tested the repartition() function in the pyspark shell with the following expression, I observed a very strange result: all elements fall into the same partition after repartition(). Here, I used glom() to learn about the partitioning within the RDD. I was expecting repartition() to shuffle the elements and randomly distribute them among partitions. This only happens when I repartition with a new number of partitions <= the original partitions. During my test, if I
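
The expression itself is cut off above, so the following is only a sketch of the kind of check being described (the RDD contents and partition counts are placeholders): glom() groups each partition's elements into a list, which makes the before/after distribution easy to inspect.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(10), 4)

    print(rdd.glom().collect())                           # elements grouped by original partition
    print(rdd.repartition(2).glom().collect())            # distribution after shuffling down to 2
    print(rdd.repartition(8).glom().map(len).collect())   # per-partition sizes after going wider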

jupyter notebook NameError: name 'sc' is not defined

Submitted by 蓝咒 on 2020-01-05 03:38:08

Question: I used a Jupyter notebook with pyspark, and my first command was:

    rdd = sc.parallelize([2, 3, 4])

Then it showed:

    NameError                                 Traceback (most recent call last)
    <ipython-input-1-c540c4a1d203> in <module>()
    ----> 1 rdd = sc.parallelize([2, 3, 4])

    NameError: name 'sc' is not defined

How do I fix this error that 'sc' is not defined?

Answer 1: Have you initialized the SparkContext? You could try this:

    # Initializing PySpark
    from pyspark import SparkContext, SparkConf

    # Spark Config
    conf = SparkConf()
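
The answer is truncated above; a common way to finish that initialization and get an sc handle inside Jupyter is sketched below (the app name and local[*] master are placeholders, not part of the original answer).

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("jupyter-app").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf=conf)

    rdd = sc.parallelize([2, 3, 4])
    print(rdd.collect())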

What is spark.python.worker.memory?

Submitted by 人盡茶涼 on 2020-01-04 23:07:10

Question: Could anyone give me a more precise description of this Spark parameter and how it affects program execution? I cannot tell exactly what this parameter does "under the hood" from the documentation.

Answer 1: The parameter influences the memory limit for Python workers. If the RSS of a Python worker process is larger than the memory limit, then it will spill data from memory to disk, which will reduce memory utilization but is generally an expensive operation. Note that this value applies per
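
For reference, a minimal sketch of setting the limit (the 512m value is only an example); the same key can be passed as --conf spark.python.worker.memory=512m to spark-submit instead.

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().set("spark.python.worker.memory", "512m")
    sc = SparkContext.getOrCreate(conf=conf)
    print(sc.getConf().get("spark.python.worker.memory"))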

Cassandra/Spark showing incorrect entries count for large table

Submitted by 风流意气都作罢 on 2020-01-04 09:23:31

Question: I am trying to use Spark to process a large Cassandra table (~402 million entries and 84 columns), but I am getting inconsistent results. Initially the requirement was to copy some columns from this table to another table. After copying the data, I noticed that some entries in the new table were missing. To verify that, I took a count of the large source table, but I am getting different values each time. I tried the queries on a smaller table (~7 million records) and the results were fine.
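
No answer is included in the excerpt. Purely as a sketch of how such a count is usually taken with the DataStax spark-cassandra-connector (the contact point, keyspace and table names are placeholders), and of where a stronger read consistency level would be configured if that turned out to be relevant:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-count")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             # read at a stronger consistency level than the connector's default LOCAL_ONE
             .config("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")
          .load())
    print(df.count())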

Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)

Submitted by 烂漫一生 on 2020-01-04 08:22:05

Question: With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as the source for Structured Streaming with pyspark.

kafka_app.py:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("TestKakfa").getOrCreate()
    kafka = spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:6667") \
        .option("subscribe", "mytopic").load()

I launched the app in the following way:

    ./bin/spark-submit kafka_app.py --master local[4] --jars spark-streaming-kafka-0-10
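
The excerpt ends mid-command, so the following is only a sketch of two common pitfalls rather than the accepted answer: options placed after the script name are handed to the application instead of spark-submit, and Structured Streaming needs the spark-sql-kafka package rather than the DStream-oriented spark-streaming-kafka one. The Maven coordinates below assume Spark 2.1 on Scala 2.11, and spark.jars.packages set in code only takes effect when the script itself starts the JVM (plain python kafka_app.py); under spark-submit the --packages flag is the reliable route.

    # e.g.  ./bin/spark-submit --master local[4] \
    #           --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 kafka_app.py
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("TestKafka")
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0")
             .getOrCreate())

    kafka = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:6667")
             .option("subscribe", "mytopic")
             .load())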