pyspark

Specify options for the jvm launched by pyspark

夙愿已清 submitted on 2019-12-20 03:01:10

Question: How/where do I specify the JVM options that the pyspark script uses when launching the JVM it connects to? I am specifically interested in passing JVM debugging options, e.g. -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005. Thanks.

Answer 1: pyspark uses the bin/spark-class script to start the client that you see running in your terminal/console. You can simply append whatever options you need to JAVA_OPTS:

JAVA_OPTS="$JAVA_OPTS -Xmx=2g -Xms=1g -agentlib:jdwp=transport=dt
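As a minimal sketch only (this is not from the quoted answer), the same JDWP agent flag can also be passed through the standard spark.driver.extraJavaOptions setting; this only affects the driver JVM if it has not started yet (e.g. a plain python script rather than an already-running pyspark shell), and the application name below is made up:

from pyspark.sql import SparkSession

# Attach the JDWP debug agent from the question to the driver JVM.
spark = (
    SparkSession.builder
    .appName("jdwp-debug-example")  # hypothetical app name
    .config("spark.driver.extraJavaOptions",
            "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
    .getOrCreate()
)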

Structured Streaming error py4j.protocol.Py4JNetworkError: Answer from Java side is empty

北城余情 submitted on 2019-12-20 02:57:19

Question: I'm trying to make a left outer join between two Kafka streams using PySpark and Structured Streaming (Spark 2.3).

import os
import time
from pyspark.sql.types import *
from pyspark.sql.functions import from_json, col, struct, explode, get_json_object
from ast import literal_eval
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell'

spark = SparkSession \
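The excerpt stops at the SparkSession builder, so as a sketch only: Spark 2.3 supports stream-stream left outer joins provided both sides carry watermarks and the join condition constrains event time. The broker address, topic names, and columns below are assumptions, not taken from the post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-left-outer-join").getOrCreate()

left = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
        .option("subscribe", "left_topic")                    # hypothetical topic
        .load()
        .selectExpr("CAST(key AS STRING) AS l_key", "timestamp AS l_ts")
        .withWatermark("l_ts", "10 minutes"))

right = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "right_topic")
         .load()
         .selectExpr("CAST(key AS STRING) AS r_key", "timestamp AS r_ts")
         .withWatermark("r_ts", "10 minutes"))

# Unmatched left-side rows are emitted with NULLs once the watermark has passed.
joined = left.join(
    right,
    expr("l_key = r_key AND r_ts BETWEEN l_ts AND l_ts + interval 1 hour"),
    "leftOuter",
)

query = joined.writeStream.format("console").start()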

Working with jdbc jar in pyspark

偶尔善良 submitted on 2019-12-20 02:36:05

Question: I need to read from a PostgreSQL database in pyspark. I know this has been asked before, such as here, here and in many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually. I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars
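For context, a minimal sketch of the JDBC read itself once pyspark has been launched with the PostgreSQL driver on the classpath, using the spark session the pyspark shell provides; the URL, table, and credentials are made up, only the option names are standard Spark JDBC options:

# Hypothetical connection details.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "org.postgresql.Driver")
      .load())

df.printSchema()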

Parsing a csv file in Pyspark using Spark inbuilt functions or methods

折月煮酒 submitted on 2019-12-20 02:35:39

Question: I am using Spark version 2.3 and working on a POC in which I have to load a bunch of csv files into a Spark dataframe. Consider the csv below as a sample which I need to parse and load into a dataframe. The given csv has multiple bad records which need to be identified.

id,name,age,loaded_date,sex
1,ABC,32,2019-09-11,M
2,,33,2019-09-11,M
3,XYZ,35,2019-08-11,M
4,PQR,32,2019-30-10,M #invalid date
5,EFG,32, #missing other column details
6,DEF,32,2019/09/11,M #invalid date format
7,XYZ,32,2017
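A minimal sketch, not the accepted answer, of one common way to surface bad rows: read with an explicit schema in PERMISSIVE mode and capture records that fail to parse in a corrupt-record column. The file path is made up:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("loaded_date", DateType(), True),
    StructField("sex", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # malformed rows land here
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("dateFormat", "yyyy-MM-dd")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/path/to/sample.csv"))  # hypothetical path

df = df.cache()  # avoids Spark's restriction on querying only the corrupt-record column
bad_rows = df.filter(df["_corrupt_record"].isNotNull())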

Add new rows to pyspark Dataframe

时光怂恿深爱的人放手 submitted on 2019-12-20 01:40:44

Question: I am very new to pyspark but familiar with pandas. I have a pyspark DataFrame:

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
    (1, 2, 0),
    (2, 0, 1)
]

# create DataFrame
df = spark.createDataFrame(vals, columns)

I want to add a new row (4, 5, 7) so that df.show() outputs:

+---+----+----+
| id|dogs|cats|
+---+----+----+
|  1|   2|   0|
|  2|   0|   1|
|  4|   5|   7|
+---+----+----+

Answer 1: As thebluephantom has already said, union is the way
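A minimal sketch of the union approach the answer begins to describe, reusing the columns list from the question:

# Build a one-row DataFrame with the same columns and append it via union.
new_row = spark.createDataFrame([(4, 5, 7)], columns)
df = df.union(new_row)
df.show()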

How to filter dstream using transform operation and external RDD?

大憨熊 submitted on 2019-12-20 01:04:22

Question: I used the transform method in a use case similar to the one described in the Transform Operation section of Transformations on DStreams:

spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))

My code is as follows:

sc = SparkContext("local[4]", "myapp")
ssc = StreamingContext(sc, 5)
ssc.checkpoint('hdfs://localhost:9000/user/spark/checkpoint/')
lines =
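The excerpt stops before the DStream is built, so purely as a sketch under assumed names (a socket source on port 9999 and an external RDD of spam keys), one way to filter a DStream against an external RDD inside transform() is:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[4]", "filter-sketch")
ssc = StreamingContext(sc, 5)

# External RDD of spam keys, paired with a dummy value so it can be joined.
spam_rdd = sc.parallelize([("spamword1", True), ("spamword2", True)])

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# leftOuterJoin keeps every word; the right side is None for words not in spam_rdd.
cleaned = word_counts.transform(
    lambda rdd: (rdd.leftOuterJoin(spam_rdd)
                    .filter(lambda kv: kv[1][1] is None)
                    .map(lambda kv: (kv[0], kv[1][0])))
)
cleaned.pprint()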

maxCategories not working as expected in VectorIndexer when using RandomForestClassifier in pyspark.ml

亡梦爱人 submitted on 2019-12-19 19:50:06

Question: Background: I'm doing a simple binary classification using RandomForestClassifier from pyspark.ml. Before feeding the data to training, I used VectorIndexer to decide whether features should be treated as numerical or categorical by providing the maxCategories argument.

Problem: Even though I used VectorIndexer with maxCategories set to 30, I was still getting an error during the training pipeline:

An error occurred while calling o15371.fit. : java.lang.IllegalArgumentException:
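For reference only, a minimal sketch of the pipeline the question describes (VectorIndexer followed by RandomForestClassifier); the column names and the maxBins value are assumptions, not taken from the post:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.classification import RandomForestClassifier

# Features with more than 30 distinct values are treated as continuous.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=30)

# maxBins must be at least as large as the largest number of categories in any categorical feature.
rf = RandomForestClassifier(labelCol="label", featuresCol="indexedFeatures", maxBins=32)

pipeline = Pipeline(stages=[indexer, rf])
# model = pipeline.fit(train_df)  # train_df is assumed to have 'features' and 'label' columns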

Scan a Hadoop Database table in Spark using indices from an RDD

烈酒焚心 submitted on 2019-12-19 12:51:16

Question: So if there is a table in the database, shown below:

Key2   DateTimeAge
AAA1   XXX XXX XXX
AAA2   XXX XXX XXX
AAA3   XXX XXX XXX
AAA4   XXX XXX XXX
AAA5   XXX XXX XXX
AAA6   XXX XXX XXX
AAA7   XXX XXX XXX
AAA8   XXX XXX XXX
BBB1   XXX XXX XXX
BBB2   XXX XXX XXX
BBB3   XXX XXX XXX
BBB4   XXX XXX XXX
BBB5   XXX XXX XXX
CCC1   XXX XXX XXX
CCC2   XXX XXX XXX
CCC3   XXX XXX XXX
CCC4   XXX XXX XXX
CCC5   XXX XXX XXX
CCC6   XXX XXX XXX
CCC7   XXX XXX XXX
DDD1   XXX XXX XXX
DDD2   XXX XXX XXX
DDD3   XXX XXX XXX
DDD4   XXX XXX XXX
DDD5   XXX XXX XXX
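The excerpt ends before the question's own code, so only as an illustration under assumed names (an index_rdd holding the row keys of interest and a table_df already loaded from the database), one generic way to restrict the scan to those keys is a join on the key column:

# index_rdd might be, e.g., sc.parallelize(["AAA1", "BBB3", "CCC7"]).
keys_df = index_rdd.map(lambda k: (k,)).toDF(["Key2"])

# Keep only the rows of table_df whose Key2 appears in the index RDD.
subset_df = table_df.join(keys_df, on="Key2", how="inner")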

How can I use a function in dataframe withColumn function in Pyspark?

老子叫甜甜 submitted on 2019-12-19 12:09:28

Question: I have some dictionaries and a function defined:

dict_TEMPERATURE = {(0, 70): 'Low', (70.01, 73.99): 'Normal-Low', (74, 76): 'Normal', (76.01, 80): 'Normal-High', (80.01, 300): 'High'}
...
hierarchy_dict = {'TEMP': dict_TEMPERATURE, 'PRESS': dict_PRESSURE, 'SH_SP': dict_SHAFT_SPEED, 'POI': dict_POI, 'TRIG': dict_TRIGGER}

def function_definition(valor, atributo):
    dict_atributo = hierarchy_dict[atributo]
    valor_generalizado = None
    if isinstance(valor, (int, long, float, complex)):
        for key,
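The snippet is cut short, but as a minimal sketch of the pattern the title asks about (calling a plain Python function from withColumn), the function can be wrapped in a udf; the 'TEMP' column and the output column name are assumptions:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap the question's function for a fixed attribute so Spark can apply it row by row.
generalize_temp = udf(lambda v: function_definition(v, 'TEMP'), StringType())

df = df.withColumn("TEMP_generalized", generalize_temp(df["TEMP"]))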
