pyspark

Ambiguous behavior while adding new column to StructType

冷暖自知 submitted on 2020-01-06 09:54:47
Question: I defined a function in PySpark:

def add_ids(X):
    schema_new = X.schema.add("id_col", LongType(), False)
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
    cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
    return _X.select(*cols_arranged)

In the function above, I'm creating a new column (named id_col) that gets appended to the dataframe; it is basically just the index number of each row, and the function finally moves the id_col to the front of the dataframe.
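For reference, a minimal sketch of how this function can be exercised on a toy DataFrame (the sample data and column names below are illustrative assumptions, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

def add_ids(X):
    # Append an id_col field to the schema and zip every row with its global index.
    schema_new = X.schema.add("id_col", LongType(), False)
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
    # Reorder the columns so the newly added id_col comes first.
    cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
    return _X.select(*cols_arranged)

# Hypothetical sample data, just to show the call.
df = spark.createDataFrame([("a", 10), ("b", 20), ("c", 30)], ["letter", "value"])
add_ids(df).show()

Note that StructType.add appends the new field to the existing schema object in place and returns that same object, which is worth keeping in mind if the function is called more than once on the same DataFrame.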

java.io.IOException: Cannot run program “python”: CreateProcess error=2, The system cannot find the file specified

不打扰是莪最后的温柔 submitted on 2020-01-06 08:23:56
Question: I configured Eclipse with PySpark and am using the latest versions of Spark and Python. When I try to write some code and run it, I get the error below:

java.io.IOException: Cannot run program "python": CreateProcess error=2, The system cannot find the file specified

The code I have written is:

'''
Created on 23-Dec-2017
@author: lenovo
'''
from pyspark import SparkContext, SparkConf
from builtins import int
# from org.spark.com.PySparkDemo import data
from pyspark.sql import Row
from pyspark.sql.context
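The post is truncated before any answer appears. A common remedy for this particular error (offered only as a hedged sketch, not as the thread's accepted fix) is to point Spark at an explicit Python interpreter before the SparkContext is created, for example via PYSPARK_PYTHON; the interpreter path below is an assumption:

import os

# Assumed interpreter location; adjust to wherever python.exe actually lives.
os.environ["PYSPARK_PYTHON"] = r"C:\Python36\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python36\python.exe"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PySparkDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(5)).collect())
sc.stop()

Adding the Python installation directory to the Windows PATH, so that "python" resolves from any shell, achieves the same effect.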

SPARK 2.2.2 - Joining multiple RDDs giving out of memory exception. Resulting RDD has 124 columns. What should be the optimal joining method?

早过忘川 submitted on 2020-01-06 07:13:59
Question: I have a file which has multiple values for each phone number. For example:

phone_no  circle  operator  priority1  attribute1  attribute2  attribute3  priority2  attribute1  attribute2  attribute3
123445    delhi   airtel    1.0        info1       info2       info3       1.1        info4       info5       info6
987654    bhopal  idea      1.1        info1       info2       info3       1.4        info4       info5       info6
123445    delhi   airtel    1.3        info1       info2       info3       1.0        info4       info5       info6

My expected output is: for each phone number, select the minimum P1 and its corresponding attribute values. As my
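The question is cut off before an answer appears. Purely as a hedged illustration (column names guessed from the sample rows above), one DataFrame-based way to keep, for each phone number, the row with the minimum priority1 is a window partitioned by phone_no:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical subset of the columns shown in the sample data.
df = spark.createDataFrame(
    [("123445", "delhi", "airtel", 1.0, "info1"),
     ("987654", "bhopal", "idea", 1.1, "info1"),
     ("123445", "delhi", "airtel", 1.3, "info1")],
    ["phone_no", "circle", "operator", "priority1", "attribute1"],
)

# Rank rows per phone number by ascending priority1 and keep only the best one.
w = Window.partitionBy("phone_no").orderBy(F.col("priority1").asc())
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
result.show()

Compared with joining many RDDs, a single window or groupBy over one wide DataFrame usually puts far less pressure on memory.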

How to check the Boolean condition from the another Dataframe

僤鯓⒐⒋嵵緔 submitted on 2020-01-06 07:07:29
Question: I have three DataFrames: the first is the base df, the second is the behavior df, and the third is the rule df.

Base df:

+---+----+------+
| ID|Name|Salary|
+---+----+------+
|  1|   A|   100|
|  2|   B|   200|
|  3|   C|   300|
|  4|   D|  1000|
|  5|   E|   500|
+---+----+------+

Behavior DF:

+----+---------+------+
|S.NO|Operation|Points|
+----+---------+------+
|   1|  a AND b|   100|
|   2|   a OR b|   200|
|   3|otherwise|     0|
+----+---------+------+

Rule DF:

+----+-----+------+------------+-----+
|RULE|Table|   col|   operation|value|
+----+-----+------+--------
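The rule table is truncated above, so the exact rule format is unknown. Purely as a hedged sketch of one common pattern, the rule rows can be collected to the driver, turned into SQL expression strings, and evaluated against the base DataFrame with expr(); every condition and rule name below is an assumption made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

base_df = spark.createDataFrame(
    [(1, "A", 100), (2, "B", 200), (3, "C", 300), (4, "D", 1000), (5, "E", 500)],
    ["ID", "Name", "Salary"],
)

# Hypothetical named rules, standing in for rows collected from the rule DF.
rules = [("a", "Salary > 300"), ("b", "Salary < 900")]

# Evaluate each named rule as its own boolean column.
for name, condition in rules:
    base_df = base_df.withColumn(name, F.expr(condition))

# Combine the boolean columns per the behavior DF (a AND b -> 100, a OR b -> 200, otherwise 0).
base_df = base_df.withColumn(
    "Points",
    F.when(F.col("a") & F.col("b"), 100)
     .when(F.col("a") | F.col("b"), 200)
     .otherwise(0),
)
base_df.show()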

Nested dynamic schema not working while parsing JSON using pyspark

折月煮酒 submitted on 2020-01-06 06:55:15
Question: I am trying to extract certain parameters from a nested JSON (with a dynamic schema) and generate a Spark dataframe using PySpark. My code works perfectly for level-1 (key: value) pairs, but it fails to get independent columns for each (key: value) pair that is part of the nested JSON.

JSON schema sample (note: this is not the exact schema; it is only meant to give an idea of the nested nature of the schema):

{
  "tweet": {
    "text": "RT @author original message",
    "user": {
      "screen_name": "Retweeter"
    },
    "retweeted_status": {
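The post is cut off at this point. As a hedged sketch only (the field names come from the fragment above; everything else, including the retweeted_status content, is assumed), nested keys can usually be pulled into independent columns by addressing them with dotted paths once spark.read.json has inferred the schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Minimal JSON resembling the truncated sample above.
sample = ['{"tweet": {"text": "RT @author original message", '
          '"user": {"screen_name": "Retweeter"}, '
          '"retweeted_status": {"user": {"screen_name": "author"}}}}']

df = spark.read.json(spark.sparkContext.parallelize(sample))

# Flatten the nested (key: value) pairs into top-level columns via dotted paths.
flat = df.select(
    F.col("tweet.text").alias("text"),
    F.col("tweet.user.screen_name").alias("retweeter"),
    F.col("tweet.retweeted_status.user.screen_name").alias("original_author"),
)
flat.show(truncate=False)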

Spark 2.4.0 dependencies to write to AWS Redshift

◇◆丶佛笑我妖孽 submitted on 2020-01-06 06:55:00
Question: I'm struggling to find the correct package dependencies, and their relative versions, for writing to a Redshift DB with a PySpark micro-batch approach. What are the correct dependencies to achieve this goal?

Answer 1: As suggested in the AWS tutorial, it is necessary to provide a JDBC driver:

wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC4-no-awssdk-1.2.20.1043.jar

After this jar has been downloaded and made available to the spark-submit command, this is how I provided
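The answer is truncated here. A hedged sketch of the kind of invocation it appears to be describing (the endpoint, table, credentials, and driver class name below are assumptions, not details from the thread) is to hand the downloaded jar to spark-submit and write through the plain JDBC data source:

# Launched with something along the lines of (assumed):
#   spark-submit --jars RedshiftJDBC4-no-awssdk-1.2.20.1043.jar my_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-write").getOrCreate()

df = spark.createDataFrame([(1, "example")], ["id", "value"])

# Hypothetical Redshift endpoint, table, and credentials.
(df.write
   .format("jdbc")
   .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev")
   .option("driver", "com.amazon.redshift.jdbc4.Driver")
   .option("dbtable", "public.my_table")
   .option("user", "my_user")
   .option("password", "my_password")
   .mode("append")
   .save())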

ERROR:SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063

六眼飞鱼酱① submitted on 2020-01-06 06:43:41
Question: I am presently working with an ASN.1 decoder. I will be getting a hexadecimal code from a producer and collecting it in a consumer. After that I convert the hex code to an RDD and then pass the hex-value RDD to another function within the same class, Decode_Module, which uses the Python asn1 decoder to decode the hex data, return it, and print it. I don't understand what's wrong with my code; I have already installed my asn1 parser dependencies on the worker nodes too. Anything wrong with
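The code itself is cut off in the excerpt. As a hedged sketch of the usual cause of this error (class and method names below are assumptions), it typically appears when an object that holds the SparkContext, for example self of a class storing sc, is captured inside a transformation; keeping the per-record decode logic in a plain top-level function avoids shipping the context to the workers:

from pyspark import SparkConf, SparkContext

def decode_hex(hex_string):
    # Placeholder for the real ASN.1 decoding; it uses nothing from the driver-side context.
    return bytes.fromhex(hex_string)

class DecodeModule(object):
    def __init__(self, sc):
        self.sc = sc  # The SparkContext lives only on the driver.

    def run(self, hex_values):
        rdd = self.sc.parallelize(hex_values)
        # Pass a top-level function to map(); referencing self here would drag
        # self.sc (the SparkContext) into the closure and trigger SPARK-5063.
        return rdd.map(decode_hex).collect()

if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("asn1-decode").setMaster("local[*]"))
    print(DecodeModule(sc).run(["48656c6c6f", "576f726c64"]))
    sc.stop()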
