pyspark

cannot start spark history server

点点圈 submitted on 2020-12-03 07:49:43
Question: I am running Spark on a YARN cluster. I tried to start the history server with ./start-history-server.sh but got the following errors:

starting org.apache.spark.deploy.history.HistoryServer, logging to /home/abc/spark/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark-abc-org.apache.spark.deploy.history.HistoryServer-1-abc-Efg.out
failed to launch org.apache.spark.deploy.history.HistoryServer:
  at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:47)
  ... 6 more

full log in
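A common cause of FsHistoryProvider failing in its constructor is an event-log directory that is missing or not configured. A minimal configuration sketch, assuming an example path of hdfs:///spark-history (substitute your own, and verify this matches the actual exception in the full log):

# conf/spark-defaults.conf (example values, adjust for your cluster)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-history
spark.history.fs.logDirectory    hdfs:///spark-history

# create the directory first, then restart the history server
hdfs dfs -mkdir -p /spark-history
./sbin/start-history-server.sh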

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

陌路散爱 submitted on 2020-12-03 07:37:16
Question: I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I would like to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:

from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2
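Spark 3.0 switched from SimpleDateFormat to java.time's DateTimeFormatter, which rejects several legacy patterns. One commonly used workaround is to fall back to the legacy parser; a minimal sketch, assuming the spark.sql.legacy.timeParserPolicy setting available in Spark 3.0+ covers this pattern on your build:

from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime

# Fall back to the pre-3.0 (SimpleDateFormat) parser for legacy patterns.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select(
    'date_str',
    to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'),
)
df2.show()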

Manually calling spark's garbage collection from pyspark

空扰寡人 submitted on 2020-12-02 06:28:34
Question: I have been running a workflow on roughly 3 million records x 15 columns, all strings, on my 4-core, 16 GB machine using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get out-of-memory exceptions. Since all my caches sum up to about 1 GB, I thought the problem lies in garbage collection. I was able to run the Python garbage collector manually by calling:

import gc
collected = gc.collect()
print "Garbage
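Note that gc.collect() only frees Python objects on the driver; cached Spark data lives in the JVM. A hedged sketch of the usual options, using the internal _jvm py4j handle (a private API, so treat it as best-effort) together with explicitly releasing cached data:

import gc

# Free Python-side objects on the driver.
collected = gc.collect()
print("Garbage collector: collected %d objects." % collected)

# Sketch: ask the JVM to garbage-collect via the py4j gateway.
# sc._jvm is an internal handle and System.gc() is only a hint to the JVM.
sc._jvm.System.gc()

# Often the more reliable fix is to release cached data explicitly:
# some_df.unpersist()        # hypothetical DataFrame/RDD you no longer need
# sqlContext.clearCache()    # drops every cached table/DataFrame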

Select columns in Pyspark Dataframe

寵の児 submitted on 2020-11-30 06:15:18
Question: I am looking for a way to select columns of my dataframe in PySpark. For the first row, I know I can use df.first(), but I am not sure how to do the same for columns, given that they do not have column names. I have 5 columns and want to loop through each one of them.

+--+---+---+---+---+---+---+
|_1| _2| _3| _4| _5| _6| _7|
+--+---+---+---+---+---+---+
|1 |0.0|0.0|0.0|1.0|0.0|0.0|
|2 |1.0|0.0|0.0|0.0|0.0|0.0|
|3 |0.0|0.0|1.0|0.0|0.0|0.0|

Answer 1: Try something like this:

df.select([c for c in df.columns if c in ['_2',
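Completing the idea with a hedged sketch: df.columns returns the auto-generated names (_1, _2, ...), so you can filter on them and then loop column by column. The wanted list and the per-column aggregate below are only illustrative:

from pyspark.sql import functions as F

# Example: keep only the value columns, dropping the id column _1.
wanted = ['_2', '_3', '_4', '_5', '_6', '_7']
df2 = df.select([c for c in df.columns if c in wanted])

# Loop over each selected column, e.g. to inspect or aggregate it.
for c in df2.columns:
    df2.select(F.sum(F.col(c)).alias(c + '_sum')).show()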

Get data from nested json in kafka stream pyspark

梦想的初衷 submitted on 2020-11-29 23:59:53
Question: I have a Kafka producer sending large amounts of data in the format

{
  '1000': { '3': { 'seq': '1', 'state': '2', 'CMD': 'XOR' } },
  '1001': { '5': { 'seq': '2', 'state': '2', 'CMD': 'OR' } },
  '1003': { '5': { 'seq': '3', 'state': '4', 'CMD': 'XOR' } }
}
....

The data I want is in the innermost object: {'seq': '1', 'state': '2', 'CMD': 'XOR'}, and the keys at the levels above ('1000' and '3') are variable. Please note that the above values are only an example; the original dataset is huge with lots
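Since the outer keys are variable, one common approach is to parse the value with a map-of-maps schema and explode twice. A minimal sketch, assuming a DataFrame kafka_df with the Kafka value already cast to a string column json_str (kafka_df and json_str are placeholder names, not from the original question):

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, StructType, StructField

# Fixed innermost structure.
inner = StructType([
    StructField("seq", StringType()),
    StructField("state", StringType()),
    StructField("CMD", StringType()),
])
# Variable outer keys ('1000', '3', ...) are modelled as map keys.
schema = MapType(StringType(), MapType(StringType(), inner))

parsed = kafka_df.select(F.from_json("json_str", schema).alias("data"))
level1 = parsed.select(F.explode("data").alias("outer_key", "inner_map"))
result = level1.select(
    "outer_key",
    F.explode("inner_map").alias("inner_key", "payload"),
).select("outer_key", "inner_key", "payload.seq", "payload.state", "payload.CMD")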
